DATE 2007 ABSTRACTS
-
Challenges of Digital Consumer and Mobile SOC's: More Moore Possible? [p. 1]
-
T. Furuyama
Digital consumer and mobile products have continuously accommodated more features and functions.
For example, recent high-end cellular phones can operate as terrestrial digital TV viewers, MP3 music players,
digital cameras, substitutes for credit cards and much more, in addition to multi-modal wireless communication
terminals that handle various formats: GSM, 3G, BT, WiFi and so on. These products require highly
integrated SoCs and sophisticated software stacks to be combined optimally and in a timely manner. It is essential to establish a
hardware/software co-development/verification environment with an ESL design methodology and an IP reuse
platform where various functions are realised on an SoC by legacy sub-systems with a low-power multi-processor
architecture. This challenge gets more complicated in deep sub-100 nm technology nodes. Approaches to these
complex problems from different aspects will be presented.
-
Was Darwin Wrong? Has Design Evolution Stopped at the RTL Level...or Will Software and Custom Processors (or System-Level Design) Extend Moore's Law? [p. 2]
-
A. Naumann
The challenges of electronic design are escalating as software and embedded processors are fast
becoming a more dominant component of electronic products. Software is now acknowledged as the most effective
way for electronics companies to differentiate their products. But what if the processors running the software aren't
up to the task? Electronics companies are increasingly adopting a new system-level design methodology to stay
competitive, one that enables design that is centred on custom processors and software. The ripple effects of system-level
design are even affecting the way that semiconductor companies take products to market and how their
customers choose and use silicon.
Moderators: G. De Micheli, EPF Lausanne, CH, P. van der Wolf, NXP Semiconductors Research, NL
-
ATLAS: A Chip-Multiprocessor with Transactional Memory Support [p. 3]
-
N. Njoroge, J. Casper, S. Wee, Y. Teslyar, D. Ge, C. Kozyrakis and K. Olukotun
Chip-multiprocessors are quickly becoming popular in
embedded systems. However, the practical success of CMPs
strongly depends on addressing the difficulty of multithreaded
application development for such systems. Transactional
Memory (TM) promises to simplify concurrency
management in multithreaded applications by allowing programmers
to specify coarse-grain parallel tasks, while
achieving performance comparable to fine-grain lock-based
applications.
This paper presents ATLAS, the first prototype of a
CMP with hardware support for transactional memory. ATLAS
includes 8 embedded PowerPC cores that access coherent
shared memory in a transactional manner. The data
cache for each core is modified to support the speculative
buffering and conflict detection necessary for transactional
execution. We have mapped ATLAS to the BEE2 multi-FPGA
board to create a full-system prototype that operates
at 100MHz, boots Linux, and provides significant performance
and ease-of-use benefits for a range of parallel applications.
Overall, the ATLAS prototype provides an excellent
framework for further research on the software and
hardware techniques necessary to deliver on the potential
of transactional memory.
-
A Dynamically Adaptive DSP for Heterogeneous Reconfigurable Platforms [p. 9]
-
F. Campi, A. Deledda, M. Pizzotti, L. Ciccarelli, P. Rolandi, C. Mucci, A. Lodi, A. Vitkovski
and L. Vanzolini
This paper describes a digital signal processor based on
a multi-context, dynamically reconfigurable datapath, suitable
for inclusion as an IP-block in complex SoC design
projects. The IP was realized in CMOS 090 nm technology.
The most relevant features offered by the proposed
architecture with respect to state of the art are zero overhead
for switching between successive configurations, relevant
area and energy computational density on computational
kernels (average of 2 GOPS/mm², 0.2 GOPS/mW)
and relatively small area occupation (18 mm²), making it
suitable for acceleration or upgrade of multi-core heterogeneous
embedded platforms. The processor is delivered with
a software tool chain providing the application developer
algorithmic analysis and design space exploration based on
ANSI C, with no utilization of hardware-related constructs
or description languages.
-
An 0.9 X 1.2", Low Power, Energy-Harvesting System with Custom Multi-Channel
Communication Interface [p. 15]
-
P. Stanley-Marbell and D. Marculescu
Presented is a self-powered computing system, Sunflower,
that uses a novel combination of a PIN photodiode array,
switching regulators, and a supercapacitor, to provide
a small footprint renewable energy source. The design
provides software-controlled power-adaptation facilities,
for both the main processor and its peripherals. The system's
power consumption is characterized, and its energy-scavenging
efficiency is quantified with field measurements
under a variety of weather conditions.
-
An FPGA Based All-Digital Transmitter with Radio Frequency Output for Software Defined Radio [p. 21]
-
Z. Ye, J. Grospietsch, G. Memik
In this paper, we present the architecture and implementation of
an all-digital transmitter with radio frequency output targeting an
FPGA device. FPGA devices have been widely adopted in the
applications of digital signal processing (DSP) and digital
communication. They are typically well suited for the evolving
technology of software defined radios (SDR) due to their
reconfigurability and programmability. However, FPGA devices
are mostly used to implement digital baseband and intermediate
frequency (IF) functionalities. Therefore, significant analog and
RF components are still needed to fulfill the radio communication
requirements. The all-digital transmitter presented in this paper
directly synthesizes the RF signal in the digital domain and therefore
eliminates the need for most of the analog and RF components.
The all-digital transmitter consists of one QAM modulator and
one RF pulse width modulator (RFPWM). The binary output
waveform from RFPWM is centered at 800MHz with 64QAM
signaling format. The entire transmitter is implemented using
a Xilinx Virtex2pro device with an on-chip multi-gigabit transceiver
(MGT). The adjacent channel leakage ratio (ACLR) measured in
the 20 MHz passband is 45dB, and the measured error vector
magnitude (EVM) is less than 1%. Our work extends the digital
implementation of communication applications on an FPGA
platform to radio frequency, therefore making a significant
evolution towards an ideal SDR.
Moderators: S. Kundu, Massachusetts U, US, H.-J. Wunderlich, Stuttgart U, DE
-
A Non-Intrusive Isolation Approach for Soft Cores [p. 27]
-
O. Sinanoglu and T. Petrov
Cost effective SOC test strongly hinges on parallel, independent test
of SOC cores, which can only be ensured through proper core isolation
techniques. While a core isolation mechanism can provide controllability
and observability at the core I/O interface, its implementation
may have various implications on area, functional timing, test
time and data volume, and at-speed coverage on the core interface. In
this paper, we propose a non-intrusive core isolation technique that
is based on the utilization of existing core registers for isolating the
core. We provide a core register partitioning algorithm that is capable
of identifying the core interface registers, and of robustly isolating
a core, resulting in a computationally efficient core isolation implementation
that is area and performance efficient at the same time. The
proposed isolation technique also ensures minimal test time increase
and no at-speed coverage loss on the core interface, offering an elegant
solution for soft cores, and thus enabling significant SOC test
cost reductions.
-
Unknown Blocking Scheme for Low Control Data Volume and High Observability [p. 33]
-
S. Wang, W. Wei, S.T. Chakradhar
This paper presents a new blocking logic to block unknowns
for temporal compactors. The proposed blocking
logic can reduce data volume required to control the blocking
logic and also increase the number of scan cells that are observed
by the temporal compactors. Control patterns, which
describe values required at the control signals of the blocking
logic, are compressed by LFSR reseeding. In this paper,
the blocking logic gates for some groups of scan chains that
do not capture unknowns are bypassed. Since all the scan
cells in these scan chain groups are observed without specifying
the corresponding bits in control patterns, fewer specified
bits are required and more scan cells are observed. The
seed size is further reduced by reducing numbers of specified
bits in the densely specified control patterns. The proposed
method can always achieve the same fault coverage that can
be achieved by direct observation of scan chains. Experiments
with large industrial designs clearly demonstrate that
the proposed method is scalable to large circuits. Hardware
overhead for the proposed blocking logic is very low.
-
Test Cost Reduction for SoC Using a Combined Approach to Test Data Compression and
Test Scheduling [p. 39]
-
Q. Zhou and K.J. Balakrishnan
A combined approach for implementing system level test compression
and core test scheduling to reduce SoC test costs is proposed
in this paper. A broadcast scan based test compression algorithm
for parallel testing of cores with multiple scan chains is used
to reduce the test data of the SoC. Unlike other test compression
schemes, the proposed algorithm doesn't require specialized test
generation or fault simulation and is applicable with intellectual
property (IP) cores. The core testing schedule with compression
enabled is decided using a generalized strip packing algorithm.
The hardware architecture to implement the proposed scheme is
very simple. By using the combined approach, the total test data
volume and test application time of the SoC is reduced to a level
comparable with the test data volume and test application time of
the largest core in the SoC.
-
High-Level Test Synthesis for Delay Fault Testability [p. 45]
-
S.-J. Wang and T.-H. Yeh
A high-level test synthesis (HLTS) method targeted for
delay fault testability is presented. The proposed method,
when combined with hierarchical test pattern generation
for embedded modules, guarantees 100% delay test
coverage for detectable faults in modules. A study on the
delay testability problem at the behavior level shows that low
delay fault coverage is usually attributed to the fact that
two-pattern tests for delay testing cannot be delivered to
modules under test in consecutive cycles. To solve the
problem, we propose an HLTS method that ensures valid
test pairs can be sent to each module through synthesized
circuit hierarchy. Experimental results show that this
method achieves 100% fault coverage for transition faults
in functional units, while the fault coverage in circuits
synthesized by an LEA-based allocation algorithm is rather
poor. The area overhead due to this method ranges from
2% to 10% for 16-bit datapaths.
Moderators: J. Teich, Erlangen-Nuremberg U, DE, M. Heijligers, NXP IC-Lab, NL
-
Bus Access Optimisation for FlexRay-based Distributed Embedded Systems [p. 51]
-
T. Pop, P. Pop, P. Eles and Z. Peng
FlexRay will very likely become the de-facto standard for in-vehicle
communications. Its main advantage is the combination of high speed
static and dynamic transmission of messages. In our previous work we
have shown that not only the static but also the dynamic segment can
be used for hard real-time communication in a deterministic manner. In
this paper, we propose techniques for optimising the FlexRay bus access
mechanism of a distributed system, so that the hard real-time
deadlines are met for all the tasks and messages in the system. We have
evaluated the proposed techniques using extensive experiments.
-
A Decomposition-based Constraint Optimization Approach for Statically Scheduling Task
Graphs with Communication Delays to Multiprocessors [p. 57]
-
N. Satish, K. Ravindran and K. Keutzer
We present a decomposition strategy to speed up constraint optimization
for a representative multiprocessor scheduling problem.
In the manner of Benders decomposition, our technique solves relaxed
versions of the problem and iteratively learns constraints to
prune the solution space. Typical formulations suffer prohibitive
run times even on medium-sized problems with fewer than 30 tasks.
Our decomposition strategy enhances constraint optimization to
robustly handle instances with over 100 tasks. Moreover, the extensibility
of constraint formulations permits realistic application
and resource constraints, which is a limitation of common heuristic
methods for scheduling. The inherent extensibility, coupled
with improved run times from a decomposition strategy, positions constraint
optimization as a powerful tool for resource constrained
scheduling and multiprocessor design space exploration.
-
Design Closure Driven Delay Relaxation Based on Convex Cost Network Flow [p. 63]
-
C. Lin, A. Xie and H. Zhou
Design closure becomes hard to achieve at physical layout
stage due to the emergence of long global interconnects.
Consequently, interconnect planning needs to be integrated
in high level synthesis. Delay relaxation that assigns extra
clock latencies to functional resources at RTL (Register
Transfer Level) can be leveraged. In this paper we propose
a general formulation for the design-closure-driven delay relaxation
problem. We show that the general formulation can
be transformed into a convex cost integer dual network flow
problem and solved in polynomial time using the convex
cost-scaling algorithm in [1]. Experimental results validate
the efficiency of the approach.
Moderators: F. V. Fernandez, IMSE, CSIC and Seville U, ES, L. Hedrich, Frankfurt/M U, DE
-
Simulation-based Reusable Posynomial Models for MOS Transistor Parameters [p. 69]
-
V. Aggarwal and U.-M. O'Reilly
We present an algorithm to automatically design posynomial
models for parameters of the MOS transistors using
simulation data. These models improve the accuracy of the
Geometric Programming flow for automatic circuit sizing.
The models are reusable for multiple circuits on a given silicon
technology and hence don't adversely affect the scalability
of the Geometric Programming approach. The proposed
method is a combination of genetic algorithms and
Quadratic Programming. It is the only approach for posynomial
modeling with real-valued exponents which is easily
extensible to different error metrics. We compare the
proposed technique with state-of-the-art posynomial/monomial
modeling techniques and show its superiority.
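For readers unfamiliar with the term, a posynomial in the standard Geometric Programming sense can be written as follows. This is the general textbook definition, not the specific models fitted in the paper; the symbols K, c_k and a_{ik} are notation assumed here.

```latex
\[
f(x_1,\dots,x_n) \;=\; \sum_{k=1}^{K} c_k \prod_{i=1}^{n} x_i^{a_{ik}},
\qquad c_k > 0,\quad a_{ik} \in \mathbb{R},\quad x_i > 0 .
\]
```

A monomial is the special case K = 1. The real-valued exponents a_{ik} are what the abstract refers to as "real-valued exponents".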
-
Trade-Off Design of Analog Circuits Using Goal Attainment and "Wave Front" Sequential Quadratic
Programming [p. 75]
-
D. Mueller, H. Graeb and U. Schlichtmann
One of the main tasks in analog design is the sizing of
the circuit parameters, such as transistor lengths and
widths, in order to obtain optimal circuit performances,
such as high gain or low power consumption. In most
cases one performance can only be optimized at the cost of
others; a sizing must therefore aim at an optimal trade-off
between the important circuit performances.
In this paper we present a new deterministic method to
calculate the complete range of performance trade-offs,
the so-called Pareto-optimal front, of a given circuit
topology. Known deterministic methods solve a set of
constrained multi-objective optimization problems
independently of each other. The presented method
minimizes a set of Goal Attainment (GA) optimization
problems simultaneously. In a parallel algorithm, the
individual GA optimization processes compare and
exchange their iterative solutions. This leads to a
significant improvement in the efficiency and quality of
analog trade-off design.
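As background, a Goal Attainment problem is commonly stated in the following standard form. The notation is assumed here for illustration and is not taken from the paper: f_i are the circuit performances to be minimized, g_i the goal values, w_i non-negative weights, and γ the attainment factor.

```latex
\[
\begin{aligned}
\min_{x,\,\gamma}\quad & \gamma \\
\text{s.t.}\quad & f_i(x) - w_i\,\gamma \;\le\; g_i, \qquad i = 1,\dots,m .
\end{aligned}
\]
```

Each choice of goals and weights yields one point on the trade-off front, so sweeping them produces a set of such problems, which is the kind of problem set the abstract describes solving simultaneously.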
-
An Efficient Methodology for Hierarchical Synthesis of Mixed-Signal Systems with Fully Integrated
Building Block Topology Selection [p. 81]
-
T. Eeckelaert, R. Schoofs, G. Gielen, M. Steyaert and W. Sansen
A hierarchical synthesis methodology for analog and
mixed-signal systems is presented that, in a novel way, fully integrates
topology selection at all levels. A hierarchical system
optimizer takes multiple topologies for all the building blocks
at each hierarchical abstraction level, and generates optimal
topology combinations using multi-objective evolutionary
optimization techniques. With the presented methodology,
system-level performance trade-offs can be generated where
each design point contains valuable information on how the
system's performances are influenced by different combinations
of lower-level building block topologies. The generated
system designs can contain all kinds of topology combinations
as long as critical inter-block constraints are met. Different
topologies can be assigned to building blocks with the same
functional behavior, leading to more optimal hybrid designs
than typically obtained in manual designs. In the experimental
results, three different integrator topologies are used to
generate an optimal system-level exploration trade-off for a
complex high-speed ΔΣ A/D modulator.
-
A Coefficient Optimization and Architecture Selection Tool for ΣΔ Modulators in MATLAB [p. 87]
-
O. Yetik, O. Saglamdemir, S. Talay and G. Dündar
A tool created in the MATLAB environment for automatic
transfer function generation and topology synthesis for a
Sigma Delta Modulator for a desired frequency response
is proposed in this work. The tool carries out two basic
tasks: (1) transfer function generation, which works in
a SPICE-like fashion, taking the netlist of an arbitrary SD
modulator architecture in block level as the input, determining
the input-output relation for each block in z-domain and
generating the signal and noise transfer functions (STF
and NTF) of the system automatically, (2) a topology synthesis
algorithm which uses the STF and NTF as inputs
and finds all the possible SD modulator topologies (according
to some criteria such as minimization of the number of
signal paths) which can be obtained from the architecture
and which realize a desired frequency response. The application
of the tool is illustrated with examples.
Moderators: P. van der Wolf, NXP Semiconductors Research, NL, L. Thiele, ETH Zurich, CH
-
(694) Synthesis of Task and Message Activation Models in Real-Time Distributed Automotive Systems [p. 93]
-
W. Zheng, M. Di Natale, C. Pinello, P. Giusto and A. Sangiovanni Vincentelli
Modern automotive architectures support the execution
of distributed safety- and time-critical functions on a complex
networked system with several buses and tens of ECUs.
Schedulability theory allows the analysis of the worst case
end-to-end latencies and the evaluation of the possible architecture
configuration options with respect to timing constraints.
We present an optimization framework, based on
an ILP formulation of the problem, to select the communication
and synchronization model that leverages the trade-offs
between the purely periodic and the precedence constrained
data-driven activation models to meet the latency and jitter
requirements of the application. We demonstrate its effectiveness
by optimizing a complex automotive architecture.
-
(287) An ILP Formulation for System-Level Application Mapping on Network Processor Architectures [p. 99]
-
C. Ostler and K.S. Chatha
Current day network processors incorporate several architectural
features including symmetric multi-processing (SMP), block
multi-threading, and multiple memory elements to support the high
performance requirements of networking applications. We present
an automated system-level design technique for application development
on such architectures. The technique incorporates process
transformations and block multi-threading aware data mapping
to maximize the worst case throughput of the application.
We propose integer linear programming formulations for process
allocation and data mapping on SMP and block multi-threading
based network processors. The paper presents experimental results
that evaluate the technique by implementing representative
network processing applications on the Intel IXP 2400 architecture.
The results demonstrate that our technique is able to generate
high-quality mappings of realistic applications on the target
architecture within a short time.
-
(231) A Smooth Refinement Flow for Co-Designing HW and SW Threads [p. 105]
-
P. Destro, F. Fummi and G. Pravadelli
Separation of HW and SW design flows represents a critical aspect
in the development of embedded systems. Co-verification becomes
necessary, thus implying the development of complex cosimulation
strategies. This paper presents a refinement flow that delays
as much as possible the separation between HW and SW concurrent
entities (threads), allowing their differentiation, but preserving
a homogeneous simulation environment. The approach
relies on SystemC as the unique reference language. However, SystemC
threads, corresponding to the SW application, are simulated
outside the control of the SystemC simulation kernel to exploit the
typical features of multi-threading real-time operating systems running
on embedded systems. In contrast, HW threads maintain
the original simulation semantics of SystemC. This allows designers
to effectively tune the SW application before HW/SW partitioning,
leaving to an automatic procedure the SW generation, thus
avoiding error-prone and time-consuming manual conversions.
-
(521) Speeding Up SystemC Simulation through Process Splitting [p. 111]
-
Y. N. Naguib and R. S. Guindi
This paper presents a new approach that can be used
to speed up SystemC simulations by automatically
optimizing the model for simulation. The work addresses
the inefficiency of the standard SystemC scheduler that
may lead in some situations to unnecessary wake-up calls,
as well as unnecessary code execution. The method
presented analyzes the SystemC code to automatically
extract signal dependencies based on a set of rules. This
information is then used to split large processes into
smaller ones. Process splitting is performed by a tool,
SplitPro, which generates optimized code that can be
run on any standard SystemC engine. SplitPro was used to
analyze the description of an Alpha superscalar processor
and optimize some of its modules. A speed gain of up to
23% in simulation time was achieved over a number of
split processes.
-
(394) An FPGA Design Flow for Reconfigurable Network-Based Multi-Processor Systems on Chip [p. 117]
-
A. Kumar, A. Hansson, J. Huisken and H. Corporaal
Multi-Processor System on Chip (MPSoC) platforms are
becoming increasingly heterogeneous and are shifting
towards a more communication-centric methodology. Networks
on Chip (NoC) have emerged as the design paradigm
for scalable on-chip communication architectures. As the
system complexity grows, the problem emerges of how to
design and instantiate such a NoC-based MPSoC platform
in a systematic and automated way.
In this paper we present an integrated flow to automatically
generate a highly configurable NoC-based MPSoC for
FPGA instantiation. The system specification is done on a
high level of abstraction, relieving the designer of error-prone
and time-consuming work. The flow uses the state-of-the-art
Æthereal NoC, and Silicon Hive processing cores,
both configurable at design- and run-time.
We use this flow to generate a range of sample designs
whose functionality has been verified on a Celoxica
RC300E development board. The board, equipped with a
Xilinx Virtex II 6000, also offers a huge number of peripherals,
and we show how their insertion is automated in the
design for easy debugging and prototyping.
Moderators: W. Najjar, UC Riverside, US, F. Kurdahi, UC Irvine, US
-
Hard Real-Time Reconfiguration Port Scheduling [p. 123]
-
F. Dittmann and S. Frank
When modern partially and dynamically reconfigurable
FPGAs are to be used as resources in hard real-time systems,
the two dimensions, area and time, have to be considered
with respect to availability and deadlines. In particular,
area requirements must be guaranteed for the tasks'
duration. While execution environments that abstract the
space demand of tasks exist and methods for occupancy
of resources over time are discussed in the literature, few
works focus on another fundamental bottleneck, the reconfiguration
port. As all resource requests are served by this
mutually exclusive device, profound concepts for scheduling
the port access are vital requirements for FPGA real-time
scheduling. Nevertheless, as the port must be accessed
sequentially, we can inherit and apply monoprocessor
scheduling concepts that are well researched. In this
paper, we introduce monoprocessor scheduling algorithms
for the reconfiguration port of FPGAs.
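To illustrate the idea of treating the reconfiguration port as a single sequential resource, the following minimal sketch applies one classical monoprocessor policy, non-preemptive earliest deadline first (EDF), to a list of reconfiguration requests. The abstract does not say which algorithms the paper uses, so the policy, the `Request` fields and the function name `edf_port_schedule` are all assumptions made for this example.

```python
import heapq
from typing import List, NamedTuple, Tuple


class Request(NamedTuple):
    """A hypothetical reconfiguration request (all fields are assumptions)."""
    name: str
    release: int   # time at which the request arrives
    duration: int  # time the port is busy loading the configuration
    deadline: int  # absolute time by which loading must be finished


def edf_port_schedule(
    requests: List[Request],
) -> Tuple[List[Tuple[str, int, int]], List[str]]:
    """Serve requests one at a time, always picking the earliest deadline.

    Returns (timeline, missed), where timeline holds (name, start, finish)
    and missed lists the requests that finish after their deadline.
    """
    pending = sorted(requests, key=lambda r: r.release)
    ready: list = []  # min-heap of (deadline, index, request)
    timeline: List[Tuple[str, int, int]] = []
    missed: List[str] = []
    t, i = 0, 0
    while i < len(pending) or ready:
        # Port idle and nothing released yet: jump to the next arrival.
        if not ready and pending[i].release > t:
            t = pending[i].release
        # Move every request released by time t into the ready heap.
        while i < len(pending) and pending[i].release <= t:
            heapq.heappush(ready, (pending[i].deadline, i, pending[i]))
            i += 1
        # The port is mutually exclusive: serve exactly one request.
        _, _, req = heapq.heappop(ready)
        start, finish = t, t + req.duration
        timeline.append((req.name, start, finish))
        if finish > req.deadline:
            missed.append(req.name)
        t = finish
    return timeline, missed


if __name__ == "__main__":
    demo = [
        Request("A", release=0, duration=4, deadline=10),
        Request("B", release=0, duration=2, deadline=3),
        Request("C", release=1, duration=3, deadline=9),
    ]
    print(edf_port_schedule(demo))
```

Because the port serves one request at a time, the schedule is simply an ordering of the requests, which is why single-processor scheduling results carry over.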
-
An Efficient Algorithm for Online Management of 2D Area of Partially Reconfigurable FPGAs [p. 129]
-
J. Cui, Q. Deng, X. He and Z. Gu
Partially Runtime-Reconfigurable (PRTR) FPGAs allow
hardware tasks to be placed and removed dynamically at runtime.
We present an efficient algorithm for finding the complete
set of maximal empty rectangles on a 2D PRTR FPGA, which
is useful for online placement and scheduling of HW tasks. The
algorithm is incremental and only updates the local region affected
by each task addition or removal event. We use simulation
experiments to evaluate its performance and compare to
related work.
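To make the notion of a maximal empty rectangle concrete, here is a brute-force sketch that enumerates them on a small occupancy grid. It is deliberately not the incremental algorithm of the paper: it recomputes everything from scratch and is only practical for tiny grids. The grid representation and the function name `maximal_empty_rectangles` are assumptions for illustration.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def maximal_empty_rectangles(occupied: List[List[bool]]) -> List[Rect]:
    """Return every empty rectangle that cannot grow in any direction."""
    rows, cols = len(occupied), len(occupied[0])

    def is_empty(t: int, l: int, b: int, r: int) -> bool:
        # Rectangles that leave the grid count as not empty.
        if t < 0 or l < 0 or b >= rows or r >= cols:
            return False
        return not any(
            occupied[y][x] for y in range(t, b + 1) for x in range(l, r + 1)
        )

    result: List[Rect] = []
    for t in range(rows):
        for l in range(cols):
            for b in range(t, rows):
                for r in range(l, cols):
                    if not is_empty(t, l, b, r):
                        continue
                    # Try to grow by one cell row/column on each side.
                    grown = [
                        (t - 1, l, b, r),
                        (t, l - 1, b, r),
                        (t, l, b + 1, r),
                        (t, l, b, r + 1),
                    ]
                    if not any(is_empty(*g) for g in grown):
                        result.append((t, l, b, r))
    return result


if __name__ == "__main__":
    # 3x3 area with a single task occupying the centre cell.
    grid = [
        [False, False, False],
        [False, True, False],
        [False, False, False],
    ]
    for rect in maximal_empty_rectangles(grid):
        print(rect)
```

An empty rectangle is maximal exactly when none of its four one-step extensions is still empty, since any larger empty rectangle containing it would also contain one of those extensions.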
-
Improving Utilization of Reconfigurable Resources Using Two-Dimensional Compaction [p. 135]
-
A.A. El Farag, H.M. El-Boghdadi and S.I. Shaheen
Partial reconfiguration allows parts of the
reconfigurable chip area to be configured without affecting
the rest of the chip. This allows placement of tasks at run
time on the reconfigurable chip. Area management is a very
important issue that strongly affects the utilization of the
chip and hence the performance.
This paper focuses on a major aspect of moving running
tasks to free space for new incoming tasks (compaction).
We study the effect of compacting running tasks to free
more contiguous space on the system performance. First,
we introduce a straightforward compaction strategy called
Blind compaction. We use its performance as a reference to
measure the performance of other compaction algorithms.
Then we propose a two-dimensional compaction algorithm
called one-corner compaction. This algorithm runs with
respect to one chip corner. We further extend this algorithm
to the four corners of the chip and introduce the 4-corner
compaction algorithm. Finally, we compare the
performance of these algorithms with some existing
compaction strategies [3]. The simulation results show
improvement in average task allocation time when using
the 4-corner compaction algorithm by 15% and in chip
utilization by 16% over the Blind compaction. These results
outperform the existing strategies.
-
Low-Power Warp Processor for Power Efficient High-Performance Embedded Systems [p. 141]
-
R. Lysecky
Researchers previously proposed warp processors, a novel
architecture capable of transparently optimizing an executing
application by dynamically re-implementing critical kernels
within the software as custom hardware circuits in an on-chip
FPGA. However, the original warp processor design was
primarily performance-driven and did not focus on power
consumption, which is becoming an increasingly important
design constraint. Focusing on power consumption, we present
an alternative low-power warp processor design and
methodology that can dynamically and transparently reduce
power consumption of an executing application with no
degradation in system performance, achieving an average
reduction in power consumption of 74%. We further
demonstrate the flexibility of this approach to provide dynamic
control between high-performance and low-power consumption.
Keywords
Warp processing, low-power, hardware/software partitioning,
dynamically adaptable systems, embedded systems.
-
Using Dynamic Voltage Scaling to Reduce the Configuration Energy of Run Time Reconfigurable
Devices [p. 147]
-
Y. Qu, J.-P. Soininen and J. Nurmi
In this paper, an approach that uses dynamic voltage
scaling (DVS) to reduce the configuration energy of runtime
reconfigurable devices is proposed. The basic idea is
to use configuration prefetching and parallelism to create
excessive system idle time and apply DVS on the
configuration process when such idle time can be utilized.
A genetic algorithm is developed to solve the task
scheduling and voltage assignment problem. With real
applications, the results show that up to 19.3% of
configuration energy can be reduced. When considering
the reduction of the configuration energy, the results show
that using more computation resources is more favorable
when the configuration latency is relatively small, and
using more configuration controllers is more favorable
for relatively large latency.
-
A Shift Register Based Clause Evaluator for Reconfigurable SAT Solver [p. 153]
-
M. Safar, M. Shalan, M. W. El-Kharashi and A. Salem
Several approaches have been proposed to accelerate
the NP-complete Boolean Satisfiability problem (SAT) using
reconfigurable computing. We present an FPGA based
clause evaluator, where each clause is modeled as a shift
register that is either right shifted, left shifted, or held still
according to whether the current assigned variable value
satisfies, unsatisfies, or does not affect the clause, respectively.
For a given problem instance, the effect of the value of
each of its variables on its SAT formula is loaded in the
FPGA on-chip memory. This results in less configuration
effort and fewer hardware resources than other available
SAT solvers. Also, we present a new approach for implementing
conflict analysis based on a conflicting variables
accumulator and priority encoder to determine backtrack
level. Using these two new ideas, we implement an FPGA
based SAT solver performing depth-first search with non-chronological
conflict-directed backtracking. We compare
our SAT solver with other solvers through instances from
the DIMACS benchmark suite.
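The three-way effect of a variable assignment on each clause (satisfy, unsatisfy, no effect) can be sketched in software as below. This is only a loose analogue of the idea in the abstract: the abstract does not give the register layout, so two counters per clause stand in for the shift register, and all names (`effect_table`, `clause_status`, `RIGHT`, `LEFT`, `STILL`) are hypothetical. Clauses use DIMACS-style integer literals.

```python
from typing import Dict, List, Tuple

Clause = List[int]  # DIMACS-style literals: 3 means x3, -3 means NOT x3

# Effect of one assignment on one clause, named after the shift directions
# mentioned in the abstract (the mapping to counters below is an assumption).
RIGHT, LEFT, STILL = 1, -1, 0  # satisfy, unsatisfy, no effect


def effect_table(
    clauses: List[Clause], num_vars: int
) -> Dict[Tuple[int, bool], List[int]]:
    """Precompute, per (variable, value), its effect on every clause."""
    table: Dict[Tuple[int, bool], List[int]] = {}
    for var in range(1, num_vars + 1):
        for value in (False, True):
            true_lit = var if value else -var  # literal made true
            row = []
            for clause in clauses:
                if true_lit in clause:
                    row.append(RIGHT)
                elif -true_lit in clause:
                    row.append(LEFT)
                else:
                    row.append(STILL)
            table[(var, value)] = row
    return table


def clause_status(
    clauses: List[Clause],
    table: Dict[Tuple[int, bool], List[int]],
    assignment: Dict[int, bool],
) -> List[str]:
    """Classify each clause as 'satisfied', 'conflict' or 'open'."""
    sat = [0] * len(clauses)
    unsat = [0] * len(clauses)
    for var, value in assignment.items():
        for c, move in enumerate(table[(var, value)]):
            if move == RIGHT:
                sat[c] += 1
            elif move == LEFT:
                unsat[c] += 1
    status = []
    for c, clause in enumerate(clauses):
        if sat[c] > 0:
            status.append("satisfied")
        elif unsat[c] == len(set(clause)):
            status.append("conflict")  # every literal is false
        else:
            status.append("open")
    return status


if __name__ == "__main__":
    cnf = [[1, -2], [2, 3], [-1, -3]]
    tbl = effect_table(cnf, num_vars=3)
    print(clause_status(cnf, tbl, {1: True, 2: False}))
    print(clause_status(cnf, tbl, {1: True, 2: False, 3: False}))
```

The precomputed table plays the role of the per-variable effect data that the abstract describes as being loaded into on-chip memory; a clause reporting a conflict is what would trigger backtracking in a full solver.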
Moderators: J. Dielissen, NXP Research, NL, N. Dutt, UC Irvine, US
-
Efficient High-Performance ASIC Implementation of JPEG-LS Encoder [p. 159]
-
M. Papadonikolakis, V. Pantazis and A. P. Kakarountas
This paper introduces an innovative design
which implements a high-performance JPEG-LS encoder. The
encoding process follows the principles of the JPEG-LS
lossless mode. The proposed implementation consists of an
efficient pipelined JPEG-LS encoder, which operates at a
significantly higher encoding rate than any other JPEG-LS
hardware or software implementation while keeping area
small.
Index Terms - Image processing, lossless compression, JPEG-LS,
LOCO-I, VLSI implementation.
-
Improve CAM Power Efficiency Using Decoupled Match Line Scheme [p. 165]
-
Y.-J. Chang, Y.-H. Liao and S.-J. Ruan
Content addressable memory (CAM) is widely used in
many applications that require fast table lookup. Due to the
parallel comparison feature and high frequency of lookup,
however, the power consumption of CAM is usually significant.
In this paper we propose a decoupled match line scheme
which combines the performance advantage of the traditional
NOR-type CAM and the power efficiency of the traditional
NAND-type CAM. In our design, a CAM word is divided into
two segments, and then all the CAM cells are decoupled from
the match line. By minimizing both the match line
capacitances and switching activities, our design can largely
reduce the CAM power dissipated in search operations. The
results measured from the fabricated chip show that without
any performance penalty our design can reduce the search
energy consumption of the CAM by 89% compared to the
traditional NOR-type CAM design.
-
Cyclostationary Feature Detection on a Tiled-SoC [p. 171]
-
A. B. J. Kokkeler, G. J. M. Smit, T. Krol and J. Kuper
In this paper, a two-step methodology is introduced to
analyse the mapping of Cyclostationary Feature Detection
(CFD) onto a multi-core processing platform. In the first
step, the tasks to be executed by each core are determined
in a structured way using techniques known from the design
of array processors. In the second step, the implementation
of tasks on a processing core is analysed. Using this
methodology, it is shown that calculating a 127 x 127 Discrete
Spectral Correlation Function requires approximately
140 μs on a tiled System on Chip (SoC) with 4 Montium
cores.
-
Mapping Control-Intensive Video Kernels onto a Coarse-Grain Reconfigurable Architecture:
The H.264/AVC Deblocking Filter [p. 177]
-
C. Arbelo, A. Kanstein, S. López, J. F. López, M. Berekovic, R. Sarmiento and J.-Y. Mignolet
Deblocking filtering represents one of the most compute
intensive tasks in an H.264/AVC standard video decoder
due to its demanding memory accesses and irregular data
flow. For these reasons, an efficient implementation poses
big challenges, especially for programmable platforms. In
this sense, the mapping of this decoder's functionality
onto a C-programmable coarse-grained reconfigurable
architecture named ADRES (Architecture for
Dynamically Reconfigurable Embedded Systems) is
presented in this paper, including results from the
evaluation of different topologies. The results obtained
show a considerable reduction in the number of cycles
and memory accesses needed to perform the filtering as
well as an increase in the degree of instruction-level
parallelism (ILP) when compared with an implementation
on a Very Long Instruction Word (VLIW) dedicated
processor. This demonstrates that high ILP is achievable
on the ADRES even for irregular, data-dependent kernels.
-
An Efficient Hardware Architecture for H.264 Intra Prediction Algorithm [p. 183]
-
E. Sahin and I. Hamzaoglu
In this paper, we present an efficient hardware
architecture for real-time implementation of the intra prediction
algorithm used in the H.264 / MPEG4 Part 10 video coding
standard. The hardware design is based on a novel
organization of the intra prediction equations. This
hardware is designed to be used as part of a complete
H.264 video coding system for portable applications. The
proposed architecture is implemented in Verilog HDL. The
Verilog RTL code is verified to work at 90 MHz in a Xilinx
Virtex II FPGA. The FPGA implementation can process 27
VGA frames (640x480) per second.
-
An FPGA Implementation of Decision Tree Classification [p. 189]
-
R. Narayanan, D. Honbo, G. Memik, A. Choudhary and J. Zambreno
Data mining techniques are a rapidly emerging class of
applications that have widespread use in several fields. One
important problem in data mining is Classification, which
is the task of assigning objects to one of several predefined
categories. Among the several solutions developed, Decision
Tree Classification (DTC) is a popular method that
yields high accuracy while handling large datasets. However,
DTC is a computationally intensive algorithm, and as
data sizes increase, its running time can stretch to several
hours. In this paper, we propose a hardware implementation
of Decision Tree Classification. We identify the compute-intensive
kernel (Gini Score computation) in the algorithm,
and develop a highly efficient architecture, which is further
optimized by reordering the computations and by using a
bitmapped data structure. Our implementation on a Xilinx
Virtex-II Pro FPGA platform (with 16 Gini units) provides
up to 5.58x performance improvement over an equivalent
software implementation.
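For readers unfamiliar with the kernel named above, the following is a minimal software sketch of a Gini score for a binary split, using the textbook definition (one minus the sum of squared class fractions, weighted by partition size). The function names are hypothetical and the code does not reflect the hardware architecture, the reordering, or the bitmapped data structure of the paper.

```python
from collections import Counter

def gini_impurity(labels):
    """Textbook Gini impurity: 1 - sum over classes of (class fraction)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split_score(left_labels, right_labels):
    """Size-weighted Gini impurity of a binary split (lower is better)."""
    n = len(left_labels) + len(right_labels)
    if n == 0:
        return 0.0
    return (len(left_labels) * gini_impurity(left_labels)
            + len(right_labels) * gini_impurity(right_labels)) / n
```

A decision tree builder evaluates this score for every candidate split of every attribute, which is why it dominates the running time on large datasets.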
-
Radix 4 SRT Division with Quotient Prediction and Operand Scaling [p. 195]
-
N.R. Srivastava
SRT division is an efficient method for implementing high radix division circuits. However,
as the radix increases the size of a quotient digit selection table increases exponentially. To
overcome the limitations of quotient prediction, a method in which a quotient digit is speculated
has been proposed. The speculated quotient digit is utilized to update the possible partial
remainders while the speculated quotient is corrected. In this paper, instead of using a huge
quotient selection table, an estimation and correction scheme is used for prediction of the quotient
digit. The prediction is done in parallel with the calculation of the partial remainder for the
quotient predicted earlier thus improving the latency. In addition, since this method tends to
consume less area as the radix increases compared to previous methods, it has the ability to
improve higher radix implementations for SRT division.
Moderators: F. Novak, Jozef Stefan Institute, SL, R. Dorsch, IBM, Boeblingen, DE
-
SoC Testing Using LFSR Reseeding, and Scan-Slice-Based TAM Optimization and Test Scheduling [p. 201]
-
Z. Wang, K. Chakrabarty and S. Wang
We present an SoC testing approach that integrates
test data compression, TAM/test wrapper design, and test
scheduling. An improved LFSR reseeding technique is used as the
compression engine. All cores on the SoC share a single on-chip
LFSR. At any clock cycle, one or more cores can simultaneously
receive data from the LFSR. Seeds for the LFSR are computed
from the care bits from the test cubes for multiple cores. We
also propose a scan-slice-based scheduling algorithm that tries
to maximize the number of care bits the LFSR can produce
at each clock cycle, such that the overall test application time
is minimized. Experimental results for both ISCAS circuits and
industrial circuits show that optimal test application time, which
is determined by the largest core, can be achieved. The proposed
approach has small hardware overhead and is easy to deploy.
Only one LFSR, one phase shifter, and a few counters should
be added to the SoC. The scheduling algorithm is also scalable
for large industrial circuits. The CPU time for a large industrial
design ranges from 1 to 30 minutes.
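As a rough illustration of LFSR reseeding in general, the sketch below expands a seed into a bit stream and searches for a seed whose stream agrees with the care bits of a test cube. Everything here is an assumption made for illustration: the 4-bit register, the tap positions, the `x` notation for don't-care bits, and the brute-force search (practical tools solve a linear system over GF(2) instead). It does not model the shared LFSR, phase shifter, or scheduling described in the abstract.

```python
def expand(seed, taps, width, count):
    """Return `count` output bits of a right-shifting Fibonacci LFSR.

    seed  : initial register state (nonzero for a useful sequence)
    taps  : bit positions (0 = LSB) XORed together to form the feedback bit
    width : register length in bits
    """
    state = seed & ((1 << width) - 1)
    out = []
    for _ in range(count):
        out.append(state & 1)              # output is the LSB
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = (state >> 1) | (feedback << (width - 1))
    return out

def find_seed(cube, taps, width):
    """Brute-force search for a seed matching every care bit of `cube`.

    cube is a string over '0', '1', 'x' where 'x' marks a don't-care bit.
    Returns the smallest matching nonzero seed, or None if none exists.
    """
    for seed in range(1, 1 << width):
        bits = expand(seed, taps, width, len(cube))
        if all(c == 'x' or int(c) == b for c, b in zip(cube, bits)):
            return seed
    return None
```

Because only the care bits constrain the seed, a cube with many don't-care bits can usually be encoded by a seed much shorter than the cube itself, which is where the compression comes from.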
-
Optimized Integration of Test Compression and Sharing for SoC Testing [p. 207]
-
A. Larsson, E. Larsson, P. Eles and Z. Peng
The increasing test data volume needed to test core-based
System-on-Chip contributes to long test application times (TAT)
and huge automatic test equipment (ATE) memory requirements.
TAT and ATE memory requirement can be reduced by test
architecture design, test scheduling, sharing the same tests
among several cores, and test data compression. We propose, in
contrast to previous work that addresses one or a few of the
problems, an integrated framework with heuristics for sharing
and compression and a Constraint Logic Programming technique
for architecture design and test scheduling that minimizes the TAT
without violating a given ATE memory constraint. The
significance of our approach is demonstrated by experiments with
ITC'02 benchmark designs.
-
A Sophisticated Memory Test Engine for LCD Display Drivers [p. 213]
-
O. Spang, H.-M. Von Staudt and M.G. Wahl
Economic testing of small devices like LCD drivers
is a real challenge. In this paper we describe an approach
where a production tester is extended by a
memory test engine (MTE). This MTE, which consists
of hardware and software components, allows testing
the LCD driver memory at speed, allowing at the same
time the concurrent execution of other tests. It is fully
integrated into the tester. The MTE leads to a significant
increase of memory test quality and at the same
time to a significant reduction of the test time. The test
time reduction that was achieved by executing the
memory test in parallel to other analog tests led to the
test cost reduction, which was the impetus for developing
the MTE.
-
Formal Verification of a Pervasive Interconnect Bus System in a High-Performance Microprocessor [p. 219]
-
T. Le, T. Glökler and J. Baumgartner
In our high-performance PowerPC processor, the correctness
of the so-called pervasive interconnect bus system,
which provides, among others, Test and Debug access via
external interfaces like JTAG, is of utmost importance. In
this paper, we describe our approach in formally verifying
the correctness of this bus system to combat the coverage
problem of simulation-based techniques. The bus system
and the associated arbitration logic support several functionalities
such as deadlock detection and resolution. In order
to efficiently complete all of the required formal analysis
for verification, we needed to leverage a variety of proof
and semi-formal algorithms, as well as reduction and abstraction
algorithms. Experimental results are provided to
show the efficiency of this approach.
-
Low Cost Debug Architecture Using Lossy Compression for Silicon Debug [p. 225]
-
E. Anis and N. Nicolici
The size of on-chip trace buffers used for at-speed silicon
debug limits the observation window in any debug session.
Whenever the debug experiment can be repeated, we
propose a novel architecture for at-speed silicon debug that
enables a methodology where the designer can iteratively
zoom only in the intervals containing erroneous samples.
When compared to increasing the size of the trace buffer, the
proposed architecture has a small impact on silicon area,
while significantly reducing the number of debug sessions.
-
An SoC Test Scheduling Algorithm Using Reconfigurable Union Wrappers [p. 231]
-
T. Yoneda, M. Imanishi and H. Fujiwara
This paper presents a reconfigurable union wrapper that
can wrap multiple cores into a single wrapper design.
Moreover, we present a test scheduling algorithm to minimize
the test application time using the proposed reconfigurable
union wrapper. The proposed heuristic algorithm
can achieve short test application time with low computational
cost compared to the conventional approaches where
every core has its own wrapper. Experimental results for
the ITC'02 SOC Benchmarks show the effectiveness of our
approach.
keywords: system-on-a-chip, test scheduling, reconfigurable
union wrapper, test access mechanism
Moderator: A. González, Intel and UPC, ES
-
Microprocessors in the Era of Terascale Integration [p. 237]
-
S. Borkar, N.P. Jouppi and P. Stenstrom
Moore's Law will soon deliver tera-scale level transistor
integration capacity. Power, variability, reliability, aging,
and testing will pose barriers and challenges to harnessing
this integration capacity. Advances in microarchitecture
and programming systems discussed in this paper are
potential solutions.
Moderators: G. Vandersteen, IMEC, BE, J. Roychowdhury, Minnesota U, US
-
CMCal: An Accurate Analytical Approach for the Analysis of Process Variations with
Non-Gaussian Parameters and Nonlinear Functions [p. 243]
-
M. Zhang, M. Olbrich, D. Seider, M. Frerichs, H. Kinzelbach and E. Barke
As technology rapidly scales, performance variations
(delay, power etc.) arising from process variation are
becoming a significant problem. The use of linear models has
proven to be highly problematic in many of today's applications. Even
for well-behaved performance functions, linearising approaches
as well as quadratic models produce serious errors in calculating
the expected value, variance and higher central moments. In this
paper, we present a novel approach to analyse the impact of
process variations with low effort and minimal assumptions.
We formulate circuit performance as a function of the random
parameters and approximate it by Taylor Expansion up to 4th
order. Taking advantage of the knowledge about higher moments,
we convert the Taylor series to characteristics of performance
distribution. Our experiments show that this approach provides
highly accurate results even in strongly non-linear problems with
large process variations. Its simplicity, efficiency and accuracy
make this approach a promising alternative to the Monte Carlo
Method in most practical applications.
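To make the idea of converting a Taylor series into distribution characteristics concrete, the standard single-parameter case is sketched below. The notation is an assumption made for illustration: $f$ is the performance function, $x$ a random parameter with mean $\mu$ and central moments $m_k = \mathrm{E}[(x-\mu)^k]$. The multi-parameter formulation used in the paper is not given in the abstract and is not reproduced here.

```latex
\begin{align}
f(x) &\approx \sum_{k=0}^{4} \frac{f^{(k)}(\mu)}{k!}\,(x-\mu)^k \\
\mathrm{E}[f(x)] &\approx f(\mu)
  + \frac{f''(\mu)}{2}\,m_2
  + \frac{f'''(\mu)}{6}\,m_3
  + \frac{f^{(4)}(\mu)}{24}\,m_4
\end{align}
```

The first-order term vanishes because $m_1 = 0$. A purely linear model keeps only $f(\mu)$, so any skew or curvature in the expected value is lost, which is the kind of error the abstract attributes to linearising approaches.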
-
A Symbolic Methodology for the Verification of Analog and Mixed Signal Designs [p. 249]
-
G. Al-Sammane, M. H. Zaki and S. Tahar
We propose a new symbolic verification methodology for
proving the properties of analog and mixed signal (AMS)
designs. Starting with an AMS description and a set of properties
and using symbolic computation, we extract a normal
mathematical representation for the system in terms of recurrence
equations. These normalized equations are used
along with an induction verification strategy defined inside
the computer algebra system Mathematica to prove the correctness
of the properties. We apply our methodology on a
third-order ΔΣ modulator.
-
Efficient Nonlinear Distortion Analysis of RF Circuits [p. 255]
-
D. Tannir and R. Khazaka
Nonlinear distortion, typically defined using the third order
intercept point (IP3), is one of the key figures of merit
that are critical in the design of RF communication circuits.
The calculation of IP3 is typically based on analytical approaches
such as Volterra Series which are very complex
and difficult to apply to circuits of arbitrary complexity, or
on simulation based methods which require multi-tone inputs
and thus result in a very high CPU cost. In this paper
a new method based on the computation of the circuit moments
is proposed. The new approach uses the circuit moments
in order to numerically compute the Volterra kernels.
This automates the process of numerically obtaining such
kernels for any circuit and results in an efficient approach
for the computation of IP3.
-
Nonlinearity Analysis of Analog/RF Circuits Using Combined Multisine and Volterra Analysis [p. 261]
-
J. Borremans, L. De Locht, P. Wambacq and Y. Rolain
Modern integrated radio systems require highly linear
analog/RF circuits. Two-tone simulations are commonly used
to study a circuit's nonlinear behavior. Very often, however, this
approach offers limited insight. To gain insight into nonlinear
behavior, we use a multisine analysis methodology to locate the
main nonlinear components (e.g. transistors) both for weakly and
strongly nonlinear behavior. Under weakly nonlinear conditions,
selective Volterra analysis is used to further determine the most
important nonlinearities of the main nonlinear components. As
shown with an example of a 90 nm CMOS wideband low-noise
amplifier, the insights obtained with this approach can be used to
reduce nonlinear circuit behavior, in this case by 10 dB. The
approach is valid for wideband and thus practical excitation
signals, and is easily applicable both to simple and complex
circuits.
-
Optimizing Analog Filter Designs for Minimum Nonlinear Distortions Using Multisine Excitations [p. 267]
-
J. Lataire, G. Vandersteen and R. Pintelon
Nonlinear distortions in submicron analog circuits are
gaining importance, especially when power constraints are
imposed and when operating in moderate inversion. This
paper proposes a method to optimize the design of analog
filters for minimum noise and nonlinear distortions. For
this purpose a technique is presented for quantifying these
nonlinearities, such that their influence can be compared
with that of the system noise. Having quantified the nonidealities,
an optimization can be carried out which
involves the tuning of design parameters.
Moderators: T. Schattkowsky, Paderborn U, DE, W. Klingauf, TU Braunschweig, DE
-
Performance Analysis of Complex Systems by Integration of Dataflow Graphs and
Compositional Performance Analysis [p. 273]
-
S. Schliecker, S. Stein and R. Ernst
In this paper we integrate two established approaches to
formal multiprocessor performance analysis, namely Synchronous
Dataflow Graphs and Compositional Performance Analysis.
Both make different trade-offs between precision and applicability.
We show how the strengths of both can be combined
to achieve a very precise and adaptive model. We couple these
models of completely different paradigms by relying on load
descriptions of event streams. The results show a superior performance
analysis quality.
-
Tackling an Abstraction Gap: Co-Simulating with SystemC DE and Bluespec ESL [p. 279]
-
H.D. Patel and S.K. Shukla
The growing SystemC community for system level design exploration
is a result of SystemC's capability of modeling at RTL and
above RTL abstraction levels. However, managing shared state
concurrency using multi-threading in large SystemC models is error
prone. A recent extension of SystemC called Bluespec-SystemC
(BS-ESL) counters this difficulty with its model of computation
employing atomic rule-based specifications. However, for simulating
a model that is partly designed in SystemC and partly using
BS-ESL, an interoperability semantics and implementation of
such a semantics is required. This paper views the interoperability
problem as an abstraction gap closure problem. To illustrate
the problem, we formalize the simulation semantics of BS-ESL and
discrete-event simulation of RTL SystemC and provide a solution
based on this formalization.
-
A Calculator for Pareto Points [p. 285]
-
M. Geilen and T. Basten
This paper presents the Pareto Calculator, a tool for
compositional computation of Pareto points, based on the algebra
of Pareto points. The tool is a useful instrument for multidimensional
optimisation problems, design-space exploration and
development of quality management and control strategies. Implementations
of the operations of the algebra and their complexity
are discussed. In particular, we discuss a generalisation of the
well-known divide-and-conquer algorithm to compute the Pareto
points (optimal solutions) from a set of possible configurations,
also known as the maximal vector or skyline problem. The generalisation
lies in the fact that we allow for partially ordered domains
instead of only totally ordered ones. The calculator is available
through the following url: http://www.es.ele.tue.nl/pareto.
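For orientation, the sketch below computes the Pareto points of a small set of configurations with a simple quadratic-time filter. It is not the divide-and-conquer algorithm or the tool described above, and it makes assumptions of its own: each configuration is a tuple of numbers, every dimension is to be minimised, and the function names are hypothetical. It also covers only totally ordered numeric dimensions, whereas the paper generalises to partially ordered domains.

```python
def dominates(a, b):
    """True if a is no worse than b in every dimension and differs from b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def pareto_points(configs):
    """Return the configurations that no other configuration dominates.

    configs is an iterable of equal-length tuples; duplicates are dropped
    and the original order of first occurrence is preserved.
    """
    unique = list(dict.fromkeys(configs))
    return [c for c in unique if not any(dominates(o, c) for o in unique)]
```

For example, among (1, 5), (2, 2), (3, 3) and (5, 1), the point (3, 3) is dominated by (2, 2) and is discarded, while the other three are mutually incomparable and all remain.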
-
Modeling and Simulation to the Design of ΣΔ Fractional-N Frequency Synthesizer [p. 291]
-
S. Huang, H. Ma and Z. Wang
A set of behavioral voltage-domain Verilog-A/Verilog models
allowing a systematic design of the ΔΣ fractional-N
frequency synthesizer is discussed in the paper. The
approach allows the designer to accurately predict the
dynamic or stable characteristic of the closed loop by
including nonlinear effects of building blocks in the models.
The proposed models are implemented in a third-order ΔΣ
fractional-N PLL based frequency synthesizer with a 60MHz
frequency tuning range. Cadence SpectreVerilog simulation
results show that behavioral modeling can provide a great
speed-up over circuit-level simulation. At the same time, the
phase noise, spurs and settling time can also be accurately
predicted, which helps designers grasp the fundamentals at
an early stage of the design and optimize the design at the
system level. The key simulation results have been compared
against measured results obtained from an actual prototype,
validating the effectiveness of the proposed models.
-
System Level Power Optimization of Sigma-Delta Modulator [p. 297]
-
F. Gong and X. Wu
A new approach to power optimization of sigma-delta
modulators was presented, based on the modeling of noise
performance while deciding the system functions and the
sub-circuit specifications. A system model of a 2nd-order
modulator with a Matlab algorithm to optimize its
power specifications was developed. The system simulation
results showed that all specifications were consistent with
expectations. By using the proposed architecture, a
resolution of 16 bits was achieved.
-
Executable System-Level Specification Models Containing UML-Based Behavioral Patterns [p. 301]
-
L.S. Indrusiak, A. Thuy and M. Glesner
Behavioral patterns are useful abstractions to simplify the
design of communication-centric systems. Such
patterns are traditionally described using UML diagrams,
but the lack of execution semantics in UML prevents the
co-validation of the patterns together with simulation
models and executable specifications which are the
mainstream in today's system level design flows. This
paper proposes a method to validate UML-based
behavioral patterns within executable system models. The
method is based on actor orientation and was
implemented as an extension of the Ptolemy II framework.
A case study is presented and potential applications and
extensions of the proposed method are discussed.
Moderators: W. Luk, Imperial College, London, UK, R. Lysecky, Arizona U, US
-
Assessing Carbon Nanotube Bundle Interconnect for Future FPGA Architectures [p. 307]
-
S. Eachempati, A. Nieuwoudt, A. Gayasen, N. Vijaykrishnan and Y. Massoud
Field Programmable Gate Arrays (FPGAs) are important
hardware platforms in various applications due to increasing
design complexity and mask costs. However, as
CMOS process technology continues to scale, standard copper
interconnect will become a major bottleneck for FPGA
performance. In this paper, we propose utilizing bundles of
single-walled carbon nanotubes (SWCNT) as wires in the
FPGA interconnect fabric and compare their performance
to standard copper interconnect in future process technologies.
To leverage the performance advantages of nanotube-based
interconnect, we explore several important aspects
of the FPGA routing architecture including the segmentation
distribution and the internal population of the wires.
The results demonstrate that FPGAs utilizing SWCNT bundle
interconnect can achieve a 19% improvement in average
area-delay product over the best performing architecture for
standard copper interconnect in 22 nm process technology.
-
Two-Level Microprocessor-Accelerator Partitioning [p. 313]
-
S. Sirowy, Y. Wu, S. Lonardi and F. Vahid
The integration of microprocessors and field-programmable gate
array (FPGA) fabric on a single chip increases both the utility
and necessity of tools that automatically move software functions
from the microprocessor to accelerators on the FPGA to improve
performance or energy. Such hardware/software partitioning for
modern FPGAs involves the problem of partitioning functions
among two levels of accelerator groups - tightly-coupled
accelerators that have fast single-clock-cycle memory access to
the microprocessor's memory, and loosely-coupled accelerators
that access memory through a bridge to avoid slowing the main
clock period with their longer critical paths. We introduce this
new two-level accelerator-partitioning problem, and we describe
a novel optimal dynamic programming algorithm to solve the
problem. By making use of the size constraint imposed by FPGAs,
the algorithm has what is effectively quadratic runtime
complexity, running in just a few seconds for examples with up to
25 accelerators, obtaining an average performance improvement
of 35% compared to a traditional single-level bus architecture.
-
Design Space Exploration of Partially Re-Configurable Embedded Processors [p. 319]
-
A. Chattopadhyay, W. Ahmed, K. Karuri, D. Kammler, R. Leupers, G. Ascheid and H. Meyr
In today's embedded processors, performance and flexibility have
become the two key attributes. These attributes are often conflicting.
The best performance is obtained from custom designed integrated
circuits. In contrast, the maximum flexibility is delivered by a general
purpose processor. Among the architecture types that have emerged over the
past years to strike an optimum balance between these two attributes,
two are prominent. The first are Field Programmable Gate Array
(FPGA)-based architectures and the second are Application-specific
Instruction-set Processors (ASIPs). Depending on the type
of application (i.e. stream-like or control-dominated) either one of
the abovementioned architecture types is able to deliver high performance
or flexibility or both. Consequently, a new design approach
with partial re-configurability on the application-specific processor
is attracting strong research interest. We call this architecture reconfigurable
ASIP (rASIP). Currently, the lack of a high-level abstraction
of the rASIP limits the designer from trying out various
design alternatives because of long and tedious exploration cycles.
To address this issue, in this paper, a high-level specification for reconfigurable
processors is proposed. Furthermore, a seamless design
space exploration methodology using this specification is proposed.
-
Generating and Executing Multi-Exit Custom Instructions for an Adaptive Extensible Processor [p. 325]
-
H. Noori, F. Mehdipour, K. Murakami, K. Inoue and M. Goudarzi
To improve the performance of embedded processors,
an effective technique is collapsing critical computation
subgraphs as application-specific instruction set
extensions and executing them on custom functional units.
The drawbacks of this approach are its high cost and long
design time. To address these issues, we propose an
adaptive extensible processor in which custom instructions
(CIs) are generated and added after chip-fabrication. To
support this feature, custom functional units are replaced
by a reconfigurable matrix of functional units with the
capability of conditional execution. Unlike previously
proposed CIs, ours can include multiple exits.
Experimental results show that multi-exit CIs enhance the
performance by 46% on average compared to CIs limited
to one basic block. A maximum speedup of 2.89 compared
to a 4-issue in-order RISC processor, and a speedup of
1.66 on average, was achieved on the MiBench benchmark
suite.
Moderators: M. Heijligers, NXP IC Lab, NL, N. Wehn, Kaiserslautern U, DE
-
Low Complexity LDPC Code Decoders for Next Generation Standards [p. 331]
-
T. Brack, M. Alles, T. Lehnigk-Emden, F. Kienle, N. Wehn, N.E. L'Insalata, F. Rossi, M. Rovini
and L. Fanucci
This paper presents the design of low-complexity LDPC
code decoders for the upcoming WiFi (IEEE 802.11n),
WiMax (IEEE 802.16e) and DVB-S2 standards. A complete
exploration of the design space spanning from the decoding
schedules, the node processing approximations up to
the top-level decoder architecture is detailed. Based
on this exploration, state-of-the-art techniques for low-complexity
design have been adopted in order to achieve feasible
high-throughput decoder implementations. An analysis
of the standardized codes from the decoder-aware point of
view is also given, presenting, for each one, the implementation
challenges (multi rates-length codes) and bottlenecks
related to the complete coverage of the standards. Synthesis
results on a present 65nm CMOS technology are provided
on a generic decoder architecture.
-
Non-Fractional Parallelism in LDPC Decoder Implementations [p. 337]
-
J. Dielissen and A. Hekstra
Because of its excellent bit-error-rate performance, the
Low-Density Parity-Check (LDPC) decoding algorithm is
gaining increased attention in communication standards
and literature. The new Chinese Digital Video Broadcast
standard (CDVB-T) also uses LDPC codes. This standard
uses a large prime number as the parallelism factor, leading
to high area cost. In this paper we present a new method
to allow fractional dividers to be used. The method depends
on the property that consecutive sub-circulants have one
memory row in common. Several techniques are shown for
assuring this property, or solving memory conflicts, making
the method more generally applicable. In fact, the proposed
technique is a first step towards a general purpose
LDPC processor. For the CDVB-T decoder implementation
the method leads to a factor 3 improvement in area.
-
Minimum-Energy LDPC Decoder for Real-Time Mobile Application [p. 343]
-
W. Wang and G. Choi
This paper presents a low-power real-time decoder
that provides constant-time processing of each frame using
dynamic voltage and frequency scaling. The design uses a known
capacity-approaching low-density parity-check (LDPC) code to
contain data over fading channels. Real-time applications require
guaranteed data rates. While conventional schemes with a fixed
number of decoding iterations are not energy efficient for mobile
devices, the proposed heuristic scheme pre-analyzes each received
data frame to estimate the maximum number of necessary
iterations for frame convergence. The results are then used
to dynamically adjust decoder frequency. Energy use is then
reduced appropriately by adjusting power supply voltage to
the minimum necessary for the given frequency. The resulting design
provides a judicious trade-off between power consumption and
error level.
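The control idea in this abstract (estimate the iterations a frame needs, derive the clock frequency that still meets the frame deadline, then choose the lowest supply voltage that supports that frequency) can be sketched as follows. All names, the cycles-per-iteration parameter, and the table of operating points are hypothetical values chosen for illustration; the paper's iteration estimator and actual voltage/frequency pairs are not given in the abstract.

```python
def required_frequency_hz(iterations, cycles_per_iteration, frames_per_second):
    """Clock frequency needed to finish `iterations` decoding iterations
    within one frame period, so that every frame takes constant time."""
    return iterations * cycles_per_iteration * frames_per_second

def pick_operating_point(required_hz, points):
    """Choose the slowest (frequency_hz, vdd_volts) pair that still meets
    the required frequency, or None if no point is fast enough.

    Assumes a lower frequency always pairs with a lower or equal voltage.
    """
    feasible = [p for p in points if p[0] >= required_hz]
    return min(feasible) if feasible else None

# Hypothetical operating points: (frequency in Hz, supply voltage in V).
OPERATING_POINTS = [(100_000_000, 0.8), (200_000_000, 1.0), (400_000_000, 1.2)]
```

A frame estimated to need few iterations is decoded at a low frequency and voltage, while a noisy frame needing many iterations raises both, so the per-frame latency stays fixed and only the energy varies.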
-
Pipelined Implementation of a Real Time Programmable Encoder for Low Density Parity
Check Code on a Reconfigurable Instruction Cell Architecture [p. 349]
-
Z. Khan and T. Arslan
This paper presents a pipelined implementation of a real
time programmable irregular Low Density Parity
Check (LDPC) Encoder as specified in the IEEE
P802.16E/D7 standard. The encoder is programmable
for frame sizes from 576 to 2304 and for five different
code rates. The H matrix is efficiently generated and stored
for a particular frame size and code rate. The encoder
is implemented on the Reconfigurable Instruction Cell
Architecture, which has recently emerged as an ultra
low power, high performance, ANSI-C programmable
embedded core. Different general and architecture
specific optimization techniques are applied to enhance
the throughput. With the architecture, a throughput
from 10 to 19 Mbps has been achieved. The maximum
throughput achieved with pipelining/multi-core is 78
Mbps.
-
Implementation of AES/Rijndael on a Dynamically Reconfigurable Architecture [p. 355]
-
C. Mucci, L. Vanzolini, A. Lodi, A. Deledda, R. Guerrieri, F. Campi and M. Toma
Reconfigurable architectures provide the user the capability
to couple performance typical of hardware design
with the flexibility of the software. In this paper, we present
the design of AES/Rijndael on a dynamically reconfigurable
architecture. We will show a performance improvement of
three orders of magnitude compared to the reference code
and up to a 24x speed-up with respect to fast C implementations
over a RISC processor. A maximum throughput of 546
Mbit/sec is achieved. Compared to prior art, we show better
energy efficiency with respect to the other programmable
solutions, obtaining up to 3 Mbit/sec/mW.
Moderators: Z. Peng, Linkoping U, SE; J. Raik, TU Tallinn, EE
-
Using the Inter- and Intra-Switch Regularity in NoC Switch Testing [p. 361]
-
M. Hosseinabady, A. Dalirsani and Z. Navabi
This paper proposes an efficient test methodology to test
switches in a Network-on-Chip (NoC) architecture. A switch
in an NoC consists of a number of ports and a router. Using
the intra-switch regularity among ports of a switch and
inter-switch regularity among routers of switches, the
proposed method decreases the test application time and
test data volume of NoC testing. Using a test source to
generate test vectors and scan-based testing, this
methodology broadcasts test vectors through the minimum
spanning tree of the NoC and concurrently tests its switches.
In addition, a possible fault is detected by comparing test
results using the inter- or intra- switch comparisons. The
logic and memory parts of a switch are tested by
appropriate memory and logic testing methods.
Experimental results show less test application time and test
power consumption, as compared with other methods in the
literature.
-
Toward a Scalable Test Methodology for 2D-mesh Network-on-Chips [p. 367]
-
K. Petersén and J. Öberg
This paper presents a BIST strategy for testing the NoC
interconnect network, and investigates if the strategy is a
suitable approach for the task. All switches and links in
the NoC are tested with BIST, running at full clock-speed,
and in a functional-like mode. The BIST is carried out as
a go/no-go BIST operation at start up, or on command. It
is shown that the proposed methodology can be applied
for different implementations of deflecting switches, and
that the test time is limited to a few thousand clock cycles
with fault coverage close to 100%.
-
Remote Testing and Diagnosis of System-on-Chips Using Network Management Frameworks [p. 373]
-
O. Laouamri and C. Aktouf
This paper presents a new approach that allows remote
testing and diagnosis of complex Systems-on-Chip (SoCs) and
embedded IP cores. The approach extends both on-chip
design-for-test (DFT) architectures and network
management protocols to take full advantage of existing
networking infrastructures. By running intensive
experimentation on ITC'99 and ITC'02 design
benchmarks, the efficiency of the proposed testing and
diagnosis methodology is analyzed.
Moderators: P. Pop, DTU, DK; S. Chakraborty, National U of Singapore, SG
-
Fast Memory Footprint Estimation Based on Maximal Dependency Vector Calculation [p. 379]
-
Q. Hu, A. Vandecappelle, P.G. Kjeldsberg, F. Catthoor and M. Palkovic
In data dominated applications, loop transformations have a
huge impact on the lifetime of array data and therefore on memory
footprint. Since a locally optimal loop transformation may have
a detrimental effect somewhere else, many alternative loop transformations
need to be explored. Therefore, estimation of the memory
footprint is essential, and this estimation has to be fast. This paper
presents a fast array based memory footprint estimation technique
based on counting of iteration nodes in an iteration domain constrained
by a maximal lifetime. The maximal lifetime is defined by
the Maximal Dependency Vector (MDV) of the array for a given
execution ordering. We further present for the first time two approaches
for calculation of the MDV: a general approach based
on an ILP formulation and a novel vertexes approach when iteration
domains are approximated by bounding boxes. Experiments on
practical test vehicles demonstrate that the estimation based on our
vertexes approach is extremely fast, on average two orders of magnitude
faster than the compared approaches, while still keeping the
accuracy high. This enables system-level data memory footprint exploration
of many different alternative transformed program codes,
within interactive time limits, and on realistic complex applications.
-
Mapping Multi-Dimensional Signals into Hierarchical Memory Organizations [p. 385]
-
H. Zhu, I.I. Lucian and F. Balasa
The storage requirements of the array-dominated and loop-organized
algorithmic specifications running on embedded systems
can be significant. Employing a data memory space much
larger than needed has negative consequences on the energy consumption,
latency, and chip area. Finding an optimized storage of
the usually large arrays from these algorithmic specifications is an
important step during memory allocation. This paper proposes an
efficient algorithm for mapping multi-dimensional arrays to the
data memory. Similarly to [13], it computes bounding windows
for live elements in the index space of arrays, but this algorithm is
several times faster. Moreover, since this algorithm works not only
for entire arrays, but also for parts of arrays (for instance, array
references or, more generally, sets of array elements represented
by lattices [11]), this signal-to-memory mapping technique can
also be applied in multi-layer memory hierarchies.
-
The Impact of Loop Unrolling on Controller Delay in High Level Synthesis [p. 391]
-
S. Kurra, N.K. Singh and P.R. Panda
Loop unrolling is a well-known compiler optimization that
can lead to significant performance improvements. When
used in High Level Synthesis (HLS) unrolling can affect the
controller complexity and delay. We study the effect of the
loop unrolling factor on the delay of controllers generated
during HLS. We propose a technique to predict controller
delay as a function of the loop unrolling factor, and use this
prediction with other search space pruning methods to automatically
determine the optimal loop unrolling factor that
results in a controller whose delay fits into a specified time
budget, without an exhaustive exploration. Experimental results
indicate delay predictions that are close to measured
delays, yet significantly faster than exhaustive synthesis.
-
Clock-Frequency Assignment for Multiple Clock Domain Systems-on-a-Chip [p. 397]
-
S. Sirowy, Y. Wu, S. Lonardi and F. Vahid
Modern systems-on-a-chip platforms support multiple clock
domains, in which different sub-circuits are driven by different
clock signals. Although the frequency of each domain can be
customized, the number of unique clock frequencies on a platform
is typically limited. We define the clock-frequency assignment
problem to be the assignment of frequencies to processing modules,
each with an ideal maximum frequency, such that the sum of
module processing times is minimized, subject to a limit on the
number of unique frequencies. We develop a novel polynomial-time
optimal algorithm to solve the problem, based on dynamic
programming. We apply the algorithm to the particular context of
post-improvement of accelerator-based hardware/software
partitioning, and demonstrate 1.5x-4x additional speedups using
just three clock domains.
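The dynamic-programming idea can be illustrated with a minimal sketch. It is not the authors' algorithm; it assumes that a module's processing time is its cycle count divided by its assigned frequency, that a module may run at any frequency up to its ideal maximum, and that an optimal solution groups modules contiguously once they are sorted by maximum frequency. The function name `assign_frequencies` and the input format are hypothetical.

```python
def assign_frequencies(modules, k):
    """modules: list of (cycles, max_freq); k: limit on unique frequencies (k >= 1).

    Returns (total_time, per-module frequency in input order).
    Assumption: time = cycles / freq, and each module runs at or below max_freq.
    """
    n = len(modules)
    order = sorted(range(n), key=lambda i: modules[i][1])
    cycles = [modules[i][0] for i in order]
    freqs = [modules[i][1] for i in order]

    # Prefix sums of cycle counts over the sorted modules.
    prefix = [0]
    for c in cycles:
        prefix.append(prefix[-1] + c)

    inf = float("inf")
    # best[g][i]: minimum time for sorted modules i..n-1 using at most g frequencies.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[n] * (n + 1) for _ in range(k + 1)]
    for g in range(k + 1):
        best[g][n] = 0.0
    for g in range(1, k + 1):
        for i in range(n - 1, -1, -1):
            for e in range(i + 1, n + 1):
                # Modules i..e-1 share one clock, limited by the slowest (module i).
                t = (prefix[e] - prefix[i]) / freqs[i] + best[g - 1][e]
                if t < best[g][i]:
                    best[g][i] = t
                    cut[g][i] = e

    # Walk the recorded cuts to recover the frequency of every module.
    chosen = [0] * n
    i, g = 0, k
    while i < n:
        e = cut[g][i]
        for j in range(i, e):
            chosen[order[j]] = freqs[i]
        i, g = e, g - 1
    return best[k][0], chosen
```

With three modules of 100 cycles each and maximum frequencies 100, 50 and 200, a limit of two frequencies leaves the slowest module alone at 50 and runs the other two at 100. The triple loop makes this sketch O(k n^2), which is polynomial as the abstract states, though the paper's own formulation may differ.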
-
System-Level Process Variation Driven Throughput Analysis for Single and Multiple Voltage-Frequency Island Designs [p. 403]
-
S. Garg and D. Marculescu
Manufacturing process variations are the primary cause of timing yield
loss in aggressively scaled technologies. In this paper, we analyze the
impact of process variations on the throughput (rate) characteristics of
embedded systems comprised of multiple voltage-frequency islands (VFIs)
represented as component graphs. We provide an efficient, yet accurate
method to compute the throughput of an application in a probabilistic scenario
and show that systems implemented with multiple VFIs are more
likely to meet throughput constraints than their fully synchronous counterparts.
The proposed framework allows designers to investigate the impact
of architectural decisions such as the granularity of VFI partitioning on
their designs, while determining the likelihood of a system meeting specified
throughput constraints. An implementation of the proposed framework
is accurate within 1.2% of Monte Carlo simulation while yielding speedups
ranging from 78X-260X, for a set of synthetic benchmarks. Results on
a real benchmark (MPEG-2 encoder) show that a nine clock domain implementation
gives 100% yield for a throughput constraint for which a fully
synchronous design only yields 25%. For the same throughput constraint,
a three clock domain architecture yields 78%.
-
Reliability-Aware System Synthesis [p. 409]
-
M. Glass, M. Lukasiewycz, T. Streichert, C. Haubelt and J. Teich
Increasing reliability is one of the most important design
goals for current and future embedded systems. In this
paper, we focus on the design phase in which reliability
constitutes one of several competing design objectives.
Existing approaches considered the simultaneous optimization
of reliability with other objectives to be too extensive.
Hence, they firstly design a system, secondly analyze
the system for reliability and finally exchange critical parts
or introduce redundancy in order to satisfy given reliability
constraints or optimize reliability. Unfortunately, this may
lead to suboptimal designs concerning other design objectives.
Here, we will present a) a novel approach that considers
reliability with all other design objectives simultaneously,
b) an evaluation technique that is able to perform a
quantitative analysis in reasonable time even for real-world
applications, and c) experimental results showing the effectiveness
of our approach.
Moderators: A. Rodriguez-Vazquez, AnaFocus, ES; M. Glesner, TU Darmstadt, DE
-
Flexibility-oriented Design Methodology for Reconfigurable Delta Sigma Modulators [p. 415]
-
P. Sun, Y. Wei and A. Doboli
This paper presents a systematic methodology for producing
reconfigurable ΣΔ modulator topologies with optimized
flexibility in meeting variable performance specifications.
To increase their flexibility, topologies are optimized
for performance attributes pertaining to ranges of values
rather than single values. Topologies are implemented
on switched-capacitor reconfigurable mixed-signal
architectures. As the number of configurable blocks is very
small, it is extremely important that the topologies use as
few blocks as possible. A case study illustrates the methodology
for specifications from the telecommunications area.
-
Experimental Validation of a Tuning Algorithm for High-Speed Filters [p. 421]
-
G. Matarrese, C. Marzocca, F. Corsi, S. D'Amico and A. Baschirotto
We report here the results of some laboratory
experiments performed to validate the effectiveness of a
technique for the self-tuning of integrated continuous-time,
high-speed active filters. The tuning algorithm is
based on the application of a pseudo-random input
sequence of rectangular pulses to the device to be tuned
and on the evaluation of a few samples of the input-output
cross-correlation function which constitute the filter
signature.
The key advantages of this technique are the ease of the
input test pattern generation and the simplicity of the
output circuitry which consists of a digital cross-correlator.
The technique achieves a tuning error mainly
dominated by the value of the elementary capacitors
employed in the tuning circuitry. The time required to
perform the tuning is kept within a few microseconds. This
appears particularly interesting for applications to
telecommunication multi-standard terminals.
The experiments regarding the application of the
proposed tuning algorithm to a baseband multi-standard
filter confirm most of the simulation results and show the
robustness of the technique against practical operating
conditions and noise.
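The signature idea can be sketched in a few lines. This is a discrete-time illustration only: the filter in the paper is continuous-time, whereas here a short FIR impulse response stands in for it, and the PRBS7 generator, the function names and the squared-error metric are all assumptions made for the example. For a maximal-length +/-1 sequence, the first few samples of the input-output cross-correlation approximate the impulse response, so they can serve as a signature to compare against a reference.

```python
def prbs7(length=127, seed=0x7F):
    """Return `length` samples of a +/-1 PRBS7 sequence (x^7 + x^6 + 1)."""
    state, out = seed, []
    for _ in range(length):
        bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | bit) & 0x7F
        out.append(1 if bit else -1)
    return out


def periodic_response(x, h):
    """Steady-state output of an FIR filter h driven by the periodic input x."""
    n = len(x)
    return [sum(h[j] * x[(i - j) % n] for j in range(len(h))) for i in range(n)]


def signature(x, y, lags):
    """First `lags` samples of the circular input-output cross-correlation."""
    n = len(x)
    return [sum(x[i] * y[(i + k) % n] for i in range(n)) / n for k in range(lags)]


def tuning_error(measured, reference):
    """Squared distance between a measured signature and the reference one."""
    return sum((m - r) ** 2 for m, r in zip(measured, reference))
```

A tuning loop under these assumptions would step through the available capacitor settings, measure `signature` for each, and keep the setting with the smallest `tuning_error` against the signature of the nominal filter.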
-
Design of High-Resolution MOSFET-Only Pipelined ADCs with Digital Calibration [p. 427]
-
H. Aminzadeh, M. Danaie and R. Lotfi
Design of low-voltage high-resolution MOSFET-only pipeline
analog to digital converters (ADCs) has been investigated in
this work. The nonlinearity caused by replacing linear MIM
capacitors with compensated depletion-mode MOS transistors
in all 1.5-bit residue stages of the ADC has been properly
modeled to be calibrated in digital domain. The proposed
calibration technique makes it possible to digitally
compensate the nonlinearity of a 1.8V 12-bit 65MS/s
MOSFET-only ADC in 0.18μm standard digital CMOS
technology. It improves the signal-to-noise-plus-distortion
ratio (SNDR) and spurious-free dynamic range
(SFDR) by approximately 27dB and 35dB respectively.
-
A New Technique for Characterization of Digital-to-Analog Converters in High-Speed Systems [p. 433]
-
J. Savoj, A.-A. Abbasfar, A. Amirkhany, B. W. Garlepp and M. A. Horowitz
In this paper, a new technique for characterization of digital-to-analog
converters (DAC) used in wideband applications is
described. Unlike the standard narrowband approach, this
technique employs Least Square Estimation to characterize the
DAC from dc to any target frequency. Characterization is
performed using a random sequence with certain temporal and
probabilistic characteristics suitable for intended operating
conditions. The technique provides a linear estimation of the
system and decomposes nonlinearity into higher-order harmonics
and deterministic periodic noise. The technique can also be used
to derive the impulse response of the converter, predict its
operating bandwidth, and provide far more insight into its sources
of distortion.
Organizer: M. Casale-Rossi, Synopsys, Italy
Moderator: A. Strojwas, Carnegie Mellon U, US
-
DFM/DFY: Should You Trust the Surgeon or the Family Doctor? [p. 439]
-
Panelists: R. Aitken, A. Domic, C. Guardiani, P. Magarshack, D. Pattullo, J. Sawicki
Everybody agrees that curing DFM/DFY issues is of
paramount importance at 65 nanometers and beyond.
Unfortunately, there is disagreement about how and when
to cure them. "Surgeons" suggest a GDSII-centered
approach, potentially invasive, while "family doctors"
recommend a more pervasive approach, starting from
RTL. As in real life, "surgery" and "medicine" represent
two different schools of thought in the DFM/DFY arena.
Both involve risks.
This panel will examine these two approaches from
high-level design all the way to manufacturing. We have
assembled a set of panelists that represent a broad cross-section
of the semiconductor industry. Although there is
general agreement among the panelists that both
approaches are necessary and that prevention is the best
way to proceed, they also acknowledge that the surgery
may be unavoidable in such "hazardous" conditions as
state-of-the-art technologies.
However, as always, "the devil is in the details," and
the diverse approaches to DFM presented below should
make this panel quite interesting. We are also counting
on the feedback from the IC design community to assess
if these approaches are sufficient and practical enough to
deal with the "health hazards". We are looking forward to
an exciting discussion that will challenge our esteemed
panelists.
Moderators: F. Ferrandi, Politecnico di Milano, IT; T. Henriksson, NXP Semiconductors Research, NL
-
Automatic Synthesis of Compressor Trees: Reevaluating Large Counters [p. 443]
-
A.K. Verma and P. Ienne
Despite the progress of the last decades in electronic design automation,
arithmetic circuits have always received far less attention
than other classes of digital circuits. Logic synthesisers, which
play a fundamental role in design today, play a minor role on most
arithmetic circuits, performing some local optimisations but hardly
improving the overall structure of arithmetic components. Architectural
optimisations have been often studied manually, and only
in the case of very common building blocks such as fast adders and
multi-input adders, ad-hoc techniques have been developed. A notable
case is multi-input addition, which is the core of many circuits
such as multipliers, etc. The most common technique to implement
multi-input addition is using compressor trees, which are often
composed of carry-save adders (based on (3 : 2) counters, i.e.,
full adders). A large body of literature exists to implement compressor
trees using large counters. However, all the large counters
were built by using full and half adders recursively. In this paper
we give some definite answers to issues related to the use of large
counters. We present a general technique to implement large counters
whose performance is much better than the ones composed of
full and half adders. Also we show that it is not always useful to
use larger optimised counters and sometimes a combination of various
size counters gives the best performance. Our results show
15% improvement in the critical path delay. In some cases even
hardware area is reduced by using our counters.
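As background for the terminology, the following minimal sketch reduces a multi-input addition with a carry-save tree built only from (3 : 2) counters, i.e. full adders applied bitwise across whole words. It illustrates the conventional baseline, not the large counters proposed in the paper; the function names are hypothetical and operands are assumed to be non-negative integers.

```python
def counter_3to2(a, b, c):
    """Bitwise (3:2) counter on whole words: three operands in, sum and carry out."""
    s = a ^ b ^ c                                # per-bit sum without carry propagation
    carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, shifted to next weight
    return s, carry


def compressor_tree_sum(operands):
    """Reduce the operands to two words with (3:2) counters, then add them once."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        for i in range(0, len(ops) - 2, 3):
            nxt.extend(counter_3to2(ops[i], ops[i + 1], ops[i + 2]))
        nxt.extend(ops[len(ops) - len(ops) % 3:])  # pass through 1 or 2 leftover operands
        ops = nxt
    return sum(ops)  # the single final carry-propagate addition
```

Each tree level turns every group of three words into two, so only the last step needs a carry-propagate adder. A larger counter compresses more inputs per level, which is the trade-off the abstract re-examines.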
-
Area Optimization of Multi-Cycle Operators in High-Level Synthesis [p. 449]
-
M.C. Molina, R. Ruiz-Sautua, J.M. Mendias and R. Hermida
Conventional high-level synthesis algorithms usually
employ multi-cycle operators to reduce the cycle length in
order to improve the circuit performance. These operators
need several cycles to execute one operation, but the
entire functional unit is not used in any cycle.
Additionally, the execution of operations over wider
multi-cycle operators is unfeasible if their results must be
available in a smaller number of cycles than the
functional unit delay. This forces the addition of new functional
resources to the datapath even if multi-cycle operators
are idle when the execution of the operation begins.
In this paper a new design technique to overcome the
restricted reusability of multi-cycle operators is
presented. It reduces the area of these functional units
allowing their internal reuse when executing one
operation. It also expands the possibilities of common
hardware sharing as it allows the partial use of multi-cycle
operators to calculate narrower operations faster
than the functional unit delay. This technique is applied as
an optimization phase at the end of the high-level
synthesis process, and can optimize the circuits
synthesized by any high-level synthesis tool.
-
Data-Flow Transformations Using Taylor Expansion Diagrams [p. 455]
-
M. Ciesielski, S. Askar, D. Gomez-Prado, J. Guillot and E. Boutillon
An original technique to transform functional representation
of the design into a structural representation in
form of a data flow graph (DFG) is described. A canonical,
word-level data structure, Taylor Expansion Diagram (TED),
is used as a vehicle to effect this transformation. The problem
is formulated as that of applying a sequence of decomposition
cuts to a TED that transforms it into a DFG optimized for a
particular objective. A systematic approach to arrive at such
a decomposition is described. Experimental results show that
such constructed DFG provides a better starting point for
architectural synthesis than those extracted directly from HDL
specifications.
-
Automatic Application Specific Floating-point Unit Generation [p. 461]
-
Y.J. Chong and S. Parameswaran
This paper describes the creation of custom floating point units (FPUs)
for Application Specific Instruction Set Processors (ASIPs). ASIPs
allow the customization of processors for use in embedded systems
by extending the instruction set, which enhances the performance of
an application or a class of applications. These extended instructions
are manifested as separate hardware blocks, making the creation of
any necessary floating point instructions quite unwieldy. On the other
hand, using a predefined FPU includes a large monolithic hardware
block with considerable number of unused instructions. A customized
FPU will overcome these drawbacks, yet the manual creation of one is
a time consuming, error prone process. This paper presents a methodology
for automatically generating floating-point units (FPUs) that
are customized for specific applications at the instruction level. Generated
FPUs comply with the IEEE754 standard, which is an advantage
over FP format customization. Custom FPUs were generated for
several Mediabench applications. Area savings over a fully-featured
FPU of 26%-80% without resource sharing
and 33%-87% with resource sharing were obtained. Clock period
increased in some cases by up to 9.5% due to resource sharing.
-
Time-Constrained Clustering for DSE of Clustered VLIW-ASP [p. 467]
-
M. Schölzel
In this paper we describe a new time-constrained clustering
algorithm. It is coupled with a time-constrained
scheduling algorithm and used for Design-Space-Exploration
(DSE) of clustered VLIW processors with heterogeneous
clusters and heterogeneous functional units. The
algorithm enables us to reduce the complexity of the DSE,
because the parameters of the VLIW are derived from the
clustered schedule of the considered application which is
produced during a single compilation step. Several compilations
of the same application with different VLIW-parameter
settings are not necessary. Our proposed algorithm
is integrated into a DSE-Tool in order to explore
the best parameters of a clustered VLIW processor for
several basic blocks of signal processing applications.
The obtained results are compared to the results of
Lapinskii's work and show that, for most benchmarks, we
are able to save ports in the register file of each cluster.
Organizer/Moderator: P. Liuha, Nokia, FI
-
Applications for Ubiquitous Computing and Communications [p. 473]
-
The session views some potential use cases and applications that use ubiquitous
computing and communications. The new aspects of these applications and the basic
design challenges and solutions will be addressed.
Moderators: L. Fanucci, Pisa U, IT; J. Gerlach, Robert Bosch GmbH, DE
-
Timing Simulation of Interconnected AUTOSAR Software-Components [p. 474]
-
M. Krause, O. Bringmann, A. Hergenhan, G. Tabanoglu and W. Rosenstiel
AUTOSAR is a recent specification initiative which
focuses on a model-driven-architecture-like methodology for
automotive applications. However, the engineering steps needed
to get from a logical architecture to a technical architecture
and then to an implementation are not yet well supported by
tools. In contrast, SystemC offers a comprehensive way
to simulate, analyze, and verify software. Furthermore, it is
even able to take the timing behavior of underlying hardware
and communication paths into account. Even at
first glance, there are many similarities with respect to the
modeling structure between the two concepts. Therefore,
this paper discusses approaches on how to use SystemC during
the design process of AUTOSAR-compliant systems.
-
FPGA-based Networking Systems for High Data-rate and Reliable In-vehicle Communications [p. 480]
-
S. Saponara, E. Petri, M. Tonarelli, I. Del Corona and L. Fanucci
The number of electronic systems introduced in vehicles is
continuously increasing: X-by-wire, complex electronic
control systems and above all future applications such as
automotive vision and safety warnings require in-car
reliable communication backbones with the capability to
handle large amounts of data at high speeds. To cope with
this issue and driven by the experience of aerospace
systems, the SpaceWire standard, recently proposed by the
European Space Agency (ESA), can be introduced in the
automotive field. The SpaceWire is a serial data link
standard which provides safety and redundancy and
guarantees to handle data-rates up to hundreds of Mbps.
This paper presents the design of configurable SpaceWire
router and interface hardware macrocells, the first
to comply with the newest standard extensions,
Protocol Identification (PID) and Remote Memory Access
Protocol (RMAP). The macrocells have been integrated
and tested on antifuse technology in the framework of an
ESA project. The achieved performance of a router with 8
links (130 Mbps data-rate, 1.5 W power cost) meets the
requirements of future automotive electronic systems. The
proposed networking solution simplifies the connectivity,
reducing also the relevant volume and mass budgets,
provides network safety and redundancy and guarantees
to handle very high bandwidth data flows not covered by
current standards such as CAN or FlexRay.
-
Low-g Accelerometer Fast Prototyping for Automotive Applications [p. 486]
-
F. D'Ascoli, F. Iozzi, C. Marino, M. Melani, M. Tonarelli, L. Fanucci, A. Giambastiani, A. Rocchi and M. De Marinis
This paper presents an application of the ISIF chip
(Intelligent Sensor InterFace), for conditioning a dual-axis
low-g accelerometer in MEMS technology.
MEMS are nowadays the standard in automotive
applications and beyond, as they feature a drastic
reduction in cost, area and power, while they require a
more complex electronic interface with respect to
traditional discrete devices. ISIF is a Platform On Chip
implementation that aims at fast prototyping of a wide range of
automotive sensors thanks to its high configuration
resources, achieved both by full analog / digital IPs
trimming options and by flexible routing structures.
This accelerometer implementation exploits a relevant
part of ISIF hardware resources, but also requires signal
processing add-ins (software emulation of digital DSP
blocks) for the closed loop conditioning architecture and
for performance improvement (for example temperature
drift compensation).
In spite of the short prototyping time, the resulting system
achieves good performance with respect to commercial
devices, featuring a 0.9 mg/√Hz noise density with 1024
LSB/g sensitivity on the digital output over a +/- 2g FS,
and an offset drift over 100°C range within 30 mg, with
2% of FS sensitivity drift. Miniboards have been
developed as product prototypes, consisting of a small
PCB with ISIF and accelerometer dies bonded together,
firmware embedded in EEPROM and communication
transceivers.
-
Using an Innovative SOC-level FMEA Methodology to Design in Compliance with IEC61508 [p. 492]
-
R. Mariani, G. Boschi and F. Colucci
This paper proposes an innovative methodology to
perform and validate a Failure Mode and Effects
Analysis (FMEA) at System-on-Chip (SoC) level. This is
done in compliance with the IEC 61508, an international
norm for the functional safety of electronic safety-related
systems, of which an overview is given in the paper. The
methodology is based on a theory to decompose a digital
circuit in "sensible zones" and a tool that automatically
extracts these sensible zones from the RTL description. It
also includes a spreadsheet to compute the metrics
required by the IEC norm, such as Diagnostic Coverage and
Safe Failure Fraction. The FMEA results are validated
by using another tool suite including a fault injection
environment. The paper explains how to benefit from
the information provided by such an approach and, as an
example, describes how the methodology has been
applied to design memory sub-systems to be used in fault
robust microcontrollers for automotive applications.
This methodology has been approved by TÜV-SÜD as
the flow to assess and validate the Safe Failure Fraction
of a given SoC in adherence to IEC 61508.
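The two metrics named above follow the usual IEC 61508 definitions: Diagnostic Coverage is the share of dangerous failures that are detected, and Safe Failure Fraction is the share of all failures that are either safe or dangerous but detected. The sketch below shows the arithmetic only; the per-zone dictionary layout, the key names and the function name are assumptions for illustration and do not reproduce the authors' spreadsheet.

```python
def safety_metrics(zones):
    """zones: list of dicts holding per-zone failure rates (e.g. in FIT).

    Expected keys (hypothetical layout): 'safe', 'dangerous_detected',
    'dangerous_undetected'. Returns (diagnostic_coverage, safe_failure_fraction).
    """
    safe = sum(z["safe"] for z in zones)
    dd = sum(z["dangerous_detected"] for z in zones)
    du = sum(z["dangerous_undetected"] for z in zones)
    if dd + du == 0:
        raise ValueError("no dangerous failures: diagnostic coverage is undefined")
    dc = dd / (dd + du)                    # DC  = lambda_DD / lambda_D
    sff = (safe + dd) / (safe + dd + du)   # SFF = (lambda_S + lambda_DD) / (lambda_S + lambda_D)
    return dc, sff
```

For example, two zones whose rates sum to 50 safe, 90 dangerous detected and 10 dangerous undetected give a Diagnostic Coverage of 0.9 and a Safe Failure Fraction of 140/150. The rates themselves are made-up inputs, not figures from the paper.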
-
Using Partial-Run-Time Reconfigurable Hardware to Accelerate Video Processing in Driver Assistance Systems [p. 498]
-
C. Claus, J. Zeppenfeld, F. Müller and W. Stechele
In this paper we show a reconfigurable hardware architecture
for the acceleration of video-based driver assistance
applications in future automotive systems. The concept is
based on a separation of pixel-level operations and high
level application code. Pixel-level operations are accelerated
by coprocessors, whereas high level application code
is implemented fully programmable on standard PowerPC
CPU cores to allow flexibility for new algorithms. In addition,
the application code is able to dynamically reconfigure
the coprocessors available on the system, allowing for a
much larger set of hardware accelerated functionality than
would normally fit onto a device. This process makes use
of the partial dynamic reconfiguration capabilities of Xilinx
Virtex FPGAs.
-
Towards a Methodology for the Quantitative Evaluation of Automotive Architectures [p. 504]
-
P. Popp, M. Di Natale, P. Giusto, S. Kanajan and C. Pinello
Architecture design is a critical stage of the
Electronics/Controls/Software (ECS) -based vehicle design
flow. Traditional approaches relying on component-level
design and analysis are no longer effective as they do not
always allow for the quantitative evaluation of properties
arising from the composition of subsystems. This paper
presents a system level architecture design methodology that
is supported by tools and methods for the quantitative
evaluation of key metrics of interest related to timing,
dependability and cost. An example of its application to a by-wire
system case study is presented, and the challenges faced
in its application in the context of the actual development
process are discussed.
Moderators: H. Obermeir, Infineon Technologies AG, DE; B. Straube, FhG IIS/EAS Dresden, DE
-
Dynamic Learning Based Scan Chain Diagnosis [p. 510]
-
Y. Huang
Scan chain defect diagnosis is important to silicon
debug and yield enhancement. Traditional simulation-based
chain diagnosis algorithms may take a long run
time if a large number of simulations are required. In
this paper, a novel dynamic learning based scan chain
diagnosis is proposed to speed up the diagnosis run
time. Experimental results illustrate that by using the
proposed dynamic learning techniques, the diagnosis
run time can be reduced by about 10X on average.
-
Diagnosis, Modeling and Tolerance of Scan Chain Hold-Time Violations [p. 516]
-
O. Sinanoglu and P. Schremmer
Errors in the timing closure process during the physical design stage
may result in systematic silicon failures, such as scan chain hold
time violations, which prohibit the test of manufactured chips. In
this paper, we propose a set of techniques that enable the accurate
pinpointing of hold time violating scan cells, their modeling and
tolerance, paving the way for the generation of valid test data that
can be used to test chips with such systematic failures. The process
yield is thus restored, as chips that are functional in mission
mode can still be identified and shipped out, despite the existence
of scan chain hold time failures. The techniques that we propose
are non-intrusive, as they utilize only basic scan capabilities, and
thus impose no design changes. Scan cells with hold time violations
can be identified with maximal possible resolution, enabling
the incorporation of the associated impact during the ATPG process
and thus the generation of valid test data for the chips with
such systematic failures.
-
On Test Generation by Input Cube Avoidance [p. 522]
-
I. Pomeranz and S.M. Reddy
Test generation procedures attempt to assign values
to the inputs of a circuit so as to detect target faults. We
study a complementary view whereby the goal is to identify
values that should not be assigned to inputs in order
not to prevent faults from being detected. We describe a
procedure for computing input cubes (or incompletely
specified input vectors) that should be avoided during test
generation for target faults. We demonstrate that avoiding
such input cubes leads to the detection of target faults
after the application of limited numbers of random input
vectors. This indicates that explicit test generation is not
necessary once certain input values are precluded. Potential
uses of the computed input cubes are in a test generation
procedure to reduce the search space, and during
built-in test generation to preclude input vectors that will
not lead to the detection of target faults.
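The use of avoided cubes during random pattern generation can be sketched as a simple rejection filter. The cube notation (strings over `0`, `1` and `x` for unspecified inputs), the function names and the attempt limit are assumptions for illustration; how the cubes are computed is the subject of the paper and is not shown here.

```python
import random


def matches(vector, cube):
    """True if the fully specified vector is covered by the cube ('x' = don't care)."""
    return all(c == "x" or c == v for v, c in zip(vector, cube))


def random_vectors_avoiding(cubes, n_inputs, count, seed=0, max_attempts=100000):
    """Draw random input vectors, discarding any that fall inside an avoided cube."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(max_attempts):
        if len(vectors) == count:
            break
        vec = "".join(rng.choice("01") for _ in range(n_inputs))
        if not any(matches(vec, cube) for cube in cubes):
            vectors.append(vec)
    if len(vectors) < count:
        raise RuntimeError("avoided cubes leave too little of the input space")
    return vectors
```

Each avoided cube removes a whole subspace of input vectors at once, which is why a small set of cubes can noticeably raise the chance that a random vector detects a target fault.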
-
Slow Write Driver Faults in 65nm SRAM Technology: Analysis and March Test Solution [p. 528]
-
A. Ney, P. Girard, C. Landrault, S. Pravossoudovitch, A. Virazel and M. Bastian
This paper presents an analysis of the electrical origins
of Slow Write Driver Faults (SWDFs) [1] that may affect
SRAM write drivers in 65nm technology. This type of fault
is the consequence of resistive-open defects in the control
part of the write driver. It involves an erroneous write
operation when the same write driver performs two
successive write operations with opposite data values. In
the first part of the paper, we present the SWDF electrical
phenomena and their consequences on the SRAM
functioning. Next, we show how SWDFs can be sensitized
and observed and how a standard March test is able to
detect this type of fault.
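A toy behavioural model can show why a March test is able to catch this fault. The sketch below assumes a single write driver shared by all cells and models the fault as a write that fails whenever the driver's previous write carried the opposite value, regardless of intervening reads. Both assumptions are simplifications of the electrical behaviour analysed in the paper, and March C- is used here only as one example of a standard March test.

```python
class Sram:
    """Toy SRAM with one shared write driver and an optional slow-write-driver fault."""

    def __init__(self, size, slow_write_driver=False):
        self.cells = [0] * size
        self.slow = slow_write_driver
        self.last_written = None  # value of the driver's previous write

    def write(self, addr, value):
        opposite = self.last_written is not None and self.last_written != value
        self.last_written = value
        if self.slow and opposite:
            return  # assumed fault effect: the cell keeps its old content
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]


# March C-: each element is (address order, operations applied to every cell).
MARCH_C_MINUS = [
    ("up", ["w0"]),
    ("up", ["r0", "w1"]),
    ("up", ["r1", "w0"]),
    ("down", ["r0", "w1"]),
    ("down", ["r1", "w0"]),
    ("up", ["r0"]),
]


def run_march(mem, size, test=MARCH_C_MINUS):
    """Return True if every read sees its expected value, False on the first mismatch."""
    for direction, ops in test:
        addrs = range(size) if direction == "up" else range(size - 1, -1, -1)
        for addr in addrs:
            for op in ops:
                value = int(op[1])
                if op[0] == "w":
                    mem.write(addr, value)
                elif mem.read(addr) != value:
                    return False
    return True
```

In this model the first `w1` of the second element follows a `w0` from the first element, so it fails and leaves a 0 behind; the `r1` of the third element then exposes the fault.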
-
On Power-profiling and Pattern Generation for Power-safe Scan Tests [p. 534]
-
V.R. Devanathan, C.P. Ravikumar and V. Kamakoti
With increasing use of low cost wire-bond packages for mobile
devices, excessive dynamic IR-drop may cause tests to fail on the
tester. Identifying and debugging such scan test failures is a very
complex and effort-intensive process. A better solution is to generate
correct-by-construction "power-safe" patterns. Moreover, with glitch
power contributing to a significant component of dynamic power,
pattern generation needs to be timing-aware to minimize glitching.
In this paper, we propose a timing-based, power and layout-aware
pattern generation technique that minimizes both global and localized
switching activity. Techniques are also proposed for power-profiling
and optimizing an initial pattern set to obtain a power-safe pattern
set, with the addition of minimal patterns. The proposed technique
also comprehends irregular power grid topologies for constraints
on localized switching activity. Experiments on ISCAS benchmark
circuits reveal the effectiveness of the proposed scheme.
-
Automatic Test Pattern Generation for Maximal Circuit Noise in Multiple Aggressor Crosstalk Faults [p. 540]
-
K.P. Ganeshpure and S. Kundu
Decreasing process geometries and increasing operating
frequencies have made VLSI circuits more susceptible to
signal integrity related failures. Capacitive crosstalk is one
of the causes of such failures. A crosstalk fault results
from the switching of neighboring lines that are capacitively
coupled. Long nets are more susceptible to crosstalk faults
because they tend to have a higher coupling capacitance to
overall capacitance ratio. A typical long net has multiple
aggressors. In generating patterns to create maximal
crosstalk noise, it may not be possible to activate all
aggressors at the same time. Therefore, pattern generation
must focus on activating a maximal subset of aggressors
weighted by actual coupling capacitance values. This is a
variant of the max-satisfiability problem. Unlike a traditional
max-satisfiability problem, here we must deal with signal
propagation to an observable output. In this paper, we
present a novel solution that combines 0-1 Integer Linear
Program (ILP) with traditional stuck-at fault ATPG. The
maximal aggressor activation is formulated as a linear
programming problem while the fault effect propagation is
treated as an ATPG problem. The problems are separated
by a min-cut circuit partitioning technique based on the
Kernighan-Lin-Fiduccia-Mattheyses (KLFM) method. The
proposed technique was applied to ISCAS 85 benchmark
circuits. Results indicated that 75-100% of the aggressors
could be switched for generating crosstalk noise while
satisfying the requirement of sensitizing a path to the output.
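The aggressor-activation half of this problem can be illustrated with a toy sketch. It is not the authors' ILP formulation: it swaps in brute-force enumeration of input patterns, ignores fault-effect propagation to an output, and all names, input constraints and capacitance values below are hypothetical.

```python
from itertools import product

def best_aggressor_pattern(num_inputs, aggressors):
    """Enumerate all input patterns and keep the one that activates the
    aggressor subset with the largest total coupling capacitance.

    aggressors: list of (coupling_cap, required) pairs, where `required`
    maps an input index to the value (0 or 1) needed to switch that
    aggressor. Both fields are illustrative assumptions.
    """
    best_total, best_pattern, best_active = -1, None, []
    for pattern in product((0, 1), repeat=num_inputs):
        # An aggressor is active when every input it depends on matches.
        active = [
            i for i, (_, required) in enumerate(aggressors)
            if all(pattern[k] == v for k, v in required.items())
        ]
        total = sum(aggressors[i][0] for i in active)
        if total > best_total:
            best_total, best_pattern, best_active = total, pattern, active
    return best_total, best_pattern, best_active

# Hypothetical victim net with four aggressors (capacitances in fF).
AGGRESSORS = [
    (5, {0: 1}),
    (3, {0: 0, 1: 1}),
    (2, {1: 1, 2: 0}),
    (4, {2: 1}),
]

total, pattern, active = best_aggressor_pattern(3, AGGRESSORS)
print(total, pattern, active)
```

Not all four aggressors can switch together here because their input requirements conflict, which is why the objective is a weighted maximal subset rather than full activation. Enumeration is exponential in the number of inputs, so it only serves to show the objective.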
Moderators: V. Narayanan, Penn State U, US; C. Guiducci, Bologna U, IT
-
Temperature-aware NBTI Modeling and the Impact of Input Vector Control on Performance Degradation [p. 546]
-
Y. Wang, H. Luo, K. He, R. Luo, H. Yang and Y. Xie
As technology scales, Negative Bias Temperature Instability
(NBTI), which causes temporal performance degradation in digital
circuits by affecting PMOS threshold voltage, is emerging as
one of the major circuit reliability concerns. In this paper, we
first investigate the impact of NBTI on PMOS devices and propose
a novel temporal performance degradation model for digital circuits
considering the temperature difference between active and
standby mode. For the first time, the impact of input vector control
(to minimize standby leakage) on the NBTI is investigated. Minimum
leakage vectors, which lead to minimum circuit performance
degradation and maintain the maximum leakage reduction rate, are selected
and used during the standby mode. Furthermore, the potential
to reduce circuit performance degradation by internal node
control techniques during circuit standby mode is discussed. Our
simulation results show that: 1) the active and standby time ratio
and the standby mode temperature have considerable impact on
the circuit performance degradation; 2) the NBTI-aware IVC technique
leads to an average 3% savings of the total circuit degradation;
while the potential of internal node control may lead to 10%
savings of the total circuit degradation.
-
A Cross-Referencing-Based Droplet Manipulation Method for High-Throughput and
Pin-Constrained Digital Microfluidic Arrays [p. 552]
-
T. Xu and K. Chakrabarty
Digital microfluidic biochips are revolutionizing high-throughput
DNA sequencing, immunoassays, and clinical diagnostics. As
high-throughput bioassays are mapped to digital microfluidic
platforms, the need for design automation techniques for
pin-constrained biochips is being increasingly felt. However, most
prior work on biochips CAD has assumed independent control of
the underlying electrodes using a large number of (electrical) input
pins. We propose a droplet manipulation method based on a
"cross-referencing" addressing method that uses "rows" and
"columns" to access electrodes. By mapping the droplet movement
problem to the clique partitioning problem from graph theory, the
proposed method allows simultaneous movement of a large number
of droplets on a microfluidic array. This in turn facilitates
high-throughput applications on a pin-constrained biochip. We use
random synthetic benchmarks and a set of multiplexed bioassays to
evaluate the proposed method.
-
Reversible Circuit Technology Mapping from Non-reversible Specifications [p. 558]
-
Z. Zilic, K. Radecka and A. Khazamiphur
This paper considers the synthesis of reversible circuits directly from an
irreversible specification, with no need for producing a reversible embedding
first. We present a feasible methodology for realizing the networks of
reversible gates, in a manner that builds on the classical technology mapping.
We do not limit ourselves to the restricted notion of realizing permutation
functions, and construct reversible implementations where extraneous signals are
efficiently reused for overcoming the inherent fanout limitation.
-
Distributed Power-Management Techniques for Wireless Network Video Systems [p. 564]
-
N. H. Zamora, J.-C. Kao and R. Marculescu
Wireless sensor networks operating on limited energy
resources need to be power efficient to extend the system
lifetime. This is especially challenging for video sensor
networks due to the large volumes of data they need to
process in short periods of time. Towards this end, this paper
proposes two coordinated power management policies for
video sensor networks. These policies are scalable as the
system grows and flexible to video parameters and network
characteristics. In addition to simulation results, our
prototype demonstrates the feasibility of implementing these
policies. Finally, the analytical framework we provide gives
an upper bound for the achievable sleep fraction and insight
into how adjusting select parameters will affect the
performance of the power management policies.
-
Improving the Fault Tolerance of Nanometric PLA Designs [p. 570]
-
F. Angiolini, M.H. Ben Jamaa, D. Atienza, L. Benini, and G. De Micheli
Several alternative building blocks have been proposed to replace
planar transistors, among which a prominent spot belongs to nanometric
filaments such as Silicon NanoWires (SiNWs) and Carbon
NanoTubes (CNTs). However, chips leveraging these nanoscale
structures are expected to be affected by a large number of manufacturing
faults, far beyond what chip architects have learned
to counter. In this paper, we show a design flow, based on software
mapping algorithms, to improve the yield of nanometric Programmable
Logic Arrays (PLAs). While further improvements to
the manufacturing technology will be needed to make these devices
fully usable, our flow can significantly shrink the gap between current
and desired yield levels. Also, our approach does not need
post-fabrication functional analysis and mapping, therefore dramatically
cutting verification costs. We check PLA yields by means
of an accurate analyzer after Monte Carlo fault injection. We show
that, compared to a baseline policy of wire replication, we achieve
equal or better yields (8% over a set of designs) depending on the
underlying defect assumptions.
-
Techniques for Designing Noise-Tolerant Multi-Level Combinational Circuits [p. 576]
-
K. Nepal, R.I. Bahar, J. Mundy, W.R. Patterson and A. Zaslavsky
As CMOS technology downscales, higher noise levels, wider
threshold variation, and low supply voltage will force designers
to contend with high rates of soft logical errors and many defective
devices. A probabilistic design framework based on Markov
random fields (MRF) has been previously proposed to address dynamic
fault and noise vulnerability of ultimate digital CMOS circuitry.
The idea is to use additional transistors and feedback loops
to achieve significant noise immunity and ensure correct logic operations
at low VDD. However, the extra reliability achieved in
previously published work came at a cost of high transistor counts.
In this paper, we present techniques to reduce the transistor count
of larger multi-level combinational circuits built within the MRF
framework by using variable sharing, implied dependence and supergates.
Using these techniques we show an average reduction
of approximately 28% in transistor counts over a range of combinational
benchmark circuits built within the MRF framework compared
to the best previously published results.
Moderators: T. Austin, U of Michigan, US; B. Calder, Microsoft, US
-
An Efficient Code Compression Technique Using Application-Aware Bitmask and Dictionary
Selection Methods [p. 582]
-
S.-W. Seong and P. Mishra
Memory plays a crucial role in designing embedded systems.
A larger memory can accommodate more and larger applications
but increases cost, area, as well as energy requirements.
Code compression techniques address this problem by reducing
the size of the applications. While early work on bitmask-based
compression has proposed several promising ideas, many challenges
remain in applying them to embedded system design.
This paper makes two important contributions to address these
challenges by developing application-specific bitmask selection
and bitmask-aware dictionary selection techniques. We applied
these techniques for code compression of TI and MediaBench
applications to demonstrate the usefulness of our approach.
-
Optimizing Instruction-set Extensible Processors under Data Bandwidth Constraints [p. 588]
-
K. Atasu, R.G. Dimond, O. Mencer, W. Luk, C. Özturan and G. Dündar
We present a methodology for generating optimized architectures
for data bandwidth constrained extensible processors.
We describe a scalable Integer Linear Programming
(ILP) formulation that extracts the most profitable set
of instruction-set extensions given the available data bandwidth
and transfer latency. Unlike previous approaches,
we differentiate between the number of inputs and outputs for
instruction-set extensions and the number of register file
ports. This differentiation makes our approach applicable
to architectures that include architecturally visible state
registers and dedicated data transfer channels. We support
a comprehensive design space exploration to characterize
the area/performance trade-offs for various applications.
We evaluate our approach using actual ASIC implementations
to demonstrate that our automatically customized processors
meet timing within the target silicon area. For an
embedded processor with only two register read ports and
one register write port, we obtain up to 4.3x speed-up with
extensions incurring only a 35% area overhead.
-
Resource Prediction for Media Stream Decoding [p. 594]
-
J. Hamers and L. Eeckhout
Resource prediction refers to predicting required compute
power and energy resources for consuming a service
on a device. Resource prediction is extremely useful in a
client-server setup where the client requests a media service
from the server or content provider. The content provider (in
cooperation with the client) can then determine what service
quality to deliver given the client's available resources.
This paper proposes a practical approach to predicting
resources for decoding media streams. The idea is to group
frames with similar decode complexity from various media
streams in the content provider's database into so-called
scenarios. Client profiling using scenario representatives
characterizes the client's computational power. This enables
the content provider to predict decode time, decode
energy and quality of service for a media stream of
interest once deployed on the client.
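A minimal sketch of the scenario idea, under the assumption (not stated in the abstract) that a stream's decode time is estimated by summing, per frame, the time the client measured for that frame's scenario representative. The scenario labels and timings are hypothetical.

```python
def predict_decode_time(frame_scenarios, client_profile):
    """Estimate total decode time for a stream on a given client.

    frame_scenarios: scenario id of each frame, as labeled on the
        content provider's side (hypothetical labels).
    client_profile: scenario id -> decode time in ms measured on the
        client for that scenario's representative frame.
    """
    return sum(client_profile[scenario] for scenario in frame_scenarios)

# Hypothetical client profile and stream labeling.
CLIENT_PROFILE = {"low": 2.0, "mid": 5.0, "high": 9.0}
STREAM = ["low", "low", "high", "mid"]

print(predict_decode_time(STREAM, CLIENT_PROFILE))
```

The client only has to decode one representative per scenario, while the per-frame labeling is done once on the provider's side, which is what makes the prediction cheap for a new stream.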
-
Register Pointer Architecture for Efficient Embedded Processors [p. 600]
-
J.S. Park, S.-B. Park, J.D. Balfour, D. Black-Schaffer, C. Kozyrakis and W.J. Dally
Conventional register file architectures cannot optimally
exploit temporal locality in data references due to their limited
capacity and static encoding of register addresses in
instructions. In conventional embedded architectures, the
register file capacity cannot be increased without resorting
to longer instruction words. Similarly, loop unrolling is often
required to exploit locality in the register file accesses
across iterations because naming registers statically is inflexible.
Both optimizations lead to significant code size increases,
which is undesirable in embedded systems.
In this paper, we introduce the Register Pointer Architecture
(RPA), which allows registers to be accessed indirectly
through register pointers. Indirection allows a larger register
file to be used without increasing the length of instruction
words. Additional register file capacity allows many
loads and stores, such as those introduced by spill code, to
be eliminated, which improves performance and reduces energy
consumption. Moreover, indirection affords additional
flexibility in naming registers, which reduces the need to
apply loop unrolling in order to maximize reuse of register
allocated variables.
-
Feasibility of Combined Area and Performance Optimization for Superscalar Processors Using
Random Search [p. 606]
-
S. Van Haastregt and P.M.W. Knijnenburg
When designing embedded systems, one needs to make
decisions concerning the different components that will be
included in a microprocessor. An important issue is the chip
area vs. performance trade-off. In this paper we investigate
the relationship between chip area and performance for superscalar
microprocessors. We investigate the feasibility of
obtaining a suitable configuration by searching. We show that
our approach gives a good configuration after 100 to 150
iterations using a simple random search algorithm. This
shows the feasibility of our approach, in particular when
more sophisticated search algorithms are employed as we
plan in future work.
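A simple random search of the kind described might look like the sketch below. The configuration parameters and the cost function are hypothetical stand-ins; the abstract does not give the authors' design space, area model or performance model.

```python
import random

# Hypothetical design space for a superscalar core.
SEARCH_SPACE = {
    "issue_width": [1, 2, 4, 8],
    "rob_entries": [16, 32, 64, 128],
    "l1_kb": [8, 16, 32, 64],
}

def toy_cost(config):
    """Toy area-delay product (lower is better). A real flow would call
    an area estimator and a cycle-level simulator here instead."""
    area = (config["issue_width"] * 2.0
            + config["rob_entries"] / 16
            + config["l1_kb"] / 8)
    cycles = (100.0 / config["issue_width"]
              + 400.0 / config["rob_entries"]
              + 200.0 / config["l1_kb"])
    return area * cycles

def random_search(cost, space, iterations=150, seed=0):
    """Sample configurations uniformly at random and keep the best one."""
    rng = random.Random(seed)
    best_config, best_cost = None, float("inf")
    for _ in range(iterations):
        candidate = {name: rng.choice(values) for name, values in space.items()}
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:
            best_config, best_cost = candidate, candidate_cost
    return best_config, best_cost

best_config, best_cost = random_search(toy_cost, SEARCH_SPACE)
print(best_config, best_cost)
```

The default of 150 iterations mirrors the 100 to 150 iterations quoted in the abstract; on this toy space of 64 points it says nothing about convergence on a realistic design space.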
-
A Decoupled Architecture of Processors with Scratch-Pad Memory Hierarchy [p. 612]
-
A. Milidonis, N. Alachiotis, V. Porpodas, H. Michail, A.P. Kakarountas and C.E. Goutis
We present a decoupled architecture of processors with a
memory hierarchy of only scratch-pad memories, and a
main memory. The decoupled architecture also exploits the
parallelism between address computation and processing
the application data. The application code is split into two
programs: the first for computing the addresses of the data
in the memory hierarchy and the second for processing the
application data. The first program is executed by one of
the decoupled processors called Access which uses
compiler methods for placing data in the memory
hierarchy. In parallel, the second program is executed by
the other processor called Execute. The synchronization of
the memory hierarchy and the Execute processor is
achieved through a simple handshake protocol. The Access
processor requires strong communication with the memory
hierarchy which strongly differentiates it from traditional
uniprocessors. The architecture is compared in
performance with the MIPS IV architecture of SimpleScalar
and with the existing decoupled architectures showing its
higher normalized performance. Experimental results show
that the performance is increased by up to 3.7 times.
Compared with MIPS IV the proposed architecture
achieves the above performance with insignificant
overheads in terms of area.
Moderators: A.J. Acosta, Seville U/IMSE, ES; B.C. Paul, Toshiba, US
-
An Algorithm to Minimize Leakage through Simultaneous Input Vector Control and Circuit Modification [p. 618]
-
N. Jayakumar and S.P. Khatri
Leakage power currently comprises a large fraction of
the total power consumption of an IC. Techniques to minimize
leakage have been researched widely. In this paper,
we present an approach which minimizes leakage by simultaneously
modifying the circuit while deriving the input vector
that minimizes leakage. In our approach, we selectively
modify a gate so that its output (in sleep mode) is in a state
which helps minimize the leakage of other gates in its transitive
fanout. Gate replacement is performed in a slack-aware
manner, to minimize the resulting delay penalty.
-
Understanding Voltage Variations in Chip Multiprocessors Using a Distributed Power-Delivery Network [p. 624]
-
M.S. Gupta, J.L. Oatley, R. Joseph, G.-Y. Wei and D.M. Brooks
Recent efforts to address microprocessor power
dissipation through aggressive supply voltage scaling and power
management require that designers be increasingly cognizant of
power supply variations. These variations, primarily due to fast
changes in supply current, can be attributed to architectural
gating events that reduce power dissipation. In order to study
this problem, we propose a fine-grain, parameterizable model
for power-delivery networks that allows system designers to
study localized, on-chip supply fluctuations in high-performance
microprocessors. Using this model, we analyze voltage variations
in the context of next-generation chip-multiprocessor (CMP)
architectures using both real applications and synthetic current
traces. We find that the activity of distinct cores in CMPs presents
several new design challenges when considering power supply
noise, and we describe potentially problematic activity sequences
that are unique to CMP architectures.
-
Process Variation Tolerant Low Power DCT Architecture [p. 630]
-
N. Banerjee, G. Karakonstantis and K. Roy
2-D Discrete Cosine Transform (DCT) is widely used as
the core of digital image and video compression. In this paper, we
present a novel DCT architecture that allows aggressive voltage
scaling by exploiting the fact that not all intermediate computations
are equally important in a DCT system to obtain "good" image
quality with Peak Signal to Noise Ratio (PSNR) > 30 dB. This
observation has led us to propose a DCT architecture where the
signal paths that are less contributive to PSNR improvement are
designed to be longer than the paths that are more contributive to
PSNR improvement. It should also be noted that robustness with
respect to parameter variations and low power operation typically
impose contradictory requirements in terms of architecture design.
However, the proposed architecture lends itself to aggressive
voltage scaling for low-power dissipation even under process
parameter variations. Under a scaled supply voltage and/or
variations in process parameters, any possible delay errors would
only appear from the long paths that are less contributive towards
PSNR improvement, providing large improvement in power
dissipation with small PSNR degradation. Results show that even
under large process variation and supply voltage scaling (0.8V),
there is a gradual degradation of image quality with considerable
power savings (62.8%) for the proposed architecture when
compared to existing implementations in 70 nm process technology.
-
Statistical Dual-Vdd Assignment for FPGA Interconnect Power Reduction [p. 636]
-
Y. Lin and L. He
Field programmable dual-Vdd interconnects are effective in
reducing FPGA power. However, the deterministic Vdd assignment
leverages timing slack exhaustively and significantly
increases the number of near-critical paths, which results in
a degraded timing yield with process variation. In this paper,
we present two statistical Vdd assignment algorithms.
The first greedy algorithm is based on sensitivity while the
second one is based on timing slack budgeting. Both minimize
chip-level interconnect power without degrading timing
yield. Evaluated with MCNC circuits, the statistical algorithms
reduce interconnect power by 40% compared to the
single-Vdd FPGA with power gating. In contrast, the deterministic
algorithm reduces interconnect power by 51% but
degrades timing yield from 97.7% to 87.5%.
Moderators: K. Goossens, NXP Semiconductors Research, NL; B. Candaele, Thales Communications, FR
-
Hardware Scheduling Support in SMP Architectures [p. 642]
-
A.C. Nácul, F. Regazzoni and M. Lajolo
In this paper we propose a hardware real time operating
system (HW-RTOS) that implements the OS layer in a
dual-processor SMP architecture. Intertask communication
is specified by means of dedicated APIs and the HW-RTOS
takes care of the communication requirements of the application
and also implements the task scheduling algorithm.
The HW-RTOS allows smaller footprints, since it
avoids the need to link traditional software RTOS libraries
into the final executables. Moreover, the HW-RTOS is able
to exploit the easy task migration feature provided by an
SMP architecture much more efficiently than a traditional
software RTOS, due to its faster execution, and we show
how this significantly exceeds the performance achievable
with optimal static task partitioning between two processors.
Preliminary results show that the hardware overhead
in a dual processor architecture is less than 20K gates.
-
A Scalable, Timing-Safe, Network-on-Chip Architecture with an Integrated Clock Distribution Method [p. 648]
-
T. Bjerregaard, M.B. Stensgaard and J. Sparsø
Growing system sizes together with increasing performance
variability are making globally synchronous operation
hard to realize. Mesochronous clocking constitutes
a possible solution to the problems faced. The most fundamental
of problems faced when communicating between
mesochronously clocked regions concerns the possibility
of data corruption caused by metastability. This paper
presents an integrated communication and mesochronous
clocking strategy, which avoids timing related errors while
maintaining a globally synchronous system perspective.
The architecture is scalable as timing integrity is based
purely on local observations. It is demonstrated with a
90 nm CMOS standard cell network-on-chip design which
implements completely timing-safe, global communication
in a modular system.
-
Butterfly and Benes-Based On-Chip Communication Networks for Multiprocessor Turbo Decoding [p. 654]
-
H. Moussa, O. Muller, A. Baghdadi and M. Jezequel
Several research activities have recently emerged
aiming to propose multiprocessor implementations in order
to achieve flexible and high throughput parallel iterative
decoding. Besides application algorithm optimizations and
application-specific instruction-set processor design, the on-chip
communication network constitutes a major issue in
this application domain. In this paper, we propose to use
multistage interconnection networks as on-chip
communication networks for parallel turbo decoding.
Adapted Benes and Butterfly networks are proposed with
detailed hardware implementation of network interfaces,
routers, and topologies. In addition, appropriate packet
format and routing for interleaved/deinterleaved extrinsic
information exchanges are proposed. The flexibility of these
on-chip communication networks enables their use for all
turbo code standards and constitutes a promising feature for
their reuse for any similar interleaved/deinterleaved
iterative communication profile.
-
Capturing the Interaction of the Communication, Memory and I/O Subsystems in Memory-Centric
Industrial MPSoC Platforms [p. 660]
-
S. Medardoni, M. Ruggiero, D. Bertozzi, L. Benini, G. Strano and C. Pistritto
Industrial MPSoC platforms exhibit increasing communication needs while not yet
reverting to revolutionary solutions such as networks-on-chip. On one hand, the
limited scalability of shared busses is being overcome by means of multi-layer
communication architectures, which are stressing the role of bridges as key
contributors to system performance. On the other hand, technology limitations,
data footprint and cost constraints lead to platform instantiations with only
few on-chip memory devices and with a global performance bottleneck: the memory
controller for access to the off-chip SDRAM memory. The complex interaction among
system components and the dependency of macroscopic performance metrics on
fine-grain architectural features stress the importance of highly accurate
modelling and analysis tools. This paper takes its steps from an extensive
modelling effort of a complete industrial MPSoC platform for consumer electronics,
including the off-chip memory sub-system. Based on this, relevant design issues
concerning the communication, memory and I/O architecture and their interaction
are addressed, resulting in guidelines for designers of industry-relevant
MPSoCs.
Organizer/Moderator: P. Liuha, Nokia, FI
-
Cost-Aware Capacity Optimization in Dynamic Multi-Hop WSNs [p. 666]
-
J. Suhonen, M. Kohvakka, M. Kuorilehto, M. Hännikäinen, and T.D. Hämäläinen
Low energy consumption and load balancing are required
for enhancing lifetime in Wireless Sensor Networks
(WSNs). In addition, network dynamics and different delay,
throughput, and reliability requirements demand cost-aware
traffic adaptation. This paper presents a novel capacity
optimization algorithm targeted at locally synchronized,
low-duty cycle WSN MACs. The algorithm balances
the traffic load between contention and contention free
channel access. The energy-inefficient contention access
is avoided, whereas the more reliable contention free access
is preferred. The algorithm allows cost-aware
trade-offs between delay, energy-efficiency, and throughput,
guided by the routing layer. Analysis results show that the algorithm
has 10% to 100% better energy-efficiency than IEEE
802.15.4 LR-WPAN in a typical sensing application, while
providing comparable goodput and delay.
-
Design Methods for Security and Trust [p. 672]
-
I. Verbauwhede and P. Schaumont
The design of ubiquitous and embedded computers
focuses on cost factors such as area, power-consumption,
and performance. Security and trust properties, on the
other hand, are often an afterthought. Yet the purpose of
ubiquitous electronics is to act and negotiate on their
owner's behalf, and this makes trust a first-order concern.
We outline a methodology for the design of secure and
trusted electronic embedded systems, which builds on
identifying the security-sensitive part of a system (the
root-of-trust) and iteratively partitioning and protecting
that root-of-trust over all levels of design abstraction.
This includes protocols, software, hardware, and circuits.
We review active research in the area of secure design
methodologies.
-
Emerging Solutions Technology and Business Views for the Ubiquitous Communication [p. 678]
-
H. Huomo
The presentation will cover a short historical overview of the ubiquitous
communication research Dr Huomo led while at Nokia. This research
program led to the development of the short range radio technology which is now
known as Wibree and the touch based service discovery technology now known as NFC.
The current key use cases of NFC and its future development directions will be
covered.
Moderators: L. Fanucci, Pisa U, IT; A. Reutter, Robert Bosch GmbH, DE
-
Development of on Board, Highly Flexible, Galileo Signal Generator ASIC [p. 679]
-
L. Baguena, E. Liégeon, A. Bepoix, J. M. Dusserre, C. Oustric, P. Bellocq and V. Heiries
Alcatel Alenia Space is deeply involved in the Galileo
program at many stages. In particular, Alcatel Alenia
Space has successfully designed and delivered the very
first navigation signal generator, based on a 0.35μm
Atmel ASIC technology, which was launched on the
satellite demonstrator GIOVE-A in December 2005.
The Galileo project is now in a second phase including
the development of four of the thirty satellites of the final
constellation. The new navigation signal generator
requires both high performance and high flexibility
(various waveforms to cope with the different Galileo
services: open, commercial, governmental ...) for a very
long life time system. Besides, the challenge is increased
due to the specific space constraints such as mass, volume
and power consumption. These requirements will be
achieved through the implementation of a 3 million gates
ASIC in a 0.18μm European Radiation Tolerant Atmel
technology.
This paper will, after a brief description of the Galileo
system, present the constraints of the space environment and
the technology challenges. It will then present the ASIC and
the development flow of this project, emphasizing the
up-to-date tools that have been used (architectural synthesis,
physical synthesis). A conclusion will then be drawn on
the requirements on technology and tools for the space
domain.
-
New Safety Critical Radio Altimeter for Airbus and Related Design Flow [p. 684]
-
D. Hairion, S. Emeriau, E. Combot and M. Sarlotte
The latest generation of the ERT560 Digital Radio
Altimeter (DRA) developed for the Airbus A380 is
the result of Thales' 40 years of experience. Over
40,000 radio-altimeters have been produced over
that period based on dual technology, meeting the
stringent requirements of civil aircraft. This
new version takes advantage of FPGA
technology to implement the main processing of the
equipment.
The present article introduces the main
capabilities of the ERT560 product and focuses on the
FPGA, which is the key element of the safety-critical
analysis of the radio-altimeter. Then the paper
presents the application of the new "Design
Assurance Guidance for Airborne Electronic
Hardware" (DO254), which was issued in 2000
(this guide is the hardware equivalent of
DO178B for software). DO254-related activities are
described in detail, such as a dedicated workflow,
validation (giving evidence of the completeness and
correctness of all design life cycle outputs) and
verification (evaluation of an implementation of
requirements to determine that they have been met),
and also verification tool qualification.
-
Introducing New Verification Methods into a Company's Design Flow: An Industrial User's Point of View [p. 689]
-
R. Lissel and J. Gerlach
Today the task of design verification has become one of
the key bottlenecks in hardware and system design. To address
this topic, several verification languages, methods
and tools, which address several issues of the verification
process, have been developed by multiple EDA vendors in
recent years. This paper takes an industrial user's point of
view and explores the difficulties of introducing new verification
methods into a company's "naturally grown" and well-established
design flow - taking into account application
domain specific requirements, constraints given by the
existing design environment and economic aspects. The
presented approach extends the capabilities of an existing
verification strategy by powerful new features while keeping
in mind integration, reuse and applicability aspects. Based
on an industrial design example, the effectiveness and potential
of the developed approach are shown.
Moderators: A. Chatterjee, Georgia Institute of Technology, US; B. Kaminska, Simon Fraser U, CA
-
Testable Design for Advanced Serial-Link Transceivers [p. 695]
-
M. Lin and K.-T. Cheng
This paper describes a DfT solution for modern serial-link
transceivers. We first summarize the architectures of
the Crosstalk Canceller and the Equalizer used in advanced
transceivers to which the proposed solution can be
applied. The solution addresses the testability and observability
issues of the transceiver for both characterization
and production testing. Without requiring a sophisticated
test instrument setup, the proposed solution can test
the clock and data recovery circuit and characterize the
decision-feedback equalizer in the receiver. Our experiments
demonstrate that the proposed method has significantly
higher fault coverage and lower hardware requirements
than the conventional approach of probing the eye-opening
of the signals inside the transceiver.
-
Method for Reducing Jitter in Multi-Gigahertz ATE [p. 701]
-
D.C. Keezer, D. Minier and P. Ducharme
Controlling jitter on a picosecond (or smaller) time
scale has become one of the most difficult challenges for
testing multi-gigahertz systems. In this paper we present
a novel method for reducing jitter in timing-critical ATE
signals. This method uses a real-time averaging
approach to combine multiple ATE signals and produces
timing references with significantly lower random jitter.
For example, we demonstrate a 3x reduction in jitter by
combining eight ATE signals (each with σ =4ps) to
produce a low-jitter signal (σ =1.3ps). The measured
jitter reduction is shown to closely match that predicted
by theory. This counter-intuitive (but welcome) result is
of general interest for the design of any low-jitter system,
and is particularly helpful for multi-GHz ATE where
precise timing is so critical.
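As a rough check of the averaging claim, the sketch below assumes that the eight ATE signals carry independent, zero-mean Gaussian jitter (an assumption; the abstract does not state the jitter model). Under that assumption, averaging N edges scales the random jitter by 1/sqrt(N), so eight signals with σ = 4 ps would give about 1.41 ps, the same range as the 1.3 ps reported above. The function names are illustrative only.

```python
import math
import random
import statistics

def averaged_jitter_sigma(sigma_ps, n_signals):
    """Predicted RMS jitter after averaging n independent edges (assumed model)."""
    return sigma_ps / math.sqrt(n_signals)

def simulate_averaged_jitter(sigma_ps, n_signals, n_edges=20000, seed=1):
    """Monte Carlo estimate: average n Gaussian edge offsets per edge event."""
    rng = random.Random(seed)
    averaged = [
        sum(rng.gauss(0.0, sigma_ps) for _ in range(n_signals)) / n_signals
        for _ in range(n_edges)
    ]
    return statistics.pstdev(averaged)

predicted = averaged_jitter_sigma(4.0, 8)     # 4 / sqrt(8) ps
simulated = simulate_averaged_jitter(4.0, 8)  # should land close to the prediction
print(f"predicted {predicted:.2f} ps, simulated {simulated:.2f} ps")
```

Correlated jitter between the combined signals would reduce the benefit, which is why the independence assumption matters here.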
-
Re-Configuration of Sub-blocks for Effective Application of Time Domain Tests [p. 707]
-
J. Anders, S. Krishnan and G. Gronthoud
AC sensitivities guide most Analogue Automatic Test Pattern Generators (AATPGs)
in determining the optimal frequencies of a sinusoidal test stimulus. The
optimal frequencies thus determined normally lie in the close vicinity of
the operating frequency of the circuit. Although these frequencies are
justifiable by the principles of the circuit, these test frequencies do not
bring any added value to the ultimate goal of cheap alternatives (low
frequency test signal and cheaper measurement equipment) for the analogue
and RF tests. In this paper, we propose to re-configure the circuit blocks,
in such a way that the operating frequencies of the respective sub-block are
shifted to lower testable frequencies. We have validated our proposal on a
sub-block of a satellite receiver circuit that resulted in lowering the test
frequencies of the corresponding sub-blocks from 12 GHz to 4 MHz, while attaining
the same level of defect coverage.
-
An ADC-BiST Scheme Using Sequential Code Analysis [p. 713]
-
E.S. Erdogan and S. Ozev
This paper presents a built-in self-test (BiST) scheme for analog-to-digital
converters (ADCs) based on a linear ramp generator and efficient output
analysis. The proposed analysis method is an alternative to histogram
based analysis techniques to provide test time improvements, especially
when the resources are scarce. In addition to the measurement of DNL
and INL, non-monotonic behavior can also be detected with the proposed
technique. We present two implementation options, depending on how many
on-chip resources are available. The ramp generator has high linearity
over a full-scale range of 1 V, and the generated ramp signal is capable of
testing 13-bit ADCs. The circuit implementation of the ramp generator
uses a feedback configuration to improve linearity and occupies an area of
0.017 mm² in a 0.5 μm process.
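The abstract does not spell out the analysis algorithm, so the following is only a sketch of what a streaming, histogram-free analysis of a ramp response could look like. It assumes the ideal number of samples per code (`ideal_hits`, a hypothetical parameter) is known from the ramp slope and the sampling rate, and it keeps just one run-length counter instead of a full histogram. The first and last codes are skipped because their runs may be truncated by the ramp limits.

```python
def sequential_code_analysis(codes, ideal_hits):
    """Stream ADC codes from a linear ramp; return DNL, INL (in LSB) and a
    monotonicity flag. Illustrative sketch, not the paper's algorithm."""
    dnl, inl = {}, {}
    monotonic = True
    acc = 0.0
    first = codes[0]
    current, run = codes[0], 0
    for code in codes:
        if code == current:
            run += 1            # still inside the same code bin
            continue
        if code < current:
            monotonic = False   # output stepped backwards on a rising ramp
        if current != first:    # first code's run may be truncated
            d = run / ideal_hits - 1.0
            acc += d
            dnl[current] = d
            inl[current] = acc
        current, run = code, 1
    return dnl, inl, monotonic

# 3-bit example: code 3 is too wide (6 hits), code 4 too narrow (2 hits).
ramp = [0]*4 + [1]*4 + [2]*4 + [3]*6 + [4]*2 + [5]*4 + [6]*4 + [7]*4
dnl, inl, monotonic = sequential_code_analysis(ramp, ideal_hits=4)
print(dnl[3], dnl[4], inl[3], monotonic)
```

In this sketch a missing code simply never appears in the result; a fuller version would report it with a DNL of -1 LSB.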
-
Boosting SER Test for RF Transceivers by Simple DSP Technique [p. 719]
-
J. Dabrowski and R. Ramzan
The paper presents a new technique of symbol error rate
test (SER) for RF transceivers. A simple DSP algorithm
implemented at the receiver baseband is introduced in
terms of constellation correction, which is usually used to
compensate for IQ imbalance. The test is oriented at
detection of impairments in gain and noise figure in a
transceiver frontend. The proposed approach is shown to
enhance the sensitivity of a traditional SER test to the
limits of its counterpart, the error vector magnitude
(EVM) test. Its advantages over EVM are simple
implementation, lower DSP overhead and the ability to
achieve a larger dynamic range of the test response.
Test time is also reduced compared to a traditional SER
test. The technique is validated by a simulation model of a
Wi-Fi transceiver implemented in Matlab™.
-
Novel Test Infrastructure and Methodology Used for Accelerated Bring-Up and In-System
Characterization of the Multi-Gigahertz Interfaces on the Cell Processor [p. 725]
-
P. Yeung, A. Torres and P. Batra
Design-for-test (DFT) techniques are continuously used
in designs to help identify defects during silicon
manufacturing. However, prior to production, a significant
amount of time and effort is needed to bring up and
validate various aspects of the silicon design in the system.
In particular, the use of multi-Gigabit I/O signaling for a
high I/O count, high-volume product introduces unique test
challenges during these two phases of the product life
cycle.
In this paper, we shall discuss the test infrastructure and
methodologies used to accelerate bring-up and in-system
silicon characterization for high-speed mixed-signal I/O.
These ideas will lead to a shortened time to market (TTM)
at a lower cost. As a case study, we shall illustrate these
techniques used in the development of the Rambus
FlexIO™ processor bus and XIO™ memory interface used
on the first generation Cell processor (aka Cell Broadband
Engine™ or Cell BE). Cell was co-developed by Sony
Corporation, Sony Computer Entertainment Inc, Toshiba
Corporation, and IBM and is used in the Sony
PlayStation®3 (PS3) game console and other intense
computational applications. The Cell processor uses
5Gbps links for the processor's FlexIO system interface
and 3.2Gbps links for the processor's XDR™ memory
interface. This per pin bandwidth translates into a system
interface with a bandwidth of 60GB/s and a memory
interface with a bandwidth of 25.6GB/s, respectively.
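The per-pin and aggregate figures above can be related with simple arithmetic. The sketch below derives the number of links implied by each pair of numbers; the link counts are an inference from the quoted figures, not something the abstract states, and the calculation ignores any encoding or protocol overhead.

```python
def implied_link_count(aggregate_gbytes_per_s, per_link_gbits_per_s):
    """Links needed to reach an aggregate bandwidth at a given per-pin rate."""
    return aggregate_gbytes_per_s * 8 / per_link_gbits_per_s

flexio_links = implied_link_count(60.0, 5.0)   # system interface
xdr_links = implied_link_count(25.6, 3.2)      # memory interface
print(round(flexio_links), round(xdr_links))
```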
-
Evaluation of Test Measures for LNA Production Testing Using a Multinormal Statistical Model [p. 731]
-
J. Tongbong, S. Mir and J.L. Carbonero
For Design-For-Test (DFT) purposes, analogue and
mixed-signal testing has to cope with the difficulty of test
evaluation before production. This paper aims at evaluating
test measures for RF components in order to optimize
production test sets and thus reduce test cost. For this, we
have first developed a statistical model of the performances
and possible test measures of the Circuit Under Test (a Low
Noise Amplifier). The statistical multi-normal model is
derived from data obtained using Monte-Carlo circuit
simulation (five hundred iterations). This statistical model is
then used to generate a larger circuit population (one million
instances) from which test metrics can be estimated with
ppm precision at the design stage, considering just process
deviations. With the use of this model, a trade-off between
defect level and yield loss resulting from process deviations
is used to set test limits. After fixing test limits, we have
carried out a fault simulation campaign to verify the
suitability of the different test measurements, targeting both
catastrophic and single parametric faults. Catastrophic faults
are modelled by shorts and opens. A parametric fault is
defined as the minimum value of a physical parameter that
causes a specification to be violated. Test metrics are then
evaluated for the LNA case-study. As a result, test metrics
for functional measurements such as S-parameters and
Noise Figure are compared with low cost test measurements
such as RMS and peak-to-peak current consumption and
output voltage, input/output impedance, and the correlation
between current consumption and output voltage.
Organizers/Moderators: B. Courtois, TIMA Laboratory, FR; I. O'Connor, Ecole Centrale de Lyon, FR
-
Heterogeneous Systems on Chip and Systems in Package [p. 737]
-
I. O'Connor, B. Courtois, K. Chakrabarty, N. Delorme, M. Hampton, J. Hartung
This paper discusses several forms of heterogeneity in
systems on chip and systems in package. A means to
distinguish the various forms of heterogeneity is given,
with an estimation of the maturity of design and modeling
techniques with respect to various physical domains.
Industry-level MEMS integration, and more prospective
microfluidic biochip systems are considered at both
technological and EDA levels. Finally, specific flows for
signal abstraction heterogeneity in RF SiP and for
functional co-verification are discussed.
Moderators: E.M. Aboulhamid, Montreal U, CA; T. Austin, U of Michigan, US
-
Engineering Trust with Semantic Guardians [p. 743]
-
I. Wagner and V. Bertacco
The ability to guarantee the functional correctness of digital
integrated circuits and, in particular, complex microprocessors,
is a key task in the production of secure and trusted
systems. Unfortunately, this goal remains today an unfulfilled
challenge, as even the most straightforward practical
designs are released with latent bugs. Patching techniques
can repair some of these escaped bugs; however, they often
incur a performance overhead and, most importantly, they
can only be deployed after an escaped bug has been exposed
at the customer site. In this paper we present a novel approach
to guaranteeing correct system operation by deploying
a semantic guardian component. The semantic guardian
is an additional control logic block which is included in the
design, and can switch the microprocessor's mode of operation
from its normal, high-performance but error-prone
mode, to a secure, formally verified safe mode, guaranteeing
that the execution will be functionally correct. We explore
several frameworks where a selective use of the safe mode
can enhance the overall functional correctness of a processor.
Additionally, we observe through experimentation that semantic
guardians facilitate the trade-off between the design
validation effort and the performance and area cost of the final
secure product. The experimental results show that the
area cost and performance overheads of a semantic guardian
can be as small as 3.5% and 5%, respectively.
-
CATS: Cycle Accurate Transaction-driven Simulation with Multiple Processor Simulators [p. 749]
-
D. Kim, S. Ha and R. Gupta
This paper focuses on enhancing performance of cycle
accurate simulation with multiple processor simulators.
Simulation performance is determined by how often simulators
exchange events with one another and how accurately
simulators model their behavior. Previous techniques
have limited their applicability or sacrificed accuracy
for performance. In this paper, we notice that inaccuracy
comes from events which arrive between event exchange
boundaries. To solve the problem, we propose
cycle accurate transaction-driven simulation, which maintains
event exchange boundaries at bus transactions but
compensates for the resulting loss of accuracy. The proposed technique is implemented
in a publicly available CATS framework and
our experiment with 64 processors achieves 1.2M processor
cycles/s (200K instructions/s) which is faster than
other cycle accurate frameworks by an order of magnitude.
-
A One-Shot Configurable-Cache Tuner for Improved Energy and Performance [p. 755]
-
A. Gordon-Ross, P. Viana, F. Vahid, W. Najjar and E. Barros
We introduce a new non-intrusive on-chip cache-tuning hardware
module capable of accurately predicting the best configuration of
a configurable cache for an executing application. Previous
dynamic cache tuning approaches change the cache configuration
several times as part of the tuning search process, executing the
application using inferior configurations and temporarily causing
energy and performance overhead. The introduced tuner uses a
different approach, which non-intrusively collects data on
addresses issued by the microprocessor, analyzes that data to
predict the best cache configuration, and then updates the cache
to the new best configuration in "one-shot," without ever having
to examine inferior configurations. The result is less energy and
less performance overhead, meaning that cache tuning can be
applied more frequently. We show through experiments that the
one-shot cache tuner can reduce memory-access related energy
for instructions by 35% and comes within 4% of a previous
intrusive approach, and results in 4.6 times less energy overhead
and a 7.7 times speedup in tuning time compared to a previous
intrusive approach, at the main expense of 12% larger size.
-
Design Fault Directed Test Generation for Microprocessor Validation [p. 761]
-
D.A. Mathaikutty, S.K. Shukla, S.V. Kodakara, D. Lilja and A. Dingankar
Functional validation of modern microprocessors is an important
and complex problem. One of the problems in functional validation
is the generation of test cases that have a higher potential to
find faults in the design. We propose a model-based test generation
framework that generates tests for design fault classes inspired
by software validation. There are two main contributions in this
paper. Firstly, we propose a microprocessor modeling and test
generation framework that generates test suites to satisfy Modified
Condition Decision Coverage (MCDC), a structural coverage
metric that detects most of the classified design faults as well as
the remaining faults not covered by MCDC. Secondly, we show
that there exists good correlation between types of design faults
proposed by software validation and the errors/bugs reported in
case studies on microprocessor validation. We demonstrate the
framework by modeling and generating tests for the microarchitecture
of VESPA, a 32-bit microprocessor. In the results section,
we show that the tests generated using our framework's coverage-directed
approach detect the fault classes with 100% coverage,
when compared to model-random test generation.
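To make the MCDC metric mentioned above concrete, the sketch below enumerates, for a small Boolean decision, the pairs of test vectors that differ in exactly one condition and flip the decision outcome (the unique-cause form of MC/DC). It illustrates the coverage criterion only and is not the paper's modeling or test generation framework; the `stall` decision and its condition names are hypothetical.

```python
from itertools import product

def mcdc_independence_pairs(decision, n_conditions):
    """For each condition index, list vector pairs that differ only in that
    condition and change the decision outcome."""
    pairs = {i: [] for i in range(n_conditions)}
    for vec in product([False, True], repeat=n_conditions):
        for i in range(n_conditions):
            if vec[i]:
                continue  # visit each pair once, starting from its False side
            flipped = vec[:i] + (True,) + vec[i + 1:]
            if decision(*vec) != decision(*flipped):
                pairs[i].append((vec, flipped))
    return pairs

def stall(hazard, branch, flush):
    """Hypothetical pipeline-control decision used only as an example."""
    return hazard and (branch or flush)

pairs = mcdc_independence_pairs(stall, 3)
for index, found in pairs.items():
    print(index, found)
```

A test suite satisfies MC/DC for this decision if it contains at least one pair from each list; sharing vectors between pairs keeps the suite small.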
-
Impact of Description Language, Abstraction Layer, and Value Representation on
Simulation Performance [p. 767]
-
W. Ecker, V. Esen, L. Schönberg, T. Steininger, M. Velten and M. Hull
In recent years, verification features other than simulation
performance, such as robustness and debugging, have gained increasing
influence on simulation language and tool selection. However,
the fastest possible model execution is still priority number one for
many design and verification engineers. This can be seen in the
continuously growing interest in virtual prototypes and transaction
level modeling (TLM).
As part of the ongoing re-work of modeling language strategies and
the world wide introduction of TLM, a detailed analysis of the
impact of description languages, abstraction layers and data types
on simulation performance is of high importance. For the
presented analysis, we considered five designs that have been
modeled in VHDL, Verilog, SystemVerilog, and SystemC, using
different value representations and coding styles, covering the
abstraction levels from functional to behavioral to RTL.
This paper presents our evaluation environment and several
interesting findings of our analysis. The most important results are
as follows: We found that the HDL tool/language/abstraction
selection for RTL models affects the execution speed by a
factor of 4.4. We found that Verilog is on average 2x faster than
VHDL for RTL models. We found that SystemC results in 10x
slower RTL models than HDLs and, surprisingly, in 2.6x
slower TLM1 PV models than SystemVerilog. Finally, we found
that, on average over all analyzed aspects, SystemVerilog
models execute fastest.
Moderators: D. Soudris, Thrace Democritus U, GR; M. Poncino, Politecnico di Torino, IT
-
Adaptive Power Management in Energy Harvesting Systems [p. 773]
-
C. Moser, L. Thiele, D. Brunelli and L. Benini
Recently, there has been a substantial interest in the design
of systems that receive their energy from regenerative
sources such as solar cells. In contrast to approaches that
attempt to minimize the power consumption we are concerned
with adapting parameters of the application such
that a maximal utility is obtained while respecting the limited
and time-varying amount of available energy. Instead of
solving the optimization problem on-line, which may be prohibitively
complex in terms of running time and energy consumption,
we propose a parameterized specification and the
computation of a corresponding optimal on-line controller.
The efficiency of the new approach is demonstrated by experimental
results and measurements on a sensor node.
-
Stochastic Modeling and Optimization for Robust Power Management in a Partially
Observable System [p. 779]
-
Q. Qiu, Y. Tan and Q. Wu
As the hardware and software complexity grows, it is
unlikely for the power management hardware/software to
have a full observation of the entire system status. In this
paper, we propose a new modeling and optimization
technique based on the partially observable Markov decision
process (POMDP) for robust power management, which
can achieve near-optimal power savings, even when only
partial system information is available. Three scenarios of
partial observations that may occur in an embedded system
are discussed and their modeling techniques are presented.
The experimental results show that, compared with a power
management policy derived from a traditional Markov
decision process model that assumes the system is fully
observable, the new power management technique gives a
significantly better performance and energy tradeoff.
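A POMDP-based power manager cannot read the system state directly, so it maintains a belief (a probability distribution over states) and updates it after each action and observation. The sketch below shows the standard Bayesian belief update; the two-state model, the "stay" action, the noisy "low"/"high" utilization readings and all probabilities are hypothetical and are not taken from the paper.

```python
def belief_update(belief, action, observation, T, O):
    """Standard POMDP belief update:
    b'(s2) is proportional to O[s2][obs] * sum over s of T[action][s][s2] * b(s)."""
    states = list(belief)
    unnorm = {
        s2: O[s2][observation] * sum(T[action][s][s2] * belief[s] for s in states)
        for s2 in states
    }
    total = sum(unnorm.values())
    return {s: p / total for s, p in unnorm.items()}

# Hypothetical model: the device is either idle or busy.
T = {"stay": {"idle": {"idle": 0.9, "busy": 0.1},
              "busy": {"idle": 0.3, "busy": 0.7}}}
O = {"idle": {"low": 0.8, "high": 0.2},
     "busy": {"low": 0.25, "high": 0.75}}

prior = {"idle": 0.5, "busy": 0.5}
posterior = belief_update(prior, "stay", "high", T, O)
print(posterior)
```

A policy then maps the belief, rather than a single assumed state, to a power management decision.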
-
Efficient and Scalable Compiler-Directed Energy Optimization for Realtime Applications [p. 785]
-
P.-K. Huang and S. Ghiasi
We present a compilation technique that targets realtime
applications running on embedded processors with combined
dynamic voltage scaling (DVS) and adaptive body
biasing (ABB) capabilities. Considering the delay and energy
penalty of switching between operating modes of the
processor, our compiler judiciously inserts mode switch instructions
in selected locations of the code and generates
an executable binary that is guaranteed to meet the deadline
constraint. More importantly, our algorithm runs very fast
and comes reasonably close to the theoretical limit of energy
optimization using DVS+ABB. At 65 nm technology,
we improve the energy dissipation of the generated code by
an average of 11.4% under deadline constraints. While our
technique's improvement in energy dissipation over conventional
DVS is marginal (3%) at 130nm, the average improvement
continues to grow to 4.7%, 8.8% and 15.4%
for 90nm, 65nm and 45nm technology nodes, respectively.
Compared to a recent ILP-based competitor, we improve
the runtime by more than three orders of magnitude, while
producing improved results.
-
Peripheral-Conscious Scheduling on Energy Minimization for Weakly Hard Real-time Systems [p. 791]
-
L. Niu and G. Quan
In this paper, we present a dynamic scheduling algorithm
to minimize the energy consumption by both the DVS processor
and peripheral devices in a weakly hard real-time
system. In our approach, we first use a new static approach
to partition real-time jobs into mandatory and optional parts
to meet the weakly hard real-time constraints. We then
adopt an on-line approach that can effectively exploit the
run-time variations and reduce the preemption impacts to
improve the energy saving performance. Extensive simulation
studies demonstrate that our approach can effectively
reduce the system-wide energy consumption while guaranteeing
the weakly hard constraints.
-
Task Scheduling under Performance Constraints for Reducing the Energy Consumption of GALS
Multi-Processor SoC [p. 797]
-
R. Watanabe, M. Kondo, M. Imai, H. Nakamura and T. Nanya
The present paper focuses on applications that are periodic
and have both latency and throughput constraints. For these
applications, pipeline scheduling is effective for reducing energy
consumption. Thus, the present paper proposes a pipelined
task scheduling method for minimizing the energy consumption of
GALS MP-SoC under latency and throughput constraints. First,
we model the target GALS MP-SoC architecture and application
tasks. We then show that the energy optimization problem under
this model belongs to the class of Mixed-Integer Linear Programming.
Next, we propose a new scheduling method based on simulated
annealing for the purpose of solving this problem quickly. Finally,
experimental results demonstrate that the proposed method
achieves a significant energy reduction on a real application under
a practical architecture.
Moderators: W. Kruijtzer, NXP, NL; G. Martin, Tensilica, US
-
Instruction Trace Compression for Rapid Instruction Cache Simulation [p. 803]
-
A. Janapsatya, A. Ignjatovic, S. Parameswaran and J. Henkel
Modern Application Specific Instruction Set Processors (ASIPs) have
customizable caches, where the size, associativity and line size can
all be customized to suit a particular application. To find the best
cache size suited for a particular embedded system, the application(s)
is/are executed, traces obtained, and caches simulated. Typically,
program trace files can range from a few megabytes to several
gigabytes. Simulation of cache performance using large program
trace files is a time consuming process. In this paper, a novel instruction
cache simulation methodology that can operate directly on
a compressed program trace file without the need for decompression
is presented. This feature allows our simulation methodology
to achieve an average speed-up of 9.67 times compared to the existing
state-of-the-art tool (Dinero IV cache simulator), for a range of
applications from the Mediabench suite.
-
Efficient Code Density through Look-up Table Compression [p. 809]
-
T. Bonny and J. Henkel
Code density is a major requirement in embedded system
design since it not only reduces the need for the scarce resource
memory but also implicitly improves further important
design parameters like power consumption and performance.
Within this paper we introduce a novel and efficient
hardware-supported approach that belongs to the group of
statistical compression schemes as it is based on Canonical
Huffman Coding. In particular, our scheme is the first to
also compress the necessary Look-up Tables that can become
significant in size if the application is large and/or
high compression is desired. Our scheme optimizes the
number of generated Look-up Tables to improve the compression
ratio. On average, we achieve compression ratios
as low as 49% (already including the overhead of the Look-up
Tables). Thereby, our scheme is entirely orthogonal to
approaches that take particularities of a certain instruction
set architecture into account. We have conducted evaluations
using a representative set of applications and have
applied it to three major embedded processor architectures,
namely ARM, MIPS and PowerPC.
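For readers unfamiliar with Canonical Huffman Coding, the sketch below shows the usual codeword assignment: symbols are sorted by code length, and each codeword is the previous one plus one, shifted left when the length grows. Because the codewords follow from the lengths alone, a decoder needs only compact tables. This illustrates the base coding scheme, not the paper's Look-up Table compression; the symbols and code lengths are made up.

```python
def canonical_codes(code_lengths):
    """Assign canonical Huffman codewords (bit strings) from symbol code lengths."""
    code, prev_len, codes = 0, 0, {}
    ordered = sorted(code_lengths.items(), key=lambda kv: (kv[1], kv[0]))
    for symbol, length in ordered:
        code <<= length - prev_len          # pad with zeros when the length grows
        codes[symbol] = format(code, f"0{length}b")
        code += 1                           # next codeword of the same length
        prev_len = length
    return codes

# Hypothetical code lengths, e.g. produced by a Huffman tree over instruction fields.
codes = canonical_codes({"A": 1, "B": 2, "C": 3, "D": 3})
print(codes)
```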
-
Microarchitectural Support for Program Code Integrity Monitoring in Application-specific
Instruction Set Processors [p. 815]
-
Y. Fei and Z.J. Shi
Program code in a computer system can be altered either
by malicious security attacks or by various faults in microprocessors.
At the instruction level, all code modifications
are manifested as bit flips. In this work, we present a generalized
methodology for monitoring code integrity at run-time
in application-specific instruction set processors (ASIPs),
where both the instruction set architecture (ISA) and the underlying
microarchitecture can be customized for a particular
application domain. We embed monitoring microoperations
in machine instructions; thus, the processor is augmented
with a hardware monitor automatically. The monitor
observes the processor's execution trace of basic blocks
at run-time, checks whether the execution trace aligns with
the expected program behavior, and signals any mismatches.
Since microoperations are at a lower software architecture
level than processor instructions, the microarchitectural support
for program code integrity monitoring is transparent to
upper software levels and no recompilation or modification
is needed for the program. Experimental results show that
our microarchitectural support can detect program code integrity
compromises with small area overhead and little performance
degradation.
-
Soft-core Processor Customization Using the Design of Experiments Paradigm [p. 821]
-
D. Sheldon, F. Vahid and S. Lonardi
Parameterized components are becoming more commonplace in
system design. The process of customizing parameter values for a
particular application, called tuning, can be a challenging task
for a designer. Here we focus on the problem of tuning a
parameterized soft-core microprocessor to achieve the best
performance on a particular application, subject to size
constraints. We map the tuning problem to a well-established
statistical paradigm called Design of Experiments (DoE), which
involves the design of a carefully selected set of experiments and
a sophisticated analysis that has the objective to extract the
maximum amount of information about the effects of the input
parameters on the experiment. We apply the DoE method to
analyze the relation between input parameters and the
performance of a soft-core microprocessor for a particular
application, using only a small number of synthesis/execution
runs. The information gained by the analysis in turn drives a
soft-core tuning heuristic. We show that using DoE to sort the
parameters in order of impact results in application speedups of
6x-17x versus an un-tuned base soft-core. When compared to a
previous single-factor tuning method, the DoE-based method
achieves 3x-6x application speedups, while requiring about the
same tuning runtime. We also show that tuning runtime can be
reduced by 40-45% by using predictive tuning methods already
built into a DoE tool.
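The idea of ranking parameters by impact can be illustrated with a two-level factorial experiment and main-effect estimates. The sketch below uses a full factorial design for simplicity, whereas DoE tools typically use fractional designs to keep the number of runs small. The parameter names and the `fake_runtime_ms` function, which stands in for a real synthesis and execution run, are hypothetical.

```python
from itertools import product

def main_effects(factors, measure):
    """Run a two-level full factorial design (-1 = off, +1 = on) and return
    each factor's main effect on the measured response."""
    runs = [dict(zip(factors, levels))
            for levels in product([-1, +1], repeat=len(factors))]
    results = [measure(run) for run in runs]
    effects = {}
    for f in factors:
        high = [r for run, r in zip(runs, results) if run[f] == +1]
        low = [r for run, r in zip(runs, results) if run[f] == -1]
        effects[f] = sum(high) / len(high) - sum(low) / len(low)
    return effects

def fake_runtime_ms(cfg):
    """Hypothetical stand-in for synthesizing and running one configuration."""
    return 100 - 30 * cfg["multiplier"] - 10 * cfg["cache"] - 2 * cfg["barrel_shifter"]

effects = main_effects(["multiplier", "cache", "barrel_shifter"], fake_runtime_ms)
ranked = sorted(effects, key=lambda f: abs(effects[f]), reverse=True)
print(effects, ranked)
```

A tuning heuristic could then consider the parameters in the `ranked` order, spending the size budget on the most influential ones first.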
-
Power Supply and Power Management in Ubicom [p. 827]
-
This session views the challenges for power supply and power management in
devices and systems of an ad hoc communication nature. The session highlights some of
the design aspects relevant to ubicom and their impact on the whole system design and
communication solutions.
Moderators: O. Deprez, Texas Instruments, FR; M. Heijligers, NXP IC-Lab, NL
-
From Algorithm to First 3.5G Call in Record Time. A Novel System Design Approach Based
on Virtual Prototyping and Its Consequences for Interdisciplinary System Design Teams [p. 828]
-
M. Brandenburg, A. Schöllhom, S. Heinen, J. Eckmüller and T. Eckart
Increasing system complexity, not only in wireless
communications, forces design teams to avoid errors
during the process of system refinement, thereby keeping
ambiguities during system implementation to a minimum.
On the other hand, the chosen system design approach has
to ensure that a system design project rapidly advances
through all stages of refinement from an algorithmic
model to a real "System on Chip" (SoC) while
maintaining backwards equivalence of the produced HW
and FW/SW code with the original algorithmic model.
This system design challenge also demands a new
interdisciplinary team approach encompassing all design
skills ranging from concept to HW and FW/SW
engineering as well as system verification to increase the
overlap in the system concept, implementation and
verification phase.
But how do these interdisciplinary teams cooperate
efficiently when they are used to metaphorically "speaking
different design languages"?
We show the employment of a novel system design approach
that resulted in an industry-record development time for a
3.5G UMTS modem. The approach serves as a common
system design language, avoiding the Babylonian
language disaster of isolated engineering worlds.
The motivation for an increasing overlap of system
concept, implementation and verification phases is
obvious: it can save time (to market) on the order of
several months or even more and thus drastically shorten
design cycles through parallel development of HW and FW/SW.
The proposed approach also helps to avoid costly
redesign cycles due to conceptual errors and optimizes
the quality of the developed system HW and FW/SW,
thereby also substantially reducing system development
R&D costs.
-
Portable Multimedia SoC Design: A Global Challenge [p. 831]
-
M. Paganini, G. Kimmich, S. Ducrey, G. Caubit and V. Coeffe
The intrinsic capability brought by each new
technology node opens the way to a broad range of system
integration options and continuously enables new applications to
be integrated in a single device to the point that almost
everything seems possible.
In reality the difference between a successful design and a failure
resides today more than ever in the ability of the design team to
properly master all the critical design factors at once. In
essence, today's System on Chip design represents a
multidisciplinary challenge that spans from Architecture through
Design to Test and finally mass production.
SoC design for portable applications has to cope with very
unique constraints that normally greatly challenge the ability of
an organization, and most of the time of an entire Company, to
fully master its industrialization capabilities, and that push
concurrent design to new limits.
In the end, only a well thought out Architecture followed by
best-practice design techniques with a high level of understanding of
the manufacturing constraints and excellent logistics can result in
a device that can be produced in the volume required by the cell
phone industry today.
This paper will try to capture how these challenges have been
addressed to design the family of Application Processing Engines
named Nomadik™. The paper will specifically focus on the third
generation device labeled STn8815S22 where the integration
capabilities of silicon technology have been paired with those of
System in Package design to provide an extremely compact and
effective System on Chip for portable multimedia applications.
An overview of the main success factors and challenges will be
presented, guiding the reader from the Architecture conception
through the chip industrialization. Both Silicon design and
packaging design will be illustrated, highlighting those
techniques that made this incredible product a reality.
-
What If You Could Design Tomorrow's System Today? [p. 835]
-
N. Wingen
This paper highlights a series of proven concepts
aimed at facilitating the design of next generation systems.
Practical system design examples are examined and provide
insight on how to cope with today's complex design challenges.
Moderators: E. Larsson, Linkoping U, SE; D. Gizopoulos, Piraeus U, GR
-
Circuit-Level Modeling and Detection of Metallic Carbon Nanotube Defects in Carbon Nanotube FETs [p. 841]
-
H. Hashempour and F. Lombardi
Carbon Nanotube Field Effect Transistors (CNTFETs)
are promising nano-scaled devices for implementing high
performance, very dense and low power circuits. The core
of a CNTFET is a carbon nanotube. Its conductance property
is determined by the so-called chirality of the tube;
chirality is difficult to control during manufacturing. This
results in conducting (metallic) nanotubes and defective
CNTFETs similar to stuck-on (SON or source-drain short)
faults, as encountered in classical MOS devices. This paper
studies this phenomenon by using layout information and
presents modeling and detection methodologies for nano-scaled
defects arising from the presence of metallic carbon
nanotubes. For CNTFET-based circuits (e.g. intramolecular),
these defects are analyzed using a traditional
stuck-at fault model. This analysis is applicable to primitive
and complex gates. Simulation results are presented for detecting
modeled metallic nanotube faults in CNTFETs using
a single stuck-at fault test set. A high coverage is achieved
(~98%).
Keywords: Carbon Nanotube, CNT, CNTFET, Defect
Modeling, Fault Detection, Nanotechnology
-
Error Rate Reduction in DNA Self-Assembly by Non-Constant Monomer Concentrations and Profiling [p. 847]
-
B. Jang, Y.-B. Kim and F. Lombardi
This paper proposes a novel technique based on profiling
the monomers for reducing the error rate in DNA self-assembly.
This technique utilizes the average concentration
of the monomers (tiles) for a specific pattern as found by
profiling its growth. The validity of profiling and the large
difference in the concentrations of the monomers are shown
to be applicable to different tile sets. To evaluate the error
rate, new Markov-based models are proposed to account for
the different types of bonding (i.e. single, double and triple)
in the monomers as a modification to the commonly assumed
kinetic trap model. A significant error rate reduction is
accomplished compared to a scheme with constant concentration
as commonly utilized under the kinetic trap model.
Simulation results are provided.
-
Design and DFT of a High-Speed Area-Efficient Embedded Asynchronous FIFO [p. 853]
-
P. Wielage, E.J. Marinissen, M. Altheimer and C. Wouters
Embedded First-In First-Out (FIFO) memories are increasingly used in many IC designs. We have created a new full-custom embedded
ripple-through FIFO module with asynchronous read and write clocks. The implementation is based on a micropipeline
architecture and is at least a factor of two smaller than SRAM-based and standard-cell-based counterparts. This paper gives an
overview of the most important design features of the new FIFO module and describes its test and design-for-test approach.
-
Test Quality Analysis and Improvement for an Embedded Asynchronous FIFO [p. 859]
-
T. Dubois, M. Azimane, E. Larsson, E.J. Marinissen, P. Wielage and C. Wouters
Embedded First-In First-Out (FIFO) memories are increasingly used in many IC designs. We have created a new full-custom
embedded FIFO module with asynchronous read and write clocks, which is at least a factor two smaller and also faster than
SRAM-based and standard-cell-based counterparts. The detection qualities of the FIFO test for both hard and weak resistive
shorts and opens have been analyzed by an IFA-like method based on analog simulation. The defect coverage of the initial FIFO
test for shorts in the bit-cell matrix has been improved by inclusion of an additional data background and low-voltage testing; for
low-resistant shorts, 100% defect coverage is obtained. The defect coverage for opens has been improved by a new test procedure
which includes waiting periods.
-
Logic Level Fault Tolerance Approaches Targeting Nanoelectronics PLAs [p. 865]
-
W. Rao, A. Orailoglu and R. Karri
A regular structure and capability to implement arbitrary logic functions
in a two-level logic form have placed crossbar-based Programmable
Logic Arrays (PLAs) as promising implementation architectures
in the emerging nanoelectronics environment. Yet reliability
constitutes an important concern in the nanoelectronics environment,
necessitating a thorough investigation and its effective
augmentation for crossbar-based PLAs. We investigate in this paper
fault masking for crossbar-based nanoelectronics PLAs. Missing
nanoelectronics devices at the crosspoints have been observed
as a major source of faults in nanoelectronics crossbars. Based on
this observation, we present a class of fault masking approaches
exploiting logic tautology in two-level PLAs. The proposed approaches
enhance the reliability of nanoelectronics PLAs significantly at low hardware
cost.
Moderators: F. Fummi, Verona U, IT; M. Lajolo, NEC Laboratories, US
-
A Multi-Core Debug Platform for NoC-Based Systems [p. 870]
-
S. Tang and Q. Xu
Network-on-Chip (NoC) is generally regarded as the most promising
solution for the future on-chip communication scheme in gigascale
integrated circuits. As traditional debug architecture for bus-based
systems is not readily applicable to identify bugs in NoC-based
systems, in this paper, we present a novel debug platform that supports
concurrent debug access to the cores under debug (CUDs) and
the NoC in a unified architecture. By introducing core-level debug
probes in between the CUDs and their network interfaces and a
system-level debug agent controlled by an off-chip multi-core debug
controller, the proposed debug platform provides in-depth analysis
features for NoC-based systems, such as NoC transaction analysis,
multi-core cross-triggering and global synchronized timestamping.
Therefore, the proposed solution is expected to help designers
identify bugs in NoC-based systems more effectively and efficiently.
Experimental results show that the design-for-debug cost for
the proposed technique in terms of area and traffic requirements is
moderate.
-
Seamless Hardware/Software Performance Co-Monitoring in a Codesign Simulation Environment
with RTOS Support [p. 876]
-
L. Moss, M. De Nanclas, L. Filion, S. Fontaine, G. Bois and M. Aboulhamid
Simulation monitoring tools are needed in
hardware/software codesign for performance debugging,
model validation and hardware/software partitioning
purposes. Existing tools are either hardware- or software-centric
and lack integrated and seamless co-monitoring.
This paper presents a system-level co-monitoring tool that
can monitor the computation and communication
activities of SystemC user modules, as well as bus,
memory and processor usage, on a variety of
hardware/software embedded configurations that may
include an RTOS. We also describe how performance
metrics are generated during or after simulation and
made accessible to users or external applications.
Finally, experimental results show that such co-monitoring
does not disturb the simulation's internal
timing and only moderately increases the simulation's
wall clock run time (by 11-22% for hardware/software
partitioned architectures).
-
Incremental ABV for Functional Validation of TL-to-RTL Design Refinement [p. 882]
-
N. Bombieri, F. Fummi and G. Pravadelli
Transaction-level modeling (TLM) has been proposed as the leading
strategy to address the always increasing complexity of digital
systems. However, its introduction raises a new challenge
for designers and verification engineers, since there are no mature
tools to automatically synthesize an RTL implementation from a
transaction-level (TL) design, thus manual refinements are mandatory.
In this context, the paper presents an incremental assertion-based
verification (ABV) methodology to check the correctness of
the TL-to-RTL refinement. The methodology relies on reusing assertions
and already checked code, and it is guided by an assertion
coverage metric.
-
Efficient Testbench Code Synthesis for a Hardware Emulator System [p. 888]
-
I. Mavroidis and I. Papaefstathiou
The rising complexity of modern embedded
systems is causing a significant increase in the verification
effort required by hardware designers and software
developers, leading to the "design verification crisis", as it
is known among engineers. Today's verification challenges
require powerful testbenches and high-performance
simulation solutions such as Hardware Simulation
Accelerators and Hardware Emulators that have been in
use in hardware and electronic system design centers for
approximately the last decade. In particular, in order to
accelerate functional simulation, hardware emulation is
used so as to offload calculation-intensive tasks from the
software simulator. However, the communication overhead
between the software simulator and hardware emulator is
becoming a new critical bottleneck. We tackle this problem
by partitioning the code running on the software simulator
into two sections: the testbench HDL (Hardware
Description Language) code that communicates directly
with the Design Under Test (DUT) and the remaining C-like
testbench code. The former section is transformed into
synthesizable code while the latter runs in a general
purpose CPU. Our experiments demonstrate that the
proposed method reduces the communication overhead by a
factor of about 5 compared to a conventional hardware
emulated simulation.
-
Implementation of a Transaction Level Assertion Framework in SystemC [p. 894]
-
W. Ecker, V. Esen, T. Steininger, M. Velten and M. Hull
Current hardware design and verification methodologies
reflect a trend towards abstraction levels higher than RTL,
referred to as transaction level (TL). Since transaction level
models (TLMs) are used for early prototyping and as reference
models for the verification of their RTL representation,
the quality assurance of TLMs is vital. Assertion based verification
(ABV) of RTL models has improved quality assurance
of IP blocks and SoC systems to a great extent. Since
mapping of an RTL ABV methodology to TL poses severe
problems due to different design paradigms, current ABV
approaches need extensions towards TL. In this paper we
present a prototype implementation of a TL assertion framework
using SystemC which is currently the de facto standard
for system modeling.
-
Automatic Generation of Functional Coverage Models from Behavioral Verilog Descriptions [p. 900]
-
S. Verma, I.G. Harris and K. Ramineni
As an industrial practice, the functional coverage models are developed based on a
high-level specification of the Design Under Verification (DUV). However, in the
course of implementation a designer makes specific choices which may not be reflected well in a functional coverage model developed entirely from a high-level specification. We present a method to automatically generate implementation-aware coverage
models based on the static analysis of a HDL description of the DUV. Experimental
results show that the functional coverage models generated using our technique correlate well with the detection of randomly injected errors into a design.
Moderators: P.J. Mosterman, The MathWorks, Inc, US; H. Giese, Paderborn U, DE
-
Compositional Specification of Behavioral Semantics [p. 906]
-
K. Chen, J. Sztipanovits and S. Neema
An emerging common trend in model-based design
of embedded software and systems is the adoption of
Domain-Specific Modeling Languages (DSMLs). While
abstract syntax metamodeling enables the rapid and
inexpensive development of DSMLs, the specification
of DSML semantics is still a hard problem. In previous
work, we have developed methods and tools for the
semantic anchoring of DSMLs. Semantic anchoring
introduces a set of reusable "semantic units" that
provide reference semantics for basic behavioral
categories using the Abstract State Machine (ASM)
framework. In this paper, we extend the semantic
anchoring framework to heterogeneous behaviors by
developing a method for the composition of semantic
units. Semantic unit composition reduces the required
effort from DSML designers and improves the quality
of the specification. The proposed method is
demonstrated through a case study.
-
Performance Analysis of Multimedia Applications Using Correlated Streams [p. 912]
-
K. Huang, L. Thiele, T. Stefanov and E. Deprettere
In modern embedded systems, data streams are often partitioned
into separate sub-streams which are processed on
parallel hardware components. To analyze the performance
of these systems with high accuracy, correlations between
event streams must be taken into account. No methods are
known so far that are able to model such a scenario with the
desired accuracy. In this paper, we present a new approach
to analyze correlations and we embed this analysis method
into a well-established modular performance analysis framework.
The presented approach enables system-level performance
analysis of complete systems by taking into account
stream correlations and blocking-read semantics. Experimental
results on a hardware-software prototyping system
are provided that show the accuracy of the analysis in a
practical application.
-
Simulation Platform for UHF RFID [p. 918]
-
V. Derbek, C. Steger, R. Weiß, D. Wischounig, J. Preishuber-Pfluegl and M. Pistauer
Developing modern integrated and embedded systems
requires well-designed processes to ensure flexibility and independence.
These features are related to exchangeability
of hardware targets and to the ability of choosing the
target at a very late stage in the implementation process.
Especially in the field of ultra high frequency radio frequency
identification (UHF RFID) the model-based design
approach leads to expected results. Besides a clear design
process, which is applied in this work to build the required
system architecture, the scope for UHF RFID simulations
is defined and an extendable platform based on
The MathWorks Matlab Simulink® is developed. This simulation
platform, based on a multi-processor hardware target,
using a Texas Instruments TMS320C6416 digital signal
processor is able to run UHF RFID tag simulations of
very high complexity. The highest effort is made to ensure
flexibility to handle future simulation models on the
same hardware target, realized by the continuous design
and implementation flow of this platform based on model-based
design.
-
Tool-Support for the Analysis of Hybrid Systems and Models [p. 924]
-
A. Bauer, M. Pister and M. Tautschnig
This paper introduces a method and tool-support for the
automatic analysis and verification of hybrid and embedded
control systems, whose continuous dynamics are often
modelled using MATLAB/Simulink. The method is based
upon converting system models into the uniform input language
of our efficient multi-domain constraint solving library,
ABSOLVER, which is then used for subsequent analysis.
Basically, ABSOLVER is an extensible SMT-solver
which addresses mixed Boolean and (nonlinear) arithmetic
constraint problems as they appear in the design of hybrid
control systems. It allows the integration and semantic
connection of various domain specific solvers via a logical
circuit, such that almost arbitrary multi-domain constraint
problems can be formulated and solved. Its design has been
tailored for extensibility, and thus facilitates the reuse of
expert knowledge, in that the most appropriate solver for
a given task can be integrated and used. As such the only
constraint over the problem domain is the capability of the
employed solvers. Our approach to systems verification has
been validated in an industrial case study using the model of
a car's steering control system. However, additional benchmarks
show that other hard instances of problems could
also be solved by ABSOLVER in respectable time, and that
for some instances, ABSOLVER's approach was the only
means of solving a problem at all.
-
Automatic Model Generation for Black Box Real-Time Systems [p. 930]
-
T.H. Feng, L. Wang, W. Zheng, S. Kanajan and S.A. Seshia
Embedded systems are often assembled from black box
components. System-level analyses, including verification
and timing analysis, typically assume the system description,
such as RTL or source code, as an input. There is
therefore a need to automatically generate formal models
of black box components to facilitate analysis.
We propose a new method to generate models of real-time
embedded systems based on machine learning from execution
traces, under a given hypothesis about the system's
model of computation. Our technique is based on a novel
formulation of the model generation problem as learning a
dependency graph that indicates partial ordering between
tasks. Tests based on an industry case study demonstrate
that the learning algorithm can scale up and that the deduced
system model accurately reflects dependencies between
tasks in the original design. These dependencies help
us formally prove properties of the system and also extract
data dependencies that are not explicitly stated in the specifications
of black box components.
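The idea of learning task orderings from execution traces can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a much simpler hypothesis-elimination stand-in, and the function name `infer_precedence`, the trace format (one ordered list of task names per observed run) and the task names are all assumptions made for illustration. Every ordered pair of tasks starts as a candidate dependency, and a pair is discarded as soon as a trace shows the two tasks in the opposite order.

```python
from itertools import permutations

def infer_precedence(traces):
    """Return pairs (a, b) where a ran before b in every trace containing both.

    Hypothetical helper: each trace is a list of task names in execution order.
    """
    tasks = sorted({task for trace in traces for task in trace})
    candidates = set(permutations(tasks, 2))
    for trace in traces:
        position = {task: i for i, task in enumerate(trace)}
        # Drop every hypothesis (a, b) that this trace contradicts.
        candidates = {
            (a, b)
            for (a, b) in candidates
            if a not in position or b not in position or position[a] < position[b]
        }
    return candidates

# Hypothetical traces of one periodic system observed three times.
traces = [
    ["sense", "filter", "log", "actuate"],
    ["sense", "log", "filter", "actuate"],
    ["sense", "filter", "actuate", "log"],
]
surviving = infer_precedence(traces)
```

The surviving pairs form a partial order that is only as trustworthy as the traces are representative: an ordering never contradicted in the observations may still be coincidental.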
Organizers: N. Nandra, Synopsys, US; R. Wittmann, Nokia, DE
Moderator: G. Gielen, KU Leuven, BE
-
Life Begins at 65 - Unless You Are Mixed Signal? [p. 936]
-
R. Wittmann, N. Nandra, J. Kunkel, M. Vanzi, J. Franca, H.-J. Wassener, C. Münker
The old school of analog designers, exemplified by
pioneer Bob Pease, is becoming an extinct species. But
the demand for analog/mixed-signal IP blocks has
never been greater, especially at 65 nm and below. Can
this demand be met by using externally designed 3rd
party analog/mixed-signal IP? Or is the
implementation of revolutionary changes to traditional
work flows and analog design processes a suitable
option? Which solutions that help in increasing design
efficiency are currently on the table? In the future,
which side of the table will analog designers of Bob
Pease's generation sit: the IP provider or the chip
company? Or are their skills redundant for the 65 nm
analog design challenges?
Moderators: M. Coppolla, STMicroelectronics, IT; P. Ienne, EPFL Lausanne, CH
-
Routing Table Minimization for Irregular Mesh NoCs [p. 942]
-
E. Bolotin, I. Cidon, R. Ginosar and A. Kolodny
The majority of current Network on Chip (NoC) architectures
employ mesh topology and use simple static routing, to reduce
power and area. However, regular mesh topology is unrealistic
due to variations in module sizes and shapes, and is not suitable
for application-specific NoCs. Consequently, simplistic routing
techniques such as XY routing are inadequate, raising the need for
low cost alternatives which can work in irregular mesh networks.
In this paper we present a novel technique for reducing the total
hardware cost of routing tables for both source and distributed
routing approaches. The proposed technique is based on applying
a fixed routing function combined with minimal deviation tables
that are used only when the routing decisions for a given
destination deviate from the predefined routing function. We
apply this methodology to compare three hardware efficient
routing methods for irregular mesh topology NoCs. For each
method, we develop path selection algorithms that minimize the
overall cost of routing tables. Finally, we demonstrate by
simulations on random and specific real application network
instances a significant cost saving compared to standard solutions,
and examine the scaling of cost savings with growing NoC size.
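The combination of a fixed routing function with a deviation table can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: XY routing is used as the fixed function, and the names `xy_route`, `next_port`, the port labels and the example mesh are hypothetical. A table entry is stored only for those (router, destination) pairs whose routing decision differs from the fixed function.

```python
from typing import Dict, Tuple

Node = Tuple[int, int]  # (x, y) mesh coordinates

def xy_route(cur: Node, dst: Node) -> str:
    """Fixed routing function: travel along X first, then along Y."""
    if cur[0] != dst[0]:
        return "E" if dst[0] > cur[0] else "W"
    if cur[1] != dst[1]:
        return "N" if dst[1] > cur[1] else "S"
    return "LOCAL"

def next_port(cur: Node, dst: Node,
              deviations: Dict[Tuple[Node, Node], str]) -> str:
    """Use the deviation entry if one exists, else the fixed function."""
    return deviations.get((cur, dst), xy_route(cur, dst))

# Hypothetical irregular mesh: position (1, 0) holds no router, so traffic
# from (0, 0) to (2, 0) must first go north. One entry covers this case;
# every other decision on the detour path falls back to XY routing.
deviations = {((0, 0), (2, 0)): "N"}

path_ports = [
    next_port((0, 0), (2, 0), deviations),
    next_port((0, 1), (2, 0), deviations),
    next_port((1, 1), (2, 0), deviations),
    next_port((2, 1), (2, 0), deviations),
]
```

In this toy case the table holds a single entry instead of one entry per destination, which is the kind of saving the abstract refers to.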
-
Congestion-Controlled Best-Effort Communication for Networks-on-Chip [p. 948]
-
J.W. van den Brand, C. Ciordas, K. Goossens and T. Basten
Congestion has negative effects on network performance.
In this paper, a novel congestion control strategy
is presented for Networks-on-Chip (NoC). For this purpose
we introduce a new communication service, congestion-controlled
best-effort (CCBE). The load offered to a CCBE
connection is controlled based on congestion measurements
in the NoC. Link utilization is monitored as a congestion
measure, and transported to a Model Predictive Controller
(MPC). Guaranteed bandwidth and latency connections in
the NoC are used for this, to assure progress of link utilization
data in a congested NoC. We also present a simple but effective
model for link utilization for the model-based predictions.
Experimental results show that the presented strategy is effective
and has reaction speeds of several microseconds which
is considered acceptable for real-time embedded systems.
-
Undisrupted Quality-of-Service during Reconfiguration of Multiple Applications in Networks on Chip [p. 954]
-
A. Hansson, M. Coenen and K. Goossens
Networks on Chip (NoC) have emerged as the design
paradigm for scalable System on Chip (SoC) communication
infrastructure. Due to convergence, a growing number
of applications are integrated on the same chip. When combined,
these applications result in use-cases with different
communication requirements. The NoC is configured per
use-case and traditionally all running applications are disrupted
during use-case transitions, even those continuing
operation.
In this paper we present a model that enables partial reconfiguration
of NoCs and a mapping algorithm that uses
the model to map multiple applications onto a NoC with
undisrupted Quality-of-Service during reconfiguration. The
performance of the methodology is verified by comparison
with existing solutions for several SoC designs. We apply
the algorithm to a mobile phone SoC with telecom, multimedia
and gaming applications, reducing NoC area by more
than 17% and power consumption by 50% compared to a
state-of-the-art approach.
Organizers: L. Anghel, TIMA Laboratory, FR; M.-L. Flottes, LIRMM, Montpellier, FR
Moderator: Y. Zorian, Virage Logic, US
-
Testing in the Year 2020 [p. 960]
-
R. Galivanche, R. Kapur and A. Rubio
Testing today of a several hundred million transistor
System-on-Chip with analog, RF blocks, many processor
cores and tens of memories is a huge task. What will test
technology be like in year 2020 with hundreds of billions
of transistors on a single chip? Can we get there with
tweaks to today's technology? While the exact nature of
the circuit styles, architectural innovations and product
innovations in year 2020 are highly speculative at this
point, we examine the impact of likely design and process
technology trends on testing methods.
Moderators: P. Manet, U Catholique de Louvain, BE; I. Söderquist, SAAB AB, Saab Avitronics, SE
-
Transaction Level Modeling of SCA Compliant Software Defined Radio Waveforms and
Platforms PIM/PSM [p. 966]
-
G. Gailliard, E. Nicollet, M. Sarlotte and F. Verdier
In the scope of the US Department of Defense (DoD)
Joint Tactical Radio System (JTRS) program, the
portability and reconfigurability needs of Software
Defined Radios (SDR) required by the Software
Communications Architecture (SCA) [1] can be resolved
thanks to Model Driven Architecture (MDA) and
component/container paradigm to address a heterogeneous
hardware and software architecture.
In this paper, we propose SystemC Transaction Level
Modelling (TLM) to simulate Platform Independent
Model (PIM) and Platform Specific Model (PSM) of
SDRs, while keeping the component/container approach
for application portability. We show that SystemC 2.1
natively enables simulation of the waveform PIM specified
in UML to obtain an executable specification, which can
be reused to validate the SystemC TLM model of the PSM.
The latter allows radio platform virtualisation and true
reuse of IP models for earlier validation of SDR waveforms
and platforms.
-
Event Driven Data Processing Architecture [p. 972]
-
I. Söderquist
This paper describes a data processing architecture
where events and time are in focus. This differs from
traditional von Neumann and data flow architectures.
New instruction codes are defined and special circuitry is
introduced to express and execute event and time
operations. This results in reconfigurable software
controlled functionality together with real-time
performance comparable to dedicated VLSI solutions.
The architecture is demonstrated in a real-time radar
jammer application. The architecture is also promising
for applications such as routers and network processors. A
prototype system on silicon (SoC), complete with signal
memory, instruction memory, four processing units in
parallel and interfaces for digitized signals and host
computer, is fabricated in 0.35 μm standard CMOS. Time
events of signal data on two simultaneous 8-bit links can
be programmed with a time resolution of one clock
period. Measurements verified correct function and
performance above 400 MHz clock frequency at a 3.3 V
supply. Power consumption is 3.6 W at 320 MHz.
-
Reconfigurable System-on-Chip Data Processing Units for Space Imaging Instruments [p. 977]
-
B. Fiethe, H. Michalik, C. Dierker, B. Osterloh and G. Zhou
Individual Data Processing Units (DPUs) are
commonly used for operational control and specific data
processing of scientific space instruments. To overcome
the limitations of traditional rad-hard or fully commercial
design approaches, a System-on-Chip (SoC) solution
based on state-of-the-art FPGA is introduced. This design
has been successfully demonstrated in space on Venus
Express. From this, a reconfigurable DPU design for
future advanced imaging sensors is derived using
embedded processing cores. In addition, a SoC design
variant is presented based on recently available FPGA
technology with an integrated hardwired processor, which is
also capable of supporting high-end payload applications.
-
Enabling Certification for Dynamic Partial Reconfiguration Using a Minimal Flow [p. 983]
-
B. Rousseau, P. Manet, D. Galerin, D. Merkenbraeck, J.-D. Legat, F. Dedeken and Y. Gabriel
As the trend in reconfigurable electronics goes towards
strong integration, FPGA devices are becoming more and
more interesting. They are already used for safety-critical
applications such as avionics [9]. Latest FPGA's also enable
new techniques such as dynamic partial reconfiguration
(DPR), allowing new possibilities in terms of performance
and flexibility. Their use in safety-critical systems
is considered impossible nowadays since they must be
strictly validated, and DPR brings many new issues. Indeed,
the tools used for DPR must be certified, which is
nearly impossible for the current DPR tools provided by
the vendors. We have developed a simple flow upon the
usual static one for Xilinx FPGA's that does not require
any support of the vendor tools for DPR. This lessens the
complexity of tool certification, and makes a step towards
enabling the certification of DPR for safety-critical applications.
Moreover, under strong hypotheses, and by using
safe design principles, we show how the complexity of certifying
DPR can be reduced.
-
Identification of Process/Design Issues during 0.18 μm Technology Qualification for
Space Application [p. 989]
-
J. Ferrigno, P. Perdu, K. Sanchez and D. Lewis
Optical techniques (light emission and laser
stimulation techniques) are routinely used to evaluate
defects on specific components for space applications.
Just one anomaly on one component could have
catastrophic consequences on satellites. We must
analyse any kind of fault of the device, whatever the
origin of this fault is. It can be design, design-process,
process or end-user related. At the early stage of an analysis, choosing the right
technique is an increasingly complex task. In some
cases, one technique may bring value but not the
others. Using a 180nm test structure device, we will
present results showing the complementarity of
Emission Microscopy (EMMI), Time-Resolved
Emission (TRE) and Dynamic Laser Stimulation
(DLS) in order to help debug engineers to choose the
right approach. This complementarity gives us the ability
to strengthen hypotheses before any kind of physical
analysis.
-
RECOPS: Reconfiguring Programmable Devices for Military Hardware Electronics [p. 994]
-
P. Manet, D. Maufroid, L. Tosi, M. Di Ciano, O. Mulertt, Y. Gabriel, J.-D. Legat, D. Aulagnier,
C. Gamrat, R. Liberati and V. La Barba
This paper presents the RECOPS project that aims to
study the use of reconfiguration in military applications.
The project explores the new potentials and possibilities
offered by reconfigurable components like FPGA. It
identifies specificities related to the use of this technology
in military applications and proposes solutions to support
them. Specific techniques like dynamic reconfiguration or
high speed serial I/Os are also covered.
The paper gives a description of the project and then
presents preliminary results on the advantages and
impacts of using reconfiguration in military applications.
It also gives a synthetic view of the needs and challenges
that this technology must face to be integrated in
professional and military electronics applications. They
are based on a study made over a broad range of seven
demonstrators covering most of the fields of military
applications.
Keywords: reconfiguration, FPGA, defense, military,
dynamic reconfiguration, partial reconfiguration,
reconfigurable computing, high speed I/O.
Moderators: F. Salice, Politecnico di Milano, IT; P. Sánchez, Cantabria U, ES
-
WAVSTAN: Waveform Based Variational Static Timing Analysis [p. 1000]
-
S.K. Tiwary and J.R. Phillips
We present a waveform based variational static
timing analysis methodology. It is a timing paradigm that lies
midway between conventional static delay approximations and full
dynamic (SPICE-level) analysis. The core idea is to break the
modulation of waveforms processed by a circuit into two parts:
(a) non-linear circuit elements e.g., transistors, diodes etc. and
(b) linear elements: transmission line, RLC network etc. The
non-linear and linear parts of the circuit are then solved using a
combination of current-source modeling, model order reduction
methodology, perturbation analysis and learning-based Galerkin
methods, which help us get SPICE-like accuracies. The proposed
method is potentially as robust as, and 10-20X faster than, current-source
based gate modeling methodologies.
-
Rapid and Accurate Latch Characterization via Direct Newton Solution of Setup/Hold Times [p. 1006]
-
S. Srivastava and J. Roychowdhury
Characterizing setup/hold times of latches and registers, a
crucial component for achieving timing closure of large digital
designs, typically occupies months of computation in industries
such as Intel and IBM. We present a novel approach to speed
up latch characterization by formulating the setup/hold time
problem as a scalar nonlinear equation h(τ) = 0 derived using
state-transition functions, and then solving this equation by
Newton-Raphson (NR). The local quadratic convergence of NR
results in rapid improvements in accuracy at every iteration,
thereby significantly reducing the computation needed for accurate
determination of setup/hold times. We validate the fast
convergence and computational advantage of the new method on
transmission gate and C²MOS latch/register structures, obtaining
speedups of 4-10x over the current standard of binary search.
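The contrast between Newton-Raphson and binary search on a scalar equation h(τ) = 0 can be made concrete with a small sketch. The function `h` below is a hypothetical smooth stand-in with a known root, not the state-transition-based function of the paper, and its derivative `dh` is written analytically for simplicity. The sketch only compares how many iterations each method needs to reach the same tolerance.

```python
import math

def h(tau):
    # Hypothetical stand-in for h(tau); its root is ln(4).
    return math.exp(-tau) - 0.25

def dh(tau):
    # Analytic derivative of the stand-in function.
    return -math.exp(-tau)

def newton(tau, tol=1e-9, max_iter=50):
    """Newton-Raphson: repeat tau <- tau - h(tau) / h'(tau)."""
    for iteration in range(1, max_iter + 1):
        step = h(tau) / dh(tau)
        tau -= step
        if abs(step) < tol:
            return tau, iteration
    raise RuntimeError("Newton-Raphson did not converge")

def bisect(lo, hi, tol=1e-9):
    """Binary search: halve a bracket [lo, hi] that contains the root."""
    iterations = 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (h(lo) > 0) == (h(mid) > 0):
            lo = mid
        else:
            hi = mid
        iterations += 1
    return 0.5 * (lo + hi), iterations

root_newton, n_newton = newton(0.0)
root_bisect, n_bisect = bisect(0.0, 4.0)
print(f"Newton: {n_newton} iterations, bisection: {n_bisect} iterations")
```

Binary search gains one bit of accuracy per iteration, whereas Newton-Raphson roughly doubles the number of correct digits per iteration once it is close to the root. The iteration counts of this toy function say nothing about the speedups reported for real latch characterization.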
-
Temperature and Voltage Aware Timing Analysis: Application to Voltage Drops [p. 1012]
-
B. Lasbouygues, R. Wilson, N. Azemard and P. Maurine
In the nanometer era, the physical verification of CMOS
digital circuits becomes a complex task. Designers must
account for new factors that impose a significant change in
validation methods. One of these major changes in timing
verification to handle process variation lies in the
progressive development of statistical static timing engines.
However, the statistical approach cannot accurately capture
the deterministic variations of both voltage and
temperature. Therefore, we define a novel
method, based on non-linear derating coefficients, to
account for these environmental variations. Based on
temperature and voltage drop CAD tool reports, this method
allows computing the delay of logical paths considering
more realistic operating conditions for each cell. Application
is given to the analysis of voltage drop effects on timings.
-
Accurate Timing Analysis Using SAT and Pattern-Dependent Delay Models [p. 1018]
-
D. Tadesse, D. Sheffield, E. Lenge, R.I. Bahar and J. Grodstein
Accurate delay modeling beyond static models is critical to
garnering better correlation with post-silicon analysis. Furthermore,
post-silicon timing validation requires a pattern-dependent
timing model to generate patterns. To address these issues, we
propose a timing analysis tool that integrates a data-dependent
delay model into its analysis. Our approach solves for the delay
by using the concept of circuit unrolling and formulation of timing
questions as decision problems for input into a SAT solver. The effectiveness
and validity of the proposed methodology are illustrated
through experiments on benchmark circuits.
Moderators: S. van Loo, Philips Research, NL; H. De Groot, European Microsoft Innovation Centre, DE
-
CARAT: A Toolkit for Design and Performance Analysis of Component-Based Embedded Systems [p. 1024]
-
E. Bondarev, M. Chaudron and P.H.N. de With
Solid frameworks and toolkits for design and analysis of
embedded systems are of high importance, since they enable
early reasoning about critical properties of a system. This
paper presents a software toolkit that supports the design
and performance analysis of real-time component-based
software architectures deployed on heterogeneous multiprocessor
platforms. The tooling environment contains a set
of integrated tools for (a) component storage and retrieval,
(b) graphics-based design of software and hardware architectures,
(c) performance analysis of the designed architectures
and, (d) automated code generation. The cornerstone
of the toolkit is a performance analysis framework that automates
composition of the individual component models
into a system executable model, allows simulation of the
system model and gives design-time predictions of key performance
properties like response time, data throughput,
and usage of hardware resources. We illustrate the efficiency
of this toolkit on a Car Radio Navigation benchmark
system.
-
Modeling and Simulation Alternatives for the Design of Networked Embedded Systems [p. 1030]
-
E. Alessio, F. Fummi, D. Quaglia and M. Turolla
This paper addresses the problem of modeling and simulating
large sets of heterogeneous networked embedded systems
which cooperate to build cost-efficient, reliable, secure
and scalable applications. The purpose of this task
is an application-driven top-down design flow which starts
from application requirements and then progressively decides
the general architecture of the system and the type and
structure of its HW, SW and network components. In the
past, a considerable research effort has been made to create
specific tools for each design domain (software, hardware
and network), and to integrate them for data exchange between
models and their joint simulation. However, the advantages
and drawbacks of different combinations of tools
in the various stages of the design flow have not been discussed.
The paper describes and discusses how to combine
different modeling tools to provide different modeling and
simulation alternatives for the design of networked embedded
systems devoted to complex distributed applications.
The problem is addressed both theoretically and practically with
a real application derived from a European project.
-
Middleware Design Optimization of Wireless Protocols Based on the Exploitation of Dynamic
Input Patterns [p. 1036]
-
S. Mamagkakis, D. Soudris and F. Catthoor
Today, wireless networks are moving large amounts of data
between mobile devices, which have to work in a ubiquitous
computing environment, which perpetually changes
at run-time (i.e., nodes log on and off, varied user activity,
etc.). These changes introduce problems that cannot
be fully analyzed at design-time and require dynamic (run-time)
solutions. These solutions are implemented with the
use of run-time resource management at the middleware
level for a wide variety of embedded systems. In this paper,
we motivate and propose the characterization of the
dynamic inputs of wireless protocols (e.g., input to the IEEE
802.11b protocol coming from IPv4 data fragmentation).
Thus, through statistical analysis we derive patterns that
will guide our optimization process of the middleware for
run-time resource management design. We assess the effectiveness
of our approach with inputs of 18 real-life case
studies of wireless networks. Finally, we show up to 81.97%
increase in the performance of the proposed design solution
compared to the state-of-the-art solutions, without compromising
memory footprint or energy consumption.
-
Lightweight Middleware for Seamless HW-SW Interoperability, with Application to
Wireless Sensor Networks [p. 1042]
-
F.J. Villanueva, D. Villa, F. Moya, J. Barba, F. Rincón and J.C. López
HW-SW interoperability by means of standard distributed object middlewares
has proven useful in the design of new and challenging applications
for ubiquitous computing and ambient intelligence environments. Wireless sensor
networks are considered to be essential for the proper deployment of these
applications, but they impose new constraints in the design of the corresponding
communication infrastructure: low-cost middleware implementations that can fit
into tiny wireless devices are needed. In this paper, a novel approach for
the development of pervasive environments based on an ultra low-cost implementation
of standard distributed object middlewares (such as CORBA or ICE) is presented.
A fully functional prototype supporting full interoperability with ZeroC ICE
is described in detail. Available implementations range from the smallest
microcontrollers in the market, to the tiniest embedded Java virtual machines,
and even a low-end FPGA.
-
A Middleware-centric Design Flow for Networked Embedded Systems [p. 1048]
-
F. Fummi, G. Perbellini, R. Pietrangeli and D. Quaglia
The paper focuses on the design of networked embedded
systems which cooperate to provide complex distributed applications.
A milestone in the effort of simplifying the implementation
of such applications has been the introduction
of a service layer, named middleware, which abstracts from
the peculiarities of the operating system and HW components.
However, the presence of the middleware has not
yet been introduced in the design flow as an explicit dimension.
This work presents an abstract model of middleware
supporting different programming paradigms; it can
be used as a component in the design flow and allows the application
to be simulated and developed without making premature
assumptions on the actual HW/SW platform. At the end of
the design flow the abstract middleware can be mapped to
an actual middleware. The methodology has been analyzed
both theoretically and practically with the actual application
on a wireless sensor network.
Moderators: J. Henkel, Karlsruhe U, DE; A. Macii, Politecnico di Torino, IT
-
Dynamic Reconfiguration in Sensor Networks with Regenerative Energy Sources [p. 1054]
-
A. Nahapetian, P. Lombardo, A. Acquaviva, L. Benini and M. Sarrafzadeh
In highly power constrained sensor networks, harvesting energy
from the environment makes prolonged or even perpetual
execution feasible. In such energy harvesting systems, energy
sources are characterized as being regenerative. Regenerative
energy sources fundamentally change the problem of power
scheduling for embedded devices. Instead of the problem being
one of maximizing the lifetime of the system given a total amount
of energy, as in traditional battery powered devices, the problem
becomes one of preventing energy depletion at any given time.
Coupling relatively computationally intensive applications, such
as video processing applications, with the constrained FPGAs
that are feasible on power constrained embedded systems, makes
dynamic reconfiguration essential. It provides speed
comparable to a hardware implementation, while also allowing the
FPGA to be reconfigured to meet the multiple application needs of
the system. Different applications can be loaded on the FPGA, as
the system's needs change over time. The problem becomes how
to schedule the dynamic reconfiguration to appropriately make
use of the regenerative energy source, to ensure the proper
availability of energy for the system over time.
In this paper, we present a methodology for carrying out dynamic
reconfiguration for regenerative energy sources, based on
statistical analysis of tasks and supply energy. The approach is
evaluated through extensive simulations. Additionally, we have
evaluated our implementation on our regenerative energy,
dynamically reconfigurable prototype, known as the MicrelEye.
Our approach is shown to miss 57.7% fewer deadlines on average
than the current approach for reconfiguration with regenerative
energy sources.
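
The shift from a total-energy budget to avoiding depletion at every instant can be illustrated with a small check. This is only a sketch under assumed conditions, not the authors' scheduling method: the function name `first_depletion`, the discrete time slots, and the simple storage model (harvested energy is capped by the storage capacity, then the slot's demand is drawn) are all hypothetical.

```python
def first_depletion(initial, capacity, harvested, consumed):
    """Return the first time slot at which stored energy would go negative,
    or None if the schedule never depletes the store.

    Hypothetical model: in each slot the harvested energy is added (capped
    at `capacity`), then the energy consumed by the scheduled work is drawn.
    """
    level = initial
    for slot, (gain, demand) in enumerate(zip(harvested, consumed)):
        level = min(capacity, level + gain) - demand
        if level < 0:
            return slot
    return None


# Total supply (5 + 10) exceeds total demand (8), yet the store runs dry
# in slot 2 because the harvest arrives too late.
late_harvest = [0, 0, 0, 10]
print(first_depletion(5, 10, late_harvest, [2, 2, 2, 2]))
# Deferring part of the work in slot 2 keeps the level non-negative.
print(first_depletion(5, 10, late_harvest, [2, 2, 1, 2]))
```

A schedule that looks feasible on a lifetime-energy basis can therefore still fail, which is why the timing of reconfiguration relative to the harvest matters.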
-
Dynamic Power Management under Uncertain Information [p. 1060]
-
H. Jung and M. Pedram
This paper tackles the problem of dynamic power management
(DPM) in nanoscale CMOS design technologies that are typically
affected by increasing levels of process, voltage, and temperature
(PVT) variations and fluctuations. This uncertainty significantly
undermines the accuracy and effectiveness of traditional DPM
approaches. More specifically, we propose a stochastic framework
to improve the accuracy of decision making in power management,
while considering the manufacturing process and/or design
induced uncertainties. A key characteristic of the framework is
that uncertainties are effectively captured by a partially
observable semi-Markov decision process. As a result, the
proposed framework brings the underlying probabilistic PVT
effects to the forefront of power management policy determination.
Experimental results with a RISC processor demonstrate the
effectiveness of the technique and show that our proposed
variability-aware power management technique ensures robust
system-wide energy savings under probabilistic variations.
-
Very Wide Register: An Asymmetric Register File Organization for Low Power Embedded Processors [p. 1066]
-
P. Raghavan, A. Lambrechts, M. Jayapala, F. Catthoor, D. Verkest and H. Corporaal
In current embedded systems processors, multi-ported
register files are one of the most power hungry parts of
the processor, even when they are clustered. This paper
presents a novel register file architecture, which has single
ported cells and asymmetric interfaces to the memory
and to the datapath. Several realistic kernels from the TI
DSP benchmark and from Software Defined Radio (SDR)
are mapped on the architecture. A complete physical design
of the architecture is done in TSMC 90nm technology.
The novel architecture presented is shown to obtain energy
gains of up to 10X with respect to a conventional multi-ported
register file over the different benchmarks.
-
Single-ended Coding Techniques for Off-chip Interconnects to Commodity Memory [p. 1072]
-
M. Choudhury, K. Ringgenberg, S. Rixner and K. Mohanram
This paper introduces a class of single-ended coding schemes to
reduce off-chip interconnect energy consumption. State-of-the-art
codes for processor-memory off-chip interfaces require the transmitter
and receiver (memory controller and memory) to collaborate
using current and previously transmitted values to encode and decode
data. Modern embedded systems, however, cannot afford to
use such double-ended codes that require specialized memories to
participate in the code. In contrast, a single-ended code enables
the memory controller to encode data stored in memory and subsequently
decode that data when it is retrieved, allowing the use
of commodity memories. In this paper, single-ended codes are
presented that assign limited-weight codewords using trace-based
mapping techniques. Simulation results show that such codes can
reduce the energy consumption of an uncoded off-chip interconnect
by up to 42.5%.
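
As a rough illustration of trace-based mapping onto limited-weight codewords, the sketch below ranks 4-bit symbols by how often they occur in a trace and gives the most frequent ones the 5-bit codewords with the fewest 1s. Everything here is an assumption for illustration: the 4-bit symbol size, the 5-bit codeword width with at most two 1s (exactly 16 such words exist), the helper names, and the use of the number of transmitted 1s as a stand-in for interconnect energy. It is not the coding scheme evaluated in the paper.

```python
from collections import Counter
from itertools import combinations


def limited_weight_codewords(width, max_weight):
    """All `width`-bit words with at most `max_weight` ones, lightest first."""
    words = []
    for weight in range(max_weight + 1):
        for ones in combinations(range(width), weight):
            words.append(sum(1 << bit for bit in ones))
    return words


def build_codebook(trace, data_bits=4, width=5, max_weight=2):
    """Map the most frequent symbols in `trace` to the lightest codewords."""
    words = limited_weight_codewords(width, max_weight)
    if len(words) < (1 << data_bits):
        raise ValueError("not enough limited-weight codewords")
    counts = Counter(trace)
    # Stable sort: frequent symbols first, unseen symbols last.
    ranked = sorted(range(1 << data_bits), key=lambda sym: -counts[sym])
    encode = dict(zip(ranked, words))
    decode = {code: sym for sym, code in encode.items()}
    return encode, decode


def ones_transmitted(words):
    """Assumed energy proxy: total number of 1 bits driven on the wires."""
    return sum(bin(word).count("1") for word in words)


trace = [0xF] * 6 + [0x7] * 3 + [0x1]
encode, decode = build_codebook(trace)
coded = [encode[sym] for sym in trace]
print(ones_transmitted(trace), ones_transmitted(coded))
```

Both `encode` and `decode` live on the memory-controller side, so the memory only ever stores and returns the coded words; this is the property that makes such a code single-ended.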
-
PowerQuest: Trace Driven Data Mining for Power Optimization [p. 1078]
-
P. Babighian, G. Kamhi and M. Vardi
We introduce a general framework, called PowerQuest, with the
primary goal of extracting "interesting" dynamic invariants from
a given simulation-trace database, and applying it to the power-reduction
problem through detection of gating conditions.
PowerQuest adopts machine-learning techniques for data mining.
The advantages of PowerQuest in comparison with other state-of-the-art
Dynamic Power Management (DPM) techniques are: 1)
quality of ODC conditions for gating and 2) minimization of extra
logic added for gating. We demonstrate the validity of our
approach in reducing power through experimental results using
ITC99 benchmarks and real-life microprocessor test cases. We
present up to 22.7% power reduction in comparison with other
DPM techniques.
Moderators: S. Murali, Stanford U, US; L. Carloni, UCB, ES
-
(408) System Level Assessment of an Optical NoC in an MPSoC Platform [p. 1084]
-
M. Brière, B. Girodias, Y. Bouchebaba, G. Nicolescu, F. Mieyeville, F. Gaffiot and I. O'Connor
In the near future, Multi-Processor Systems-on-Chip
(MPSoC) will become the main thrust driving the
evolution of integrated circuits. MPSoCs introduce new
challenges, mainly due to growing communication
through their interconnect structure. Current electrical
interconnects will face hard challenges to overcome such
data flows. Integrated optical interconnect is a potential
technological improvement to reduce these problems. The
main contributions of this paper are i) the optical network
integration in a system-level MPSoC platform and ii) the
quantitative evaluation of optical interconnect for MPSoC
design using a multimedia application.
-
(142) Systematic Comparison between the Asynchronous and the Multi-Synchronous Implementations
of a Network on Chip Architecture [p. 1090]
-
A. Sheibanyrad, I. Miro Panades and A. Greiner
In this paper we present a systematic comparison between two
different implementations of a distributed Network on Chip: fully
asynchronous and multi-synchronous. The NoC architecture has
been designed to be used in a Globally Asynchronous Locally
Synchronous clustered Multi-Processor System on Chip. The five
relevant parameters are Silicon Area, Network Saturation
Threshold, Communication Throughput, Packet Latency and
Power Consumption. Both architectures have been physically
implemented and simulated by SystemC/VHDL co-simulation.
The electrical parameters have also been evaluated by post
layout SPICE simulation for a 90nm CMOS fabrication process,
taking into account the long wire effects.
-
(768) Analytical Router Modeling for Networks-on-Chip Performance Analysis [p. 1096]
-
U.Y. Ogras and R. Marculescu
Networks-on-Chip (NoCs) have recently emerged as a scalable
alternative to classical bus and point-to-point architectures. To
date, performance evaluation of NoC designs is largely based on
simulation which, besides being extremely slow, provides little
insight on how different design parameters affect the actual network
performance. Therefore, it is practically impossible to use
simulation for optimization purposes. In this paper, we first
present a generalized router model and then utilize this novel
model for doing NoC performance analysis. The proposed model
can be used not only to obtain fast and accurate performance
estimates, but also to guide the NoC design process within an
optimization loop. The accuracy of our approach and its practical
use is illustrated through extensive simulation results.
-
(374) Hard- and Software Modularity of the NOVA MPSoC Platform [p. 1102]
-
C. Sauer, M. Gries and S. Dirk
The Network-Optimized Versatile Architecture Platform
(NOVA) encapsulates embedded cores, tightly and
loosely coupled coprocessors, on-chip memories, and I/O
interfaces by special sockets that provide a common
packet passing and communication infrastructure. To ease
the programming of the heterogeneous multiprocessor
target for the application developer, a component based
framework is used for describing packet processing applications
in a natural and productive way. Leveraging identical
application and hardware communication semantics,
code generators and off-the-shelf tool chains can automate
the software implementation process. Using a prototype
with four processing cores we quantify the overhead of
modularity and programmability for the platform.
Organizers: S. Prudhomme, Airbus, FR; E. Lansard, Alcatel Alenia Space, FR
Moderator: S. Prudhomme, Airbus, FR
-
The Methodological and Technological Dimensions of Technology Transfer for Embedded Systems
in Aeronautics and Space [p. 1108]
-
T. Pardessus, H. Daembkes, and R. Arning
This tutorial is in two parts, to elaborate the two pillars of technology transfer in the context of the aeronautics
and space industry. The first part illustrates the methodological pillar, showing the state of the art in the
industrial approaches to technology transfer. The second part illustrates the technological pillar, giving an
overview of recent successes in technology transfer and emerging trends and opportunities, for both hardware
and software. These two pillars are further mirrored in the two technical sessions.
Moderators: R. Pacalet, ENST, FR; R. Locatelli, STMicroelectronics, FR
-
Energy Evaluation of Software Implementations of Block Ciphers under Memory Constraints [p. 1110]
-
J. Großschadl, S. Tillich, C. Rechberger, M. Hofmann and M. Medwed
Software implementations of modern block ciphers often
require large lookup tables along with code size increasing
optimizations like loop unrolling to reach peak performance
on general-purpose processors. Therefore, block ciphers are
difficult to implement efficiently on embedded devices like
cell phones or sensor nodes where run-time memory and
program ROM are scarce resources. In this paper we analyze
and compare the performance, energy consumption, run-time
memory requirements, and code size of the five block
ciphers RC6, Rijndael, Serpent, Twofish, and XTEA on the
StrongARM SA-1100 processor. Most previous evaluations
of block ciphers considered performance as the sole metric
of interest and did not care about memory requirements or
code size. In contrast to previous work, our study of the
performance and energy characteristics of block ciphers has
been conducted with "lightweight" implementations which
restrict the size of lookup tables to 1 kB and also impose
constraints on the code size. We found that Rijndael and
RC6 can be well optimized for high performance and energy
efficiency, while at the same time meeting the demand for
low memory (RAM and ROM) footprint. In addition, we
discuss the impact of key expansion and modes of operation
on the overall performance and energy consumption of each
block cipher. Our simulation results show that RC6 is the
most energy-efficient block cipher under memory constraints
and thus the best choice for resource-restricted devices.
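
To make the notion of a "lightweight" cipher concrete, here is a minimal Python sketch of XTEA, one of the five ciphers compared, written from its public specification. XTEA needs no lookup tables at all, only shifts, XORs and 32-bit additions. The function names and the example key and block values are illustrative; this is not the optimized StrongARM code measured in the paper.

```python
MASK32 = 0xFFFFFFFF
DELTA = 0x9E3779B9  # key-schedule constant from the XTEA specification


def xtea_encrypt_block(v0, v1, key, rounds=32):
    """Encrypt one 64-bit block given as two 32-bit words.

    `key` is a sequence of four 32-bit words.
    """
    total = 0
    for _ in range(rounds):
        v0 = (v0 + ((((v1 << 4) ^ (v1 >> 5)) + v1)
                    ^ (total + key[total & 3]))) & MASK32
        total = (total + DELTA) & MASK32
        v1 = (v1 + ((((v0 << 4) ^ (v0 >> 5)) + v0)
                    ^ (total + key[(total >> 11) & 3]))) & MASK32
    return v0, v1


def xtea_decrypt_block(v0, v1, key, rounds=32):
    """Invert xtea_encrypt_block by running the rounds backwards."""
    total = (DELTA * rounds) & MASK32
    for _ in range(rounds):
        v1 = (v1 - ((((v0 << 4) ^ (v0 >> 5)) + v0)
                    ^ (total + key[(total >> 11) & 3]))) & MASK32
        total = (total - DELTA) & MASK32
        v0 = (v0 - ((((v1 << 4) ^ (v1 >> 5)) + v1)
                    ^ (total + key[total & 3]))) & MASK32
    return v0, v1


key = (0x00010203, 0x04050607, 0x08090A0B, 0x0C0D0E0F)  # example key only
block = (0x41424344, 0x45464748)                        # example plaintext
cipher = xtea_encrypt_block(*block, key)
assert xtea_decrypt_block(*cipher, key) == block
```

The whole cipher state is two data words, four key words and a running sum, which is the kind of RAM and ROM footprint the memory-constrained comparison is concerned with.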
-
An Area Optimized Reconfigurable Encryptor for AES-Rijndael [p. 1116]
-
M. Alam, S. Ray, D. Mukhopadhyay, S. Ghosh, D. RoyChowdhury and I. Sengupta
This paper presents a reconfigurable architecture of the
Advanced Encryption Standard (AES-Rijndael) cryptosystem.
The suggested reconfigurable architecture is capable
of handling all possible combinations of standard bit
lengths (128, 192, 256) of data and key. The fully rolled
inner-pipelined architecture ensures lower hardware complexity.
The work develops an FSMD model based controller
which is ideal for such an iterative implementation of
AES. S-boxes here have been implemented using combinational
logic over composite field arithmetic which completely
eliminates the need of any internal memory. The design
has been implemented on a Xilinx Virtex XCV1000 and in
0.18 μm CMOS technology. The performance of the architecture
has been compared with existing results in the literature
and it has been found to be the most compact implementation
of the AES algorithm.
-
Performance Aware Secure Code Partitioning [p. 1122]
-
S.H.K. Narayanan, M. Kandemir and R. Brooks
Many embedded applications exist where decisions are made using
sensitive information. A critical issue in such applications is to
ensure that data is accessed only by authorized computing entities.
In many scenarios, these entities do not rely on each other, yet they
need to work on a secure application in parallel to complete application
execution under the specified deadline. Our focus in this
paper is on compiler-guided secure code partitioning among a set of
hosts. The scenario targeted involves a set of hosts that want to execute
a secure embedded application in parallel. The various hosts
have different levels of access to the data structures manipulated
in the application. Our approach partitions the application among
the hosts such that the load imbalance across hosts is minimized to
reduce execution time while ensuring that no security leak occurs.
-
Energy and Execution Time Analysis of a Software-based Trusted Platform Module [p. 1128]
-
N. Aaraj, A. Raghunathan, S. Ravi and N.K. Jha
Trusted platforms have been proposed as a promising approach to enhance
the security of general-purpose computing systems. However, for
many resource-constrained embedded systems, the size and cost overheads
of a separate Trusted Platform Module (TPM) chip are not acceptable.
One alternative is to use a software-based TPM (SW-TPM),
which implements TPM functions using software that executes in a protected
execution domain on the embedded processor itself. However,
since many embedded systems have limited processing capabilities and
are battery-powered, it is also important to ensure that the computational
and energy requirements for SW-TPMs are acceptable.
In this work, we perform an evaluation of the energy and execution
time overheads for a SW-TPM implementation on a Sharp Zaurus PDA.
We characterize the execution time and energy required by each TPM
command through actual measurements on the target platform. In addition,
we also evaluate the overheads of using SW-TPM in the context
of various end applications, including trusted boot of the Linux operating
system (OS), secure file storage, secure VoIP client, and secure
web browser. Furthermore, we observe that for most TPM commands,
the overheads are primarily due to the use of 2048-bit RSA operations
that are performed within SW-TPM. In order to alleviate SW-TPM overheads,
we evaluate the use of Elliptic Curve Cryptography (ECC) as a
replacement for the RSA algorithm specified in the Trusted Computing
Group (TCG) standards. Our experiments indicate that this optimization
can significantly reduce SW-TPM overheads (an average of 6.51X
execution time reduction and 6.75X energy consumption reduction for
individual TPM commands, and an average of 10.25X execution time
reduction and 10.75X energy consumption reduction for applications).
Our work demonstrates that ECC-based SW-TPMs are a viable approach
to realizing the benefits of trusted computing in resource-constrained
embedded systems.
Moderators: S. Vassiliadis, TU Delft, NL; P. Ienne, EPFL Lausanne, CH
-
Utilization of SECDED for Soft Error and Variation-Induced Defect Tolerance in Caches [p. 1134]
-
L.D. Hung, H. Irie, M. Goshima and S. Sakai
Combining SECDED with a redundancy technique
can effectively tolerate a high variation-induced defect rate
in future processes. However, while a defective cell in a
block can be repaired by SECDED, the block becomes vulnerable
to soft errors. This paper proposes a technique to
deal with the degraded resilience against soft errors. Only
clean data can be stored in defective blocks of a cache.
This constraint is enforced through a selective write-through
mechanism. An error occurring in a defective block can
be detected and the correct data can be obtained from the
lower level caches.
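
The reason a repaired block becomes vulnerable can be seen with a toy SECDED code. The sketch below uses an extended Hamming(8,4) code on a 4-bit value, which is an assumption chosen for brevity (cache SECDED normally protects much wider words), and the function names are hypothetical. One flipped bit, standing in for a defective cell, is corrected; a second flip, standing in for a soft error in the same block, can only be detected.

```python
def secded_encode(nibble):
    """Extended Hamming(8,4): index 0 is overall parity, 1..7 is Hamming(7,4)."""
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1            # makes overall parity even
    return code


def secded_decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'double'."""
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    overall = sum(code) & 1
    fixed = list(code)
    if overall == 1:
        # Odd overall parity: a single error at index `syndrome`
        # (index 0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even parity but nonzero syndrome: two errors, not correctable.
        return "double", None
    else:
        status = "ok"
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


word = secded_encode(0b1011)
defect = list(word)
defect[5] ^= 1            # defective cell: SECDED corrects it
soft = list(defect)
soft[2] ^= 1              # soft error on top: detected only
print(secded_decode(word), secded_decode(defect), secded_decode(soft))
```

In the detected-only case the data must come from elsewhere, which is why the proposal restricts defective blocks to clean data that a lower cache level can supply again.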
-
Transient Fault Prediction Based on Anomalies in Processor Events [p. 1140]
-
S. Narayanasamy, A. Coskun and B. Calder
Future microprocessors will be highly susceptible to transient errors
as the sizes of transistors decrease due to CMOS scaling. Prior
techniques advocated full scale structural or temporal redundancy to
achieve fault tolerance. Though they can provide complete fault coverage,
they incur significant hardware and/or performance cost. It is
desirable to have mechanisms that can provide partial but sufficiently
high fault coverage with negligible cost.
To meet this goal, we propose leveraging speculative structures
that already exist in modern processors. The proposed mechanism
is based on the insight that when a fault occurs, it is likely that
the incorrect execution would result in an abnormally high or low
number of mispredictions (branch mispredictions, L2 misses, store
set mispredictions) than a correct execution. We design a simple
transient fault predictor that detects the anomalous behavior in the
outcomes of the speculative structures to predict transient faults.
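
The underlying idea, flagging an interval whose misprediction count is abnormally high or low, can be sketched with a generic sliding-window outlier test. This is a stand-in technique chosen for illustration, not the predictor designed in the paper; the class name, the window size, the threshold of three standard deviations and the floor on the deviation are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class AnomalyPredictor:
    """Flag intervals whose misprediction count deviates from recent history."""

    def __init__(self, window=8, threshold=3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, mispredictions):
        """Return True if this interval's count looks anomalous."""
        suspicious = False
        if len(self.history) == self.history.maxlen:
            average = mean(self.history)
            # Floor the deviation so a very steady history does not make
            # every small fluctuation look like a fault.
            spread = max(pstdev(self.history), 1.0)
            suspicious = abs(mispredictions - average) > self.threshold * spread
        if not suspicious:
            # Only normal intervals update the baseline.
            self.history.append(mispredictions)
        return suspicious


predictor = AnomalyPredictor()
counts = [10, 11, 9, 10, 12, 10, 9, 11, 40, 10, 0]
flags = [predictor.observe(count) for count in counts]
print(flags)
```

Both the burst of 40 and the drop to 0 are flagged, matching the observation that a fault may push mispredictions either up or down; a flagged interval would then trigger whatever recovery the processor provides, such as re-execution.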
-
Low-cost Protection for SER Upsets and Silicon Defects [p. 1146]
-
M. Mehrara, M. Attariyan, S. Shyam, K. Constantinides, V. Bertacco and T. Austin
Extreme transistor scaling trends in silicon technology are
soon to reach a point where manufactured systems will suffer
from limited device reliability and severely reduced life-time,
due to early transistor failures, gate oxide wear-out, manufacturing
defects, and radiation-induced soft errors (SER).
In this paper we present a low-cost technique to harden a
microprocessor pipeline and caches against these reliability
threats. Our approach utilizes online built-in self-test
(BIST) and microarchitectural checkpointing to detect, diagnose
and recover the computation impaired by silicon defects
or SER events. The approach works by periodically
testing the processor to determine if the system is broken.
If so, we reconfigure the processor to avoid using the broken
component. A similar mechanism is used to detect SER
faults, with the difference that recovery is implemented by
re-execution. By utilizing low-cost techniques to address defects
and SER, we keep protection costs significantly lower
than traditional fault-tolerance approaches while providing
high levels of coverage for a wide range of faults. Using
detailed gate-level simulation, we find that our approach
provides 95% and 99% coverage for silicon defects and SER
events, respectively, with only a 14% area overhead.
-
Working with Process Variation Aware Caches [p. 1152]
-
M. Mutyam and V. Narayanan
Deep-submicron designs have to take care of process
variation effects as variations in critical process parameters
result in large variations in access latencies of hardware
components. This is severe in the case of memory components
as minimum sized transistors are used in their design.
In this work, by considering on-chip data caches, we
study the effect of access latency variations on performance.
We discuss performance losses due to the worst-case design,
wherein the entire cache operates with the worst-case
process variation delay, followed by process variation
aware cache designs which work at set-level granularity.
We then propose a technique called block rearrangement
to minimize performance loss incurred by a process variation
aware cache which works at set-level granularity. Using
the block rearrangement technique, we rearrange the physical
locations of cache blocks such that a cache set can have
its "n" blocks (assuming an n-way set-associative cache) in
multiple rows instead of a single row as in the case of a
cache with a conventional addressing scheme. By distributing
blocks of a cache set over multiple sets, we minimize the
number of sets being affected by process variation. We evaluate
our technique using SPEC2000 CPU benchmarks and
show that our technique achieves significant performance
benefits over caches with a conventional addressing scheme.
-
(252) An Enhanced Technique for the Automatic Generation of Effective Diagnosis-oriented Test
Programs for Processor [p. 1158]
-
E. Sánchez, M. Schillaci, G. Squillero and M. Sonza Reorda
The ever increasing usage of microprocessor devices is
sustained by a high volume production that in turn
requires a high production yield, backed by a controlled
process. Fault diagnosis is an integral part of the
industrial effort towards these goals. This paper presents a
new methodology that significantly improves on
previous work. The goal is the construction of cost-effective
program sets for software-based diagnosis of
microprocessors. The methodology exploits existing post-production
test sets, designed for software-based self-test,
and may use an already developed infrastructure IP to
perform the diagnosis. Experimental results are reported
in the paper comparing the new results with existing ones,
and showing the effectiveness of the new approach for an
Intel i8051 processor core.
-
(161) Functional and Timing Validation of Partially Bypassed Processor Pipelines [p. 1164]
-
Q. Zhu, A. Shrivastava and N. Dutt
Customizing the bypasses in pipelined processors is an effective
and popular means to perform power, performance and
complexity trade-offs in embedded systems. However, existing
techniques are unable to automatically generate test patterns to
functionally validate a partially bypassed processor. Manually
specifying directed test sequences to validate a partially bypassed
processor is not only a complex and cumbersome task, but is also
highly error-prone. In this paper we present an automatic
directed test generation technique to verify a partially bypassed
processor pipeline using a high-level processor description. We
define a fault model and coverage metric for a partially bypassed
processor pipeline and demonstrate that our technique can fully
cover all the faults using 107,074 tests for the Intel XScale
processor within 40 minutes. In contrast, randomly generated
tests can achieve 100% coverage with 2 million tests after half
a day. Furthermore, we demonstrate that our technique is able to
generate tests for all possible bypass configurations of the Intel
XScale processor.
Moderators: V. Bertacco, U of Michigan, US; S. Quer, Politecnico di Torino, IT
-
A Compositional Approach to the Combination of Combinational and Sequential Equivalence
Checking of Circuits without Known Reset States [p. 1170]
-
I.-H. Moon, B. Bjesse and C. Pixley
As the pressure to produce smaller and faster designs increases,
the need for formal verification of sequential transformations
increases proportionally. In this paper we describe
a framework that attempts to extend the set of designs that
can be equivalence checked. Our focus lies in
integrating sequential equivalence checking into a standard
design flow that relies on combinational equivalence checking
today. In order to do so, we cannot make use of reset
state or reset sequence information (as this is not given in
combinational equivalence checking), and we need to mitigate
the complexity inherent in the traditional sequential
equivalence checking algorithms. Our solution integrates
combinational and sequential equivalence checking in such
a way that the individual analyses benefit from each other.
The experimental results show that our framework can verify
designs which are out of range for pure sequential equivalence
checking methods aimed at designs with unknown reset states.
-
Estimating Functional Coverage in Bounded Model Checking [p. 1176]
-
D. Große, U. Kühne and R. Drechsler
Formal verification is an important issue in circuit and
system design. In this context, Bounded Model Checking
(BMC) is one of the most successful techniques. But even if
all specified properties can be verified, it is difficult to determine
whether they cover the complete functional behavior
of a design. We propose a pragmatic approach to estimate
coverage in BMC. The approach can easily be integrated in
a BMC tool with only minor changes. In our approach, a
coverage property is generated for each important signal. If
the considered properties do not describe the signal's entire
behavior, the coverage property fails and a counter-example
is generated. From the counter-example an uncovered scenario
can be derived. In this way the approach also helps
in design understanding. Our method is demonstrated on
a RISC CPU. Based on the results we identified coverage
gaps. We were able to close all of them and achieved 100%
functional coverage.
-
Abstraction and Refinement Techniques in Automated Design Debugging [p. 1182]
-
S. Safarpour and A. Veneris
Verification is a major bottleneck in the VLSI design
flow with the tasks of error detection, error localization, and error
correction consuming up to 70% of the overall design effort.
This work proposes a departure from conventional debugging
techniques by introducing abstraction and refinement during
error localization. Under this new framework, existing debugging
techniques can handle large designs with long counter-examples
yet remain run time and memory efficient. Experiments on
benchmark and industrial designs confirm the effectiveness of
the proposed framework and encourage further development of
abstraction and refinement methodologies for existing debugging
techniques.
-
Automatic Hardware Synthesis from Specifications: A Case Study [p. 1188]
-
R. Bloem, S. Galler, B. Jobstmann, N. Piterman, A. Pnueli and M. Weiglhofer
We propose to use a formal specification language as
a high-level hardware description language. Formal languages
allow for compact, unambiguous representations
and yield designs that are correct by construction. The idea
of automatic synthesis from specifications is old, but used to
be completely impractical. Recently, great strides towards
efficient synthesis from specifications have been made. In
this paper we extend these recent methods to generate compact
circuits and we show their practicality by synthesizing
an arbiter for ARM's AMBA AHB bus and a generalized
buffer from specifications given in PSL. These are the first
industrial examples that have been synthesized automatically
from their specifications.
Moderators: R. Suaya, Mentor Graphics, FR; P. Feldmann, IBM T J Watson Research Center, US
-
pFFT in FastMaxwell: A Fast Impedance Extraction Solver for 3D Conductor Structures over Substrate [p. 1194]
-
T. Moselhy, X. Hu and L. Daniel
In this paper we describe the acceleration algorithm
implemented in FastMaxwell, a program for wideband electromagnetic
extraction of complicated 3D conductor structures
over substrate. FastMaxwell is based on the integral domain
mixed potential integral equation (MPIE) formulation, with
3-D full-wave substrate dyadic Green's function kernel. Two
dyadic Green's functions are implemented. The pre-corrected
Fast Fourier Transform (pFFT) algorithm is generalized and used
to accelerate the translational invariant complex domain dyadic
kernel. Computational results are given for a variety of structures
to validate the accuracy and efficiency of FastMaxwell. O(NlogN)
computational complexity is demonstrated by our results in both
time and memory.
-
Optimization-based Wideband Basis Functions for Efficient Interconnect Extraction [p. 1200]
-
X. Hu, T. Moselhy, J. White and L. Daniel
This paper introduces a technique for the numerical
generation of basis functions that are capable of parameterizing
the frequency-variant nature of cross-sectional conductor
current distributions. Hence skin and proximity effects can be
captured using far fewer basis functions in comparison
to the prevalently-used piecewise-constant basis functions. One
important characteristic of these basis functions is that they only
need to be pre-computed once for a frequency range of interest
per unique conductor cross-sectional geometry, and they can be
stored off-line with a minimal associated cost. In addition, the
robustness of these frequency-independent basis functions is
enforced using an optimization routine. It has been demonstrated
that the cost of solving a complex interconnect system can
be reduced by a factor of 170 when compared to the use of
piecewise-constant basis functions over a wide range of operating
frequencies.
-
Thermally Robust Clocking Schemes for 3D Integrated Circuits [p. 1206]
-
M. Mondal, A.J. Ricketts, S. Kirolos, T. Ragheb, G. Link, N. Vijaykrishnan and Y. Massoud
3D integration of multiple active layers into a single chip is a
viable technique that greatly reduces the length of global wires
by providing vertical connections between layers. However, dissipating
the heat generated in 3D chips poses a major
challenge to the success of the technology and is the subject of
active current research. Since the generated heat degrades the
performance of the chip, thermally insensitive/adaptive circuit
design techniques are required for better overall system performance.
In this paper, we propose a thermally adaptive 3D clocking
scheme that dynamically adjusts the driving strengths of the
clock buffers to reduce the clock skew between terminals. We investigate
the relative merits and demerits of two alternative clock
tree topologies in this work. Simulation results demonstrate that
our adaptive technique is capable of reducing the skew by 61.65%
on the average, leading to much improved clock synchronization
and design performance in the 3D realm.
-
Double-Via-Driven Standard Cell Library Design [p. 1212]
-
T.-Y. Lin, T.-H. Lin, H.-H. Tung and R.-B. Lin
Double-via placement is important for increasing chip
manufacturing yield. Commercial tools and recent work
handle it well. However, they have a limited
capability for placing more double vias (called
via1) between metal 1 and metal 2. Such a limitation is
caused by the way the standard cells are designed and
cannot be resolved by developing better tools. This paper
presents a double-via-driven standard cell library design
approach to solving this problem. Compared to the results
obtained using a commercial cell library, our library on
average achieves 78% reduction in dead vias and 95%
reduction in dead via1s at the expense of 11% increase in
total via count. We achieve these results (almost) at no
extra cost in total cell area and wire length.
-
Analysis of Power Consumption and BER of Flip-flop Based Interconnect Pipelining [p. 1218]
-
J. Xu, A. Roy and M.H. Chowdhury
This paper addresses the problem of interconnect pipelining
from both the power consumption and bit error rate (BER) points of
view and tries to find the optimal solution for a given wire
pipelining scheme in nanometer scale very large scale
integration technologies. In this paper a detailed analysis for the
dependency of power consumption and BER on the number of
flip-flops inserted and repeater size is performed. For the best
tradeoff between the wire delay, BER and power consumption, a
methodology is developed to optimize the repeater size and the
number of flip-flops inserted which maximize a user-specified
figure of merit. Then this methodology is applied to calculate the
optimal solutions for some International Technology Roadmap
for Semiconductors (ITRS) technology nodes.
Organizers: L. Pozzi, Lugano U, CH; P. Paulin, STMicroelectronics, CA
Moderator: P. Paulin, STMicroelectronics, CA
-
A Future of Customizable Processors: Are We There Yet? [p. 1224]
-
L. Pozzi and P. G. Paulin
Customizable processors are being used increasingly often
in SoC designs. During the past few years, they have
proven to be a good way to solve the conflicting flexibility
and performance requirements of embedded systems design.
While their usefulness has been demonstrated in a
wide range of products, a few challenges remain to be addressed:
1) Is extending a standard core template the right
way to customization, or is it preferable to design a fully
customized core from scratch? 2) Is the automation offered
by current toolchains, in particular generation of complex
instructions and their reuse, enough for what users would
like to see? 3) And when we look at the future with the increasing
use of multi-processor SoCs, do we see a sea of
identical customized processors, or a heterogeneous mix?
We comment and elaborate here on these challenges and
open questions.
Moderators: J. Dielissen, NXP Research, NL ; T. Shiple, Synopsys, FR
-
Fast and Accurate Routing Demand Estimation for Efficient Routability-driven Placement [p. 1226]
-
P. Spindler and F.M. Johannes
This paper presents a fast and accurate routing demand
estimation called RUDY and its efficient integration in a force-directed
quadratic placer to optimize placements for routability.
RUDY is based on a Rectangular Uniform wire DensitY per net and
accurately models the routing demand of a circuit as determined by the
wire distribution after final routing. Unlike published routing demand
estimation, RUDY depends neither on a bin structure nor on a certain
routing model to estimate the behavior of a router. Therefore RUDY is
independent of the router.
Our fast and robust force-directed quadratic placer is based on a
generic demand-and-supply model and is guided by the routing demand
estimation RUDY to optimize placements for routability. This yields a
placer which simultaneously reduces the routing demand in congested
regions and increases the routing supply there. Therefore our placer
fully utilizes the potential to optimize the routability. This results in the
best published routed wirelength of the IBMv2 benchmark suite to
date. In detail, our approach outperforms mPL, ROOSTER, and APlace
by 9%, 8%, and 5%, respectively. Measured against the CPU time that
ROOSTER needs to place this benchmark, our routability optimization
placer is eight times faster.
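As a rough illustration of the "rectangular uniform wire density per net" idea, the sketch below spreads each net's estimated wire area uniformly over its bounding box and sums the contributions at a query point. The half-perimeter wirelength estimate, the `wire_width` parameter, and all function names are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

Pin = Tuple[float, float]

def net_density(pins: List[Pin], wire_width: float = 1.0):
    """Return (bounding box, uniform wire density) for one net."""
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    x0, y0 = min(xs), min(ys)
    # Clamp to the wire width so degenerate nets do not get a zero-area box.
    w = max(max(xs) - x0, wire_width)
    h = max(max(ys) - y0, wire_width)
    # Assumed wire area: half-perimeter wirelength times wire width.
    wire_area = (w + h) * wire_width
    return (x0, y0, x0 + w, y0 + h), wire_area / (w * h)

def routing_demand(nets: List[List[Pin]], x: float, y: float,
                   wire_width: float = 1.0) -> float:
    """Sum the densities of all nets whose bounding box covers (x, y)."""
    total = 0.0
    for pins in nets:
        (x0, y0, x1, y1), density = net_density(pins, wire_width)
        if x0 <= x <= x1 and y0 <= y <= y1:
            total += density
    return total

nets = [[(0, 0), (4, 2)], [(1, 1), (3, 5)], [(10, 10), (12, 11)]]
demand_inside = routing_demand(nets, 2.0, 1.5)    # covered by the first two nets
demand_outside = routing_demand(nets, 20.0, 20.0)  # covered by no net
```

Because the estimate is evaluated directly at a coordinate, this sketch needs neither a bin grid nor a router model, which matches the independence claimed in the abstract.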
-
Yield-aware Placement Optimization [p. 1232]
-
P. Azzoni, M. Bertoletti, N. Dragone, F. Fummi, C. Guardiani and W. Vendraminetto
In this paper we describe a methodology addressing the
issue of avoiding yield hazardous cell abutments during
placement. This is made possible by accurate
characterization of the yield penalty associated with
particular cell-to-cell interactions. Of course
characterizing all possible cell abutments in a library of
600+ cells is impractical. We will describe some simple
heuristics that attempt to resolve the cell abutment precharacterization
complexity. Finally we will show a
possible implementation of the proposed yield-aware
placement optimization methodology and demonstrate the
potential of cell interaction penalty characterization for a
90nm design test case.
-
Microarchitecture Floorplanning for Sub-threshold Leakage Reduction [p. 1238]
-
H. Mogal and K. Bazargan
Lateral heat conduction between modules affects the temperature
profile of a floorplan, affecting the leakage power of individual
blocks which increasingly is becoming a larger fraction of the
overall power consumption with scaling of fabrication technologies.
By modeling temperature dependent leakage power within
a microarchitecture-aware floorplanning process, we propose a
method that reduces sub-threshold leakage power. To that end,
two leakage models are used: a transient formulation independent
of any leakage power model and a simpler formulation derived
from an empirical leakage power model, both showing good fidelity
to detailed transient simulations. Our algorithm can reduce
sub-threshold leakage by up to 15% with a minor degradation in
performance, compared to a floorplanning process that does not
model leakage. We also show the importance of modeling whitespace
during floorplanning and its impact on leakage savings.
Organizers: S. Prudhomme, Airbus, FR; E. Lansard, Alcatel Alenia Space, FR
Moderator: E. Lansard, Alcatel Alenia Space, FR
-
Industrial Applications [p. 1244]
-
X. Olive, J.-M. Pasquet and D. Flament
This first technical session is further developing the technological dimensions of technology
transfer, with three illustrations of successful and representative industrial applications. It
covers the cases of embedded autonomy in spacecraft applications, advanced avionics
solutions for satellite communication and component hybridization for navigation.
-
Flying Embedded: The Industrial Scene and Challenges for Embedded Systems in Aeronautics
and Space [p. 1246]
-
J. Botti
This keynote address, given by an executive representative of the European aeronautics and space
industry, introduces the strategic stakes and the international competitive landscape, for further
development and understanding of the sizing dimensions of technology transfer all along the special
day.
Moderators: R. Locatelli, STMicroelectronics, IT; R. Pacalet, ENST, FR
-
Compact Hardware Design of Whirlpool Hashing Core [p. 1247]
-
T. Alho, P. Hämäläinen, M. Hännikäinen and T.D. Hämäläinen
Weaknesses have recently been found in the widely used
cryptographic hash functions SHA-1 and MD5. A potential
alternative for these algorithms is the Whirlpool hash
function, which has been standardized by ISO/IEC and
evaluated in the European research project NESSIE. In
this paper we present a Whirlpool hashing hardware core
suited for devices in which low cost is desired. The core
consists of a novel 8-bit architecture that allows compact
realizations of the algorithm. In the Xilinx Virtex-II
Pro XC2VP40 FPGA, our implementation consumes 376
slices and achieves the throughput of 81.5 Mbit/s. The resource
utilization of our design is one fourth of the smallest
Whirlpool implementation presented to date.
-
An Efficient Polynomial Multiplier in GF(2^m) and Its Application to ECC Designs [p. 1253]
-
S. Peter and P. Langendörfer
In this paper we discuss approaches that allow the construction of
efficient polynomial multiplication units. Such multipliers
are the most important components of ECC hardware
accelerators. The proposed hRAIK multiplication improves
energy consumption, the longest path, and required
silicon area compared to state of the art approaches. We use
such a core multiplier to construct an efficient sequential
polynomial multiplier based on the known iterative Karatsuba
method. Finally, we exploit the beneficial properties
of the design to build an ECC accelerator. The design for
GF(2^233) requires about 1.4 mm^2 cell area in a 0.25 μm technology
and needs 80 μs for an EC point multiplication.
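For readers unfamiliar with Karatsuba multiplication over GF(2), the following is a minimal software sketch. It is not the hRAIK unit described in the abstract, and it is recursive rather than iterative. Polynomials are stored as Python integers with bit i holding the coefficient of x^i. The function names and the `base_bits` cutoff are assumptions chosen for illustration.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less multiplication of two GF(2) polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # addition in GF(2) is XOR
        a <<= 1
        b >>= 1
    return result

def karatsuba_gf2(a: int, b: int, bits: int, base_bits: int = 8) -> int:
    """Multiply two polynomials of fewer than `bits` coefficients each."""
    if bits <= base_bits:
        return clmul(a, b)
    half = (bits + 1) // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = karatsuba_gf2(a_lo, b_lo, half, base_bits)
    hi = karatsuba_gf2(a_hi, b_hi, half, base_bits)
    # One extra multiplication replaces the two cross products.
    mid = karatsuba_gf2(a_lo ^ a_hi, b_lo ^ b_hi, half, base_bits)
    return (hi << (2 * half)) ^ ((mid ^ lo ^ hi) << half) ^ lo

# Two arbitrary 233-bit operands; the result is not yet reduced modulo
# a field polynomial.
a = (1 << 232) | 0x123456789ABCDEF
b = (1 << 200) | 0xFEDCBA9876543210
product = karatsuba_gf2(a, b, 233)
```

Each recursion level trades one of the four half-size multiplications for a few XORs, which is why the method is attractive when multiplier area dominates.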
-
Flexible Hardware Reduction for Elliptic Curve Cryptography in GF(2^m) [p. 1259]
-
S. Peter, P. Langendörfer and K. Piotrowski
In this paper we discuss two ways to provide flexible
hardware support for the reduction step in Elliptic
Curve Cryptography in binary fields (GF(2^m)). In our first
approach we are using several dedicated reduction units
within a single multiplier. Our measurement results show
that this simple approach leads to an additional area consumption
of less than 10% compared to a dedicated design
without performance penalties. In our second approach
any elliptic curve cryptography up to a predefined maximal
length can be supported. Here we take advantage of the features
of commonly used reduction polynomials. Our results
show a significant area penalty compared to dedicated designs.
However, we achieve flexibility and the performance
is still significantly better than those of known ECC hardware
accelerator approaches with similar flexibility or even
software implementations.
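The reduction step itself can be sketched in software as follows. This is a generic bit-serial reduction modulo x^m + (low-order terms), not either of the hardware approaches in the abstract. The function name and the `low_terms` encoding are assumptions. The example polynomial x^233 + x^74 + 1 is the NIST trinomial for the 233-bit binary field.

```python
def reduce_gf2(value: int, m: int, low_terms: int) -> int:
    """Reduce `value` modulo x^m + low_terms, where deg(low_terms) < m.

    Polynomials are integers with bit i holding the coefficient of x^i.
    """
    while value.bit_length() > m:
        shift = value.bit_length() - 1 - m
        # x^(m + shift) is congruent to low_terms * x^shift, so clear the
        # leading bit and fold the low-order terms back in.
        value ^= (1 << (m + shift)) | (low_terms << shift)
    return value

# GF(2^233) with the trinomial x^233 + x^74 + 1.
M233 = 233
LOW233 = (1 << 74) | 1
reduced = reduce_gf2(1 << 233, M233, LOW233)
```

Because `m` and `low_terms` are parameters, the same routine serves any field size. The price is that it processes one bit per iteration, which loosely mirrors the flexibility versus cost trade-off the abstract reports.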
-
Overcoming Glitches and Dissipation Timing Skews in Design of DPA-Resistant Cryptographic
Hardware [p. 1265]
-
K.J. Lin, S.C. Fang, S.-H. Yang, and C.C. Lo
Cryptographic embedded systems are vulnerable to
Differential Power Analysis (DPA) attacks. In this
paper, we propose a logic design style called Precharge
Masked Reed-Muller Logic (PMRML) to
overcome the glitch and Dissipation Timing Skew (DTS)
problems in design of DPA-resistant cryptographic
hardware. Both problems can significantly reduce the
DPA-resistance. To our knowledge, the DTS problem
and its countermeasure have not been reported. The
PMRML design can be fully realized using common
CMOS standard cell libraries. Furthermore, it can be
used to implement universal functions since any
Boolean function can be represented in Reed-Muller
form. An AES encryption module was
implemented with multi-stage PMRML. The results
show the efficiency and effectiveness of the PMRML
design methodology.
Moderators: A. Rubio, UP Catalunya, ES; S. Mir, TIMA Laboratory, FR
-
Dynamic Critical Resistance: A Timing-Based Critical Resistance Model for Statistical
Delay Testing of Nanometer ICs [p. 1271]
-
J.L. Rosselló, C. de Benito, S.A. Bota, J. Segura
As CMOS IC feature sizes shrink down to the nanometer
regime, the need for more efficient test methods capable of
dealing with new failure mechanisms increases. Advances in this
domain require detailed knowledge of the physical properties of
these failures and the development of appropriate test methods.
Several works have shown the relative increase of resistive
defects (both opens and shorts), and that they mainly affect
circuit timing rather than impacting its static DC behavior.
Defect evolution, together with the increase of parameter
variations, represents a serious challenge for traditional delay
test methods based on fixed time delay limit setting. One
alternative to deal with variation relies on adopting correlation
where test limits for one parameter are set based on its
correspondence to other circuit variables. In particular, the
correlation of circuit delay to reduced VDD has been proposed as
a useful test method. In this work we investigate the merits of this
technique for future technologies where variation is predicted to
increase, analyzing the possibilities of detecting resistive shorts
and opens.
-
Sensitivity Analysis for Fault-analysis and Tolerance in RF Front-end Circuitry [p. 1277]
-
T. Das and P.R. Mukund
RFIC reliability is fast becoming a major bottleneck in
the yield and performance of modern IC systems, as
process complexity and levels of integration continually
increase. Due to high frequencies involved, testing these
chips is both complicated and expensive. While the
areas of Automated testing and Self-test have received
significant attention over the past few years, no formal
framework of fault-models or sensitivity-models exists in
the RF domain. This paper describes a Sensitivity
Analysis methodology as a first step towards such a
framework. It is applied towards a Low Noise Amplifier,
and a case-study application is discussed by using
design and experimental results of an adaptive LNA
designed in the IBM6RF 0.25 μm CMOS process.
-
A Two-Tone Test Method for Continuous-Time Adaptive Equalizers [p. 1283]
-
D. Hong, S. Sabri, K.-T. Cheng and C.P. Yue
This paper describes a novel test method for
continuous-time adaptive equalizers. This technique
applies a two-sinusoidal-tone signal as stimulus and
includes an RMS detector for testing, which incurs no
performance degradation and a very small area overhead.
To validate the technique, we used a recently published
adaptive equalizer as our test case and conducted both
behavioral and transistor-level simulations. Simulation
results demonstrate that the technique is effective in
detecting defects in the equalizer, which might not be
easily detected by the conventional eye-diagram method.
-
Worst-Case Design and Margin for Embedded SRAM [p. 1289]
-
R. Aitken and S. Idgunji
An important aspect of Design for Yield for
embedded SRAM is identifying the expected worst case
behavior in order to guarantee that sufficient design
margin is present. Previously, this has involved
multiple simulation corners and extreme test
conditions. It is shown that statistical concerns and
device variability now require a different approach,
based on work in Extreme Value Theory. This method is
used to develop a lower-bound for variability-related
yield in memories.
-
Pulse Propagation for the Detection of Small Delay Defects [p. 1295]
-
M. Favalli and C. Metra
This paper addresses the problems related to resistive
opens and bridging faults which cannot be detected using
delay fault testing because they lie out of the most critical
paths. Even if the induced defect is not large enough to result
in timing violations, these faults may give rise to reliability
problems. To detect them, we propose a testing method that
is based on the propagation of pulses within the faulty circuit
and that exploits the degraded capability of faulty paths to
propagate pulses. The effectiveness of the proposed method
is analyzed at the electrical level and compared with the use
of reduced clock period which can detect the same class of
faults. Results show similar performance in the case of resistive
opens and better performance in the case of bridgings.
Moreover, the proposed approach is not affected by problems
on the clock distribution network.
-
BIST Method for Die-Level Process Parameter Variation Monitoring in Analog/Mixed-Signal
Integrated Circuits [p. 1303]
-
A. Zjajo, M.J. Barragan Asian and J. Pineda de Gyvez
This paper reports a new built-in self-test
scheme for analog and mixed-signal devices based on
die-level process monitoring. The objective of this test is
not to replace traditional specification-based tests, but to
provide a reliable method for early identification of
excessive process parameter variations in production
tests that allows faulty circuits to be discarded quickly.
Additionally, the possibility of on-chip process deviation
monitoring provides valuable information, which is used
to guide the test and to allow the estimation of selected
performance figures. The information obtained through
guiding and monitoring process variations is re-used
to supplement the circuit calibration.
Moderators: R. Bloem, TU Graz, AT; R. Drechsler, Bremen U, DE
-
A New Hybrid Solution to Boost SAT Solver Performance [p. 1307]
-
L. Fang and M.S. Hsiao
Due to the widespread demands for efficient SAT solvers in
Electronic Design Automation applications, methods to boost
the performance of the SAT solver are highly desired. We propose
a Hybrid Solution to boost SAT solver performance in this
paper, via an integration of local and DPLL-based search approaches.
A local search is used to identify a subset of clauses
to be passed to a DPLL SAT solver through an incremental interface.
In addition, the solution obtained by the DPLL solver
on the subset of clauses is fed back to the local search solver to
jump over any locally optimal points. The proposed solution is
highly portable to the existing SAT solvers. For satisfiable instances,
up to an order of magnitude speedup can be obtained
via the proposed hybrid solver.
-
QuteSAT: A Robust Circuit-based SAT Solver for Complex Circuit Structure [p. 1313]
-
C.-A. Wu, T.-H. Lin, C.-C. Lee and C.-Y. Huang
We propose a robust circuit-based Boolean
Satisfiability (SAT) solver, QuteSAT, that can be applied
to complex circuit netlist structure. Several novel
techniques are proposed in this paper, including: (1) a
generic watching scheme on general gate types for
efficient Boolean Constraint Propagation (BCP), (2) an
implicit implication graph representation for efficient
learning, and (3) careful engineering on the most
advanced SAT algorithms for the circuit-based data
structure. Our experimental results show that our baseline
solver, without taking advantage of the circuit
information, can achieve the same performance as the
fastest Conjunctive Normal Form (CNF)-based solvers.
We also demonstrate that by applying a simple circuit-oriented
decision ordering technique (J-frontier), our
solver can consistently outperform the CNF-based ones by more
than 15 times. With the great flexibility of the circuit-based
data structure, our solver can serve as a solid
foundation for the general SAT research in the future.
-
Boosting the Role of Inductive Invariants in Model Checking [p. 1319]
-
G. Cabodi, S. Nocco and S. Quer
This paper focuses on inductive invariants in unbounded
model checking to improve efficiency and scalability.
First of all, it introduces optimized techniques to speed up
the computation of inductive invariants, considering
both equivalences and implications between pairs of nodes
in the logic network. Secondly, it presents a very efficient
dynamic procedure, based on an incremental SAT approach,
to reduce the set of checked invariants. Finally, it
shows how to effectively integrate inductive invariant computations
with state-of-the-art model checking procedures.
Experiments address different property verification aspects,
and specifically consider cases where inductive invariants
alone are not sufficient for the final proof.
-
Image Computation and Predicate Refinement for RTL Verilog Using Word Level Proofs [p. 1325]
-
D. Kroening and N. Sharygina
Automated abstraction is the enabling technique for
model checking large circuits. Predicate Abstraction is one
of the most promising abstraction techniques. It relies on
the efficient computation of predicate images and the right
choice of predicates. Existing algorithms use a net-list-level
circuit model for computing predicate images. 1) This paper
describes a proof-based algorithm that computes an
over-approximation of the predicate image at the word level,
and thus, scales to larger circuits. 2) The previous
work relies on the computation of the weakest preconditions
in order to refine the set of predicates. In contrast to that,
we propose to extract predicates from a word-level proof to
refine the set of predicates.
Moderators: A. Darte, ENS Lyon, FR; H. van Someren, ACE Associated Compiler Experts, NL
-
Polynomial-Time Subgraph Enumeration for Automated Instruction Set Extension [p. 1331]
-
P. Bonzini and L. Pozzi
This paper proposes a novel algorithm that, given a
data-flow graph and an input/output constraint, enumerates
all convex subgraphs under the given constraint in polynomial
time with respect to the size of the graph. These
subgraphs have been shown to represent efficient Instruction
Set Extensions for customizable processors. The search
space for this problem is inherently polynomial but, to our
knowledge, this is the first paper to prove this and to present
a practical algorithm for this problem with polynomial complexity.
Our algorithm is based on properties of convex subgraphs
that link them to the concept of multiple-vertex dominators.
We discuss several pruning techniques that, without
sacrificing the optimality of the algorithm, make it practical
for data-flow graphs of a thousand nodes or more.
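The convexity condition that the enumeration relies on can be illustrated with a small check. A subgraph of a data-flow DAG is convex when no path leaves it and later re-enters it. The sketch below tests only that property. It is not the polynomial-time enumeration algorithm of the paper, and the graph encoding and function name are assumptions.

```python
from typing import Dict, List, Set

def is_convex(succ: Dict[str, List[str]], subgraph: Set[str]) -> bool:
    """Return True if no path exits `subgraph` and later re-enters it."""
    # Start from the outside nodes that are direct successors of the subgraph.
    stack = [v for u in subgraph for v in succ.get(u, []) if v not in subgraph]
    seen: Set[str] = set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for nxt in succ.get(node, []):
            if nxt in subgraph:
                return False   # an outside path re-enters the subgraph
            stack.append(nxt)
    return True

# Diamond-shaped data-flow graph: a feeds b and c, which both feed d.
dfg = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
convex_ab = is_convex(dfg, {"a", "b"})
convex_abd = is_convex(dfg, {"a", "b", "d"})   # path a -> c -> d leaves and returns
```

A non-convex candidate such as {a, b, d} cannot be scheduled as a single custom instruction, because the outside node c would need a result from the instruction before the instruction itself has completed.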
-
Interrupt and Low-level Programming Support for Expanding the Application Domain of
Statically-Scheduled Horizontally-Microcoded Architectures in Embedded Systems [p. 1337]
-
M. Reshadi and D. Gajski
The increasing role of software in embedded systems has
made the processor an important component in these systems. However,
to meet the tight constraints of embedded applications, it is often
required to customize the processor for the application. Customizing
instruction-based processors is difficult and very challenging. Design
approaches based on statically-scheduled horizontal-microcoded
architectures have been proposed to simplify the architecture
customization. In these approaches, first the datapath is specified by
the designer, and then the operations of the datapath are extracted
automatically. Since the operations are statically scheduled in these
architectures (i) low-level programming using assembly is impossible
or very tedious; and (ii) execution of programs cannot be interrupted
arbitrarily. In this paper, we address the above problems. We show
how to efficiently handle interrupts in such architectures and also
propose an elegant way of controlling low-level hardware resources
in a general way in C language. We also show that after adding
interrupt and low-level programming we could use the above
architectural style in a multi-core system to implement a complete
MP3 decoder that can process 122 frames per second while the
standard requirement is 38 frames per second.
-
DRIM: A Low Power Dynamically Reconfigurable Instruction Memory Hierarchy for
Embedded Systems [p. 1343]
-
Z. Ge, W.-F. Wong and H.-B. Lim
Power consumption is of crucial importance to embedded
systems. In such systems, the instruction memory hierarchy
consumes a large portion of the total energy consumption.
A well designed instruction memory hierarchy
can greatly decrease the energy consumption and increase
performance. The performance of the instruction memory
hierarchy is largely determined by the specific application.
Different applications achieve better energy-performance
with different configurations of the instruction memory hierarchy.
Moreover, applications often exhibit different phases
during execution, each exacting different demands on the
processor and in particular the instruction memory hierarchy.
For a given hardware resource budget, an even better
energy-performance may be achievable if the memory hierarchy
can be reconfigured before each of these phases.
In this paper, we propose a new dynamically reconfigurable
instruction memory hierarchy to take advantage of
these two characteristics so as to achieve significant energy-performance
improvement. Our proposed instruction memory
hierarchy, which we call DRIM, consists of four banks
of on-chip instruction buffers. Each of these can be configured
to function as a cache or as a scratchpad memory
(SPM) according to the needs of an application and its execution
phases. Our experimental results using six benchmarks
from the MediaBench and the MiBench suites show
that DRIM can achieve significant energy reduction.
-
SoftSIMD - Exploiting Subword Parallelism Using Source Code Transformations [p. 1349]
-
S. Kraemer, R. Leupers, G. Ascheid and H. Meyr
SIMD instructions are used to speed up multimedia applications
in high performance embedded computing. Vendors
often use proprietary platforms which are incompatible
with others. Therefore, porting software is a very complex
and time consuming task. Moreover, lots of existing
embedded processors do not have SIMD extensions at all.
But they do provide a wide data path which is 32-bit or
wider. Usually, multimedia applications work on short data
types of 8 or 16-bit. Thus, only the lower bits of the data
path are used and therefore only a fraction of the available
computing power is exploited for such algorithms. This
paper discusses the possibility to make use of the upper
bits of the data path by emulating true SIMD instructions.
These instructions are implemented purely in software using
a high level language such as C. Therefore, the application
can be modified by making use of source code transformations
which are inherently portable. The benefit of this
approach is that the computing resources are used more efficiently
without compromising the portability of the code.
Experiments have shown that a significant speedup can be
obtained by this approach.
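A common way to emulate a packed SIMD addition on an ordinary 32-bit data path is sketched below. It is a well-known bit-masking technique and is not claimed to be the exact transformation used in the paper. The helper names are assumptions. The sketch uses Python for brevity, but the same mask-and-XOR expression applies to a 32-bit unsigned integer in C.

```python
LANE_HIGH = 0x80808080   # top bit of each 8-bit lane
LANE_LOW = 0x7F7F7F7F    # lower seven bits of each lane

def packed_add_u8(a: int, b: int) -> int:
    """Add four 8-bit lanes packed into 32-bit words, wrapping per lane."""
    # Adding only the low seven bits keeps every carry inside its own lane.
    low_sum = (a & LANE_LOW) + (b & LANE_LOW)
    # The top bit of each lane is restored with a carry-free XOR.
    return (low_sum ^ ((a ^ b) & LANE_HIGH)) & 0xFFFFFFFF

def pack(lanes):
    """Pack four 8-bit values into one word, lane 0 in the low byte."""
    return sum((v & 0xFF) << (8 * i) for i, v in enumerate(lanes))

def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

result = unpack(packed_add_u8(pack([250, 1, 127, 200]), pack([10, 2, 1, 100])))
# result == [4, 3, 128, 44]: each lane wraps modulo 256 independently
```

One word-wide addition plus a few logical operations replaces four separate byte additions, which is the kind of gain from otherwise idle upper data-path bits that the abstract describes.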
-
A Process Splitting Transformation for Kahn Process Networks [p. 1355]
-
S. Meijer, B. Kienhuis, A. Turjan and E. de Kock
In this paper we present a process splitting transformation for Kahn
process networks. Running applications written in this parallel program
specification on a multiprocessor architecture does not guarantee
that the runtime requirements are met. Therefore, it may be
necessary to further analyze and optimize Kahn process networks.
In this paper, we will present a four-step transformation that results
in a functionally equivalent process network, but with a changed
and optimized network structure. The class of networks that can
be handled is not restricted to static networks. The novelty of this
approach is that it can also handle processes with dynamic program
statements. We will illustrate the transformation prototyped
in GCC for a JPEG decoder, showing a 21% performance improvement.
Moderators: S. Sapatnekar, Minnesota U, US; T. Shiple, Synopsys, FR
-
Computing Synchronizer Failure Probabilities [p. 1361]
-
S. Yang and M. Greenstreet
System-on-Chip designs often have a large number
of timing domains. Communication between these domains
requires synchronization, and the failure probabilities of these
synchronizers must be characterized accurately to ensure the
robustness of the complete system. We present a novel approach
for determining the failure probabilities of synchronizer circuits.
Our approach uses numerical integration to account for the nonlinear
behaviour of real synchronizer circuits. We complement
this with small-signal techniques to enable accurate estimation
of extremely small failure probabilities. Our approach is fully
automated, is suitable for integration into circuit simulation
tools such as SPICE and enables accurate characterization of
extremely small failure probabilities.
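For context, the textbook small-signal estimate of synchronizer reliability is MTBF = exp(t_r / tau) / (T_w * f_clk * f_data). The sketch below evaluates that formula only. It is not the numerical-integration method of the paper, and the parameter names and example values are assumptions chosen for illustration.

```python
import math

def synchronizer_mtbf(resolve_time: float, tau: float, window: float,
                      f_clk: float, f_data: float) -> float:
    """Mean time between failures in seconds (textbook small-signal model).

    resolve_time: time allowed for metastability to resolve (s)
    tau:          regeneration time constant of the latch (s)
    window:       metastability window T_w (s)
    f_clk:        sampling clock frequency (Hz)
    f_data:       rate of asynchronous data transitions (Hz)
    """
    return math.exp(resolve_time / tau) / (window * f_clk * f_data)

# Illustrative values only; they do not come from the paper.
base = synchronizer_mtbf(0.0, 50e-12, 1e-10, 1e9, 1e8)
one_tau_more = synchronizer_mtbf(50e-12, 50e-12, 1e-10, 1e9, 1e8)
```

Every additional `tau` of resolution time multiplies the MTBF by e. Failure probabilities therefore become extremely small, and accurate characterization of them is the problem the abstract addresses.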
-
Layout-Aware Gate Duplication and Buffer Insertion [p. 1367]
-
D. Bañeres, J. Cortadella and M. Kishinevsky
An approach for layout-aware interconnect optimization is
presented. It is based on the combination of three sub-problems into
the same framework: gate duplication, buffer insertion and placement.
Different techniques to control the combinatorial explosion are proposed.
The experimental results show tangible benefits in delay that endorse
the suitability of integrating the three sub-problems in the same
framework. The results also corroborate the increasing relevance of
interconnect optimization in future semiconductor technologies.
-
Self-Heating-Aware Optimal Wire Sizing under Elmore Delay Model [p. 1373]
-
M. Ni and S.O. Memik
Global interconnect temperature keeps rising in the current
and future technologies due to self-heating and the adiabatic
property of top metal layers. The thermal effects impact adversely
both reliability and performance of the interconnect
wire, shortening the interconnect lifetime and increasing the
interconnect delay. Such effects must be considered during
the process of interconnect design. In this paper, one important
argument is that the traditional linear dependence between
wire resistance and wire width is no longer adequate for
high layer interconnects due to the adiabatic property of these
wires. By using a curve-fitting technique, we propose a quadratic
model to represent the resistance of interconnect, which is
aware of the thermal effects. Based on this model and the
Elmore delay model, we derived a linear optimal wire sizing
formula in the form f(x) = ax + b. Compared to the non-thermal-aware
exponential wire sizing formula in the form f(x) = ae^{-bx},
we observed a 49.7% average delay gain with different choices
of physical parameters.
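To make the Elmore delay model concrete, the sketch below evaluates the delay of a wire cut into equal-length segments whose widths follow a sizing function such as f(x) = ax + b. It uses the classic thermal-unaware resistance model (resistance inversely proportional to width), not the quadratic thermal-aware model proposed in the paper. All parameter names and values are assumptions.

```python
from typing import List

def linear_widths(a: float, b: float, length: float, n: int) -> List[float]:
    """Sample f(x) = a*x + b at the midpoints of n equal segments."""
    dx = length / n
    return [a * (i + 0.5) * dx + b for i in range(n)]

def elmore_delay(widths: List[float], length: float, r_sheet: float,
                 c_area: float, c_fringe: float,
                 r_driver: float, c_load: float) -> float:
    """Elmore delay from driver to load for a segmented wire."""
    dx = length / len(widths)
    seg_r = [r_sheet * dx / w for w in widths]
    seg_c = [(c_area * w + c_fringe) * dx for w in widths]
    downstream = c_load + sum(seg_c)
    delay = r_driver * downstream        # the driver sees all capacitance
    for r, c in zip(seg_r, seg_c):
        downstream -= c
        # Each segment resistance sees half of its own capacitance
        # plus everything further down the wire.
        delay += r * (downstream + c / 2)
    return delay

uniform = elmore_delay([1.0] * 4, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0)
with_driver = elmore_delay([1.0], 1.0, 1.0, 1.0, 0.0, 1.0, 1.0)
tapered = elmore_delay(linear_widths(-0.5, 1.25, 1.0, 4), 1.0, 1.0, 1.0, 0.1, 1.0, 1.0)
```

For a uniform wire with no driver or load the result reduces to the familiar RC/2. Candidate sizing functions can be compared by passing different width lists to `elmore_delay`.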
Moderators: M. Zwolinski, Southampton U, UK; F. Gaffiot, Ecole Centrale de Lyon, FR
-
Statistical Blockade: A Novel Method for Very Fast Monte Carlo Simulation of Rare Circuit
Events, and Its Application [p. 1379]
-
A. Singhee and R.A. Rutenbar
Circuit reliability under statistical process variation is an area of
growing concern. For highly replicated circuits such as SRAMs and
flip flops, a rare statistical event for one circuit may induce a not-so-rare
system failure. Existing techniques perform poorly when tasked
to generate both efficient sampling and sound statistics for these rare
events. Statistical Blockade is a novel Monte Carlo technique that allows
us to efficiently filter - to block - unwanted samples insufficiently
rare in the tail distributions we seek. The method synthesizes
ideas from data mining and Extreme Value Theory, and shows speedups
of 10X-100X over standard Monte Carlo.
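The blocking idea can be sketched as follows, under loud assumptions: the "simulation" is a made-up one-parameter function, the filter is a plain least-squares line rather than the data-mining classifier the abstract alludes to, and the Extreme Value Theory tail fit is omitted entirely. All function names and thresholds are hypothetical.

```python
import random

def simulate(x):
    """Stand-in for an expensive circuit simulation (hypothetical metric)."""
    return 2.0 * x + 0.1 * x * x

def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def blockade(n_train=500, n_total=20000, tail_q=0.99, guard_q=0.97, seed=0):
    rng = random.Random(seed)
    # 1. Small initial Monte Carlo run in which every sample is simulated.
    train_x = [rng.gauss(0.0, 1.0) for _ in range(n_train)]
    train_y = [simulate(x) for x in train_x]
    ranked = sorted(train_y)
    tail_t = ranked[int(tail_q * n_train)]    # tail threshold of interest
    guard_t = ranked[int(guard_q * n_train)]  # relaxed threshold to limit misses
    # 2. Cheap predictor of the metric from the sampled parameter.
    slope, intercept = fit_line(train_x, train_y)
    # 3. Draw many samples but simulate only those the predictor lets through.
    simulated, tail = 0, []
    for _ in range(n_total):
        x = rng.gauss(0.0, 1.0)
        if slope * x + intercept < guard_t:
            continue  # blocked: predicted to lie well below the tail
        simulated += 1
        y = simulate(x)
        if y >= tail_t:
            tail.append(y)
    return simulated, len(tail)

simulated, n_tail = blockade()
print(f"simulated {simulated} of 20000 samples, {n_tail} in the tail")
```

The relaxed guard threshold trades a few extra simulations for a lower chance of blocking a genuine tail sample when the predictor is imperfect.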
-
Clock Domain Crossing Fault Model and Coverage Metric for Validation of SoC Design [p. 1385]
-
Y. Feng, Z. Zhou, D. Tong and X. Cheng
Multiple asynchronous clock domains have been increasingly
employed in System-on-Chip (SoC) designs for
different I/O interfaces. Functional validation is one of the
most expensive tasks in the SoC design process. Simulation
on register transfer level (RTL) is still the most widely
used method. It is important to quantitatively measure the
validation confidence and progress for clock domain crossing
(CDC) designs. In this paper, we propose an efficient
method for definition of CDC coverage, which can be used
in RTL simulation for a multi-clock domain SoC design.
First, we develop a CDC fault model to present the actual
effect of metastability. Second, we use a temporal data flow
graph (TDFG) to propagate the CDC faults to observable
variables. Finally, CDC coverage is defined based on the
CDC faults and their observability. Our experiments on a
commercial IP demonstrate that this method is useful to find
CDC errors early in the design cycles.
-
Fast Statistical Circuit Analysis with Finite-Point Based Transistor Model [p. 1391]
-
M. Chen, W. Zhao, F. Liu and Y. Cao
A new approach of transistor modeling is
developed for fast statistical circuit simulation in the
presence of variations. For both I-V and C-V
characteristics of a transistor, finite data points are
identified by their physical meaning; the impact of
process and design variations is embedded into these
points as closed-form expressions. Then, the entire I-V
and C-V are extrapolated using polynomial formulas.
This novel approach significantly enhances the
simulation speed with sufficient accuracy. The model is
implemented in Verilog-A at 65nm node. Compared to
simulations with the BSIM model, the computation
time can be reduced by 7x in transient analysis and 9x
in Monte-Carlo simulations.
-
Statistical Simulation of High-Frequency Bipolar Circuits [p. 1397]
-
W. Schneider, M. Schroter, W. Kraus and H. Wittkopf
This paper describes a physics-based methodology for
computationally efficient statistical modeling of high-frequency
bipolar transistors along with its practical
implementation into a production process design kit.
Applications to statistical modeling, circuit simulation, and
yield optimization are demonstrated for an opamp circuit.
Experimental results are shown that verify the
methodology.
Organizers/Moderators: S. Prudhomme, Airbus, FR; E. Lansard, Alcatel Alenia Space, FR
-
Development and Industrialization [p. 1403]
-
M. Riffiod, P. Caspi, C. Piala and J.-L. Voirin
This second technical session illustrates the methodological dimensions of technology transfer. It
elaborates on some methodologies deployed in critical steps of the whole embedded systems
development process, particularly to specify safety critical embedded systems, to manage
obsolescence of components and to certify the airworthiness of the final solutions.
Moderators: C. Heer, Infineon Technologies, DE; O. Deprez, Texas Instruments, FR
-
Low Power Design on Algorithmic and Architectural Level: A Case Study of an HSDPA Baseband
Digital Signal Processing System [p. 1406]
-
M. Schämann, S. Hessel, U. Langmann and M. Bücker
The optimization of power consumption plays a key role
in the design of a cellular system: Increasing data rates
together with high mobility represent a constantly growing
design challenge because advanced algorithms are required
with a higher complexity, more chip area and increased
power consumption which contrast with limited power
supply. In this contribution, digital baseband components
for a High Speed Downlink Packet Access (HSDPA) system
are optimized on algorithmic and architectural level.
Three promising algorithms for the equalization of the
propagation channel are compared regarding performance,
complexity and power consumption using fixed-point
SystemC models. On architectural level an adaptive control
unit is introduced together with an output interference
analyzer. The presented strategy reduces the arithmetic
operations for convenient propagation conditions up to
70% which relates to an estimated power reduction of up
to 40% while the overall performance is not affected.
-
Mapping the Physical Layer of Radio Standards to Multiprocessor Architectures [p. 1412]
-
C. Grassmann, M. Richter and M. Sauermann
We are concerned with the software implementation
of baseband processing for the physical layer of radio
standards ("Software Defined Radio - SDR"). Given
the constraints for mobile terminals with respect to
power consumption, chip area and performance, nonstandard
architectures without compiler support are
the targets an SDR implementation has to face. For this
domain we present a way to safely move from a functional
model to the assembly level in order to come to
a tested multithreaded optimized implementation in
manageable time.
We carried out this program for the standards WLAN
IEEE 802.11b and 3GPP WCDMA exploiting various
levels of parallelism: thread level parallelism
("MIMD"), data level parallelism ("SIMD") and instruction
level parallelism ("VLIW"). We came up
with a software implementation running in real time
on Infineon's programmable Multiple SIMD Core
(MuSIC) processor.
-
Development of an ASIP Enabling Flows in Ethernet Access Using a Retargetable Compilation Flow [p. 1418]
-
K. Van Renterghem, P. Demuytere, D. Verhulst, J. Vandewege and X.-Z. Qiu
In this paper we research an FPGA based Application
Specific Instruction Set Processor (ASIP) tailored to the
needs of a flow aware Ethernet access node using a retargetable
compilation flow. The toolchain is used to develop
an initial processor design, assess the performance and identify
the potential bottlenecks.
A second design iteration results in a fully optimized
ASIP with a VLIW instruction set that allows for a high degree
of parallelism among the functional units inside the
ASIP and has dedicated instructions to accelerate typical
packet processing tasks. This way, a single processor is
capable of handling the complete throughput of a gigabit
Ethernet link.
To reach the target of a 10 Gbit/s Ethernet access node
several processors operate in parallel in a multicore environment.
-
An Effective AMS Top-Down Methodology Applied to the Design of a Mixed-Signal
UWB System-on-Chip [p. 1424]
-
M. Crepaldi, M.R. Casu, M. Graziano and M. Zamboni
The design of Ultra Wideband (UWB) mixed-signal SoC
for localization applications in wireless personal area networks
is currently investigated by several researchers. The
complexity of the design calls for effective top-down
methodologies. We propose a layered approach based on
VHDL-AMS for the first design stages and on an intelligent
use of a circuit-level simulator for the transistor-level
phase. We apply the latter just to one block at a time
and wrap it within the system-level VHDL-AMS description.
This method captures the impact of circuit-level design
choices and non-idealities on system performance. To
demonstrate the effectiveness of the methodology we show
how the refinement of the design affects specific UWB system
parameters such as bit-error rate and localization estimations.
-
Behavioral Modeling of Delay-Locked Loops and Its Application to Jitter Optimization in
Ultra Wide-Band Impulse Radio Systems [p. 1430]
-
E. Barajas, R. Cosculluela, D. Coutinho, D. Mateo, J. L. González, I. Cairò, S. Banda, M. Ikeda
This paper presents a behavioral model of a delay-locked
loop (DLL) used to generate the timing signals in an
integrated ultra wide-band (UWB) impulse radio (IR)
system. The requirements of these timing signals in the
context of UWB-IR systems are reviewed. The behavioral
model includes a modeling of the various noise sources in
the DLL that produce output jitter. The model is used to
find the optimum loop filter capacitor value that minimizes
output jitter. The accuracy of the behavioral model is
validated by comparing the system level simulation results
with transistor level simulations of the whole DLL.
Moderators: C. Metra, Bologna U, IT; B. Gottlieb, Intel, US
-
Soft Error Rate Analysis for Sequential Circuits [p. 1436]
-
N. Miskov-Zivanov and D. Marculescu
Due to reduction in device feature size and supply voltage, the
sensitivity to radiation induced transient faults (soft errors) of digital
systems increases dramatically. Intensive research has been done so
far in modeling and analysis of combinational circuit susceptibility to
soft errors, while sequential circuits have received much less
attention. In this paper, we present an approach for evaluating the
susceptibility of sequential circuits to soft errors. The proposed
approach uses symbolic modeling based on BDDs/ADDs and
probabilistic sequential circuit analysis. The SER evaluation is
demonstrated by the set of experimental results, which show that, for
most of the benchmarks used, the SER decreases well below a given
threshold (10^-7 FIT) within ten clock cycles after the hit. The results
obtained with the proposed symbolic framework are within 4%
average error and up to 11000X faster when compared to HSPICE
detailed circuit simulation.
-
Verification-Guided Soft Error Resilience [p. 1442]
-
S.A. Seshia, W. Li and S. Mitra
Algorithmic techniques for formal verification can be used
not just for bug-finding, but also to estimate vulnerability
to reliability problems and to reduce overheads of circuit
mechanisms for error resilience. We demonstrate this idea
of verification-guided error resilience in the context of soft
errors in latches. We show how model checking can be
used to identify latches in a circuit that must be protected
in order that the circuit satisfies a formal specification. Experimental
results on a Verilog implementation of the ESA
SpaceWire communication protocol indicate that the power
overhead of soft error protection can be reduced by a factor
of 4.35 by using our approach rather than protecting all
latches.
-
A Low-SER Efficient Core Processor Architecture for Future Technologies [p. 1448]
-
E.L. Rhod, C.A. Lisboa and L. Carro
Device scaling in new and future technologies brings
along severe increase in the soft error rate of circuits, for
combinational and sequential logic. Although potential
solutions have started to be investigated by the
community, the full use of future resources in circuits
tolerant to SETs, without performance, area or power
penalties, is still an open research issue. This paper
introduces MemProc, an embedded core processor with
extra low SER sensitivity, and with no performance or
area penalty when compared to its RISC counterpart.
Central to the SER reduction are the use of new magnetic
memories (MRAM and FRAM) and the minimization of
the combinational logic area in the core. This paper
shows the results of fault injection in the MemProc core
processor and in a RISC machine, and compares
performance and area of both approaches. Experimental
results show a 29 times increase in fault tolerance, with
up to 3.75 times in performance gains and 14 times less
sensitive area.
-
Accurate and Scalable Reliability Analysis of Logic Circuits [p. 1454]
-
M.R. Choudhury and K. Mohanram
Reliability of logic circuits is emerging as an important concern that
may limit the benefits of continued scaling of process technology
and the emergence of future technology alternatives. Reliability
analysis of logic circuits is NP-hard because of the exponential
number of inputs, combinations and correlations in gate failures,
and their propagation and interaction at multiple primary outputs.
By coupling probability theory with concepts from testing and logic
synthesis, this paper presents accurate and scalable algorithms for
reliability analysis of logic circuits. Simulation results for several
benchmark circuits demonstrate the accuracy, performance, and potential
applications of the proposed analysis technique.
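To show why exact reliability analysis scales poorly, the sketch below computes the output error probability of a tiny two-gate circuit by enumerating every input vector and every gate-failure pattern. This brute-force enumeration is exactly what becomes infeasible on real circuits; it is not the algorithm proposed in the paper. The circuit, the independent gate-inversion failure model and the uniform input distribution are all assumptions.

```python
from itertools import product

def circuit(a, b, c, f1=0, f2=0):
    """out = (a AND b) OR c; f1 and f2 invert the output of each gate."""
    g1 = (a & b) ^ f1
    return (g1 | c) ^ f2

def output_error_probability(eps):
    """Probability that the output is wrong when each gate independently
    inverts its output with probability eps (inputs assumed uniform)."""
    total = 0.0
    for a, b, c in product((0, 1), repeat=3):
        good = circuit(a, b, c)
        for f1, f2 in product((0, 1), repeat=2):
            p = (eps if f1 else 1 - eps) * (eps if f2 else 1 - eps)
            if circuit(a, b, c, f1, f2) != good:
                total += p
    return total / 8  # average over the 8 equally likely input vectors

for eps in (0.0, 0.01, 0.1):
    print(f"eps = {eps:<5} output error probability = {output_error_probability(eps):.4f}")
```

For this circuit the result works out to 1.5*eps - eps^2: a failure of the AND gate is masked whenever c = 1, which is the kind of input-dependent correlation that makes the general problem hard.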
-
A New Asymmetric SRAM Cell to Reduce Soft Errors and Leakage Power in FPGA [p. 1460]
-
B.S. Gill, C. Papachristou and F.G. Wolff
Soft errors in semiconductor memories occur due
to charged particle strikes at the cell nodes. In this paper, we
present a new asymmetric memory cell to increase the soft error
tolerance of SRAM. At the same time, this cell can be used at
the reduced supply voltage to decrease the leakage power without
significantly increasing the soft error rate of SRAM. A major use
of this cell is in the configuration memory of FPGA. The cell is
designed using a 70nm process technology and verified using
Spice simulations. Soft error tolerance results are presented and
compared with standard SRAM cell and an existing increased
soft error tolerance cell. Simulation results show that our cell
has lowest soft error rate at the various supply voltages.
Organizers/Moderators: P. Magarshack, STMicroelectronics, FR; E. Schutz, STMicroelectronics, BE
-
Design Challenges at 65nm and Beyond [p. 1466]
-
A.B. Kahng
Semiconductor manufacturing technology faces ever-greater
challenges of pitch, mobility, variability, leakage,
and reliability. To enable cost-effective continuation of
the semiconductor roadmap, there is greater need for
design technology to provide "equivalent scaling", and for
product-specific design innovation (multi-core
architecture, software support, beyond-die integration,
etc.) to provide "more than Moore" scaling. Design
challenges along the road to 45nm include variability and
power management, and leverage of design-manufacturing
synergies. Potential solutions include "design for
manufacturability" bridges between chip implementation
and manufacturing know-how.
-
The ARTEMIS Cross-Domain Architecture for Embedded Systems [p. 1468]
-
H. Kopetz
Today the embedded system market is a highly
fragmented market, where custom-designed solutions
dominate, resulting in a significant duplication of
development effort for hardware, software and services.
The ever-increasing complexity level of embedded
systems, the technology trends of the semiconductor
industry to large production series of chips, and the
increased competition in the world market entail the need
for a European-wide coherent and integrated development
strategy for embedded systems. The ARTEMIS
technology platform has been created to fill this need by
joining the forces of many of the European players in the
embedded system market in order to create the critical
mass that is necessary to tackle the formidable challenges
of the field.
-
HW/SW Implementation from Abstract Architecture Models [p. 1470]
-
A.A. Jerraya
The evolution of technologies is enabling the
integration of complex platforms in a single chip, called a
System-on-Chip (SoC). A modern SoC may include one or
several CPU subsystems to execute software and
sophisticated interconnect in addition to specific hardware
subsystems. This is no longer an advanced research topic
for academia. 90% of SoCs designed since the start of the
130nm process include at least one CPU. Multimedia
platforms (e.g. Nomadik and Nexperia) are already multiprocessor
systems-on-chip (MPSoCs) using different kinds
of programmable processors (e.g. DSPs and
microcontrollers). This trend of building heterogeneous
multi-processor SoCs will even accelerate. It is easy to
imagine that the design of a SoC with more than a hundred
processors will become common practice in a few years'
time, e.g. with 45nm technology in 2008. Compared with
conventional ASIC design, such a multi-processor SoC is
a fundamental change in chip design. These chips will
include very sophisticated interconnect such as networks-on-chips
(NoC). Moreover, to achieve the required
communication performances, each processor may use
different local architectures and communication schemes
(fast links, non standard memory organization and access).
Moderators: T.-W. Kuo, National Taiwan U, ROC; H. van Someren, ACE Associated Compiler Experts, NL
-
Instruction-Set Customization for Real-Time Embedded Systems [p. 1472]
-
H.P. Huynh and T. Mitra
Application-specific customization of the instruction set
helps embedded processors achieve significant performance
and power efficiency. In this paper, we explore customization
in the context of multi-tasking real-time embedded systems.
We propose efficient algorithms to select the optimal
set of custom instructions for a task set under two popular
real-time scheduling policies. Our algorithms minimize
the processor utilization through customization while satisfying
the task deadlines and the constraint on silicon area.
Experimental evaluation with various task sets shows that
appropriate customization can achieve significant reduction
in the processor utilization and the energy consumption.
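The link between custom instructions, processor utilization and schedulability can be illustrated with a small calculation. The abstract does not name its two scheduling policies; the sketch assumes earliest-deadline-first (EDF) and rate-monotonic (RM) scheduling with deadlines equal to periods, and the task parameters are invented. Custom instructions are modeled simply as a reduction in each task's worst-case execution time.

```python
def utilization(tasks):
    """tasks: list of (wcet, period) pairs; returns the sum of wcet / period."""
    return sum(c / t for c, t in tasks)

def edf_schedulable(tasks):
    """EDF with implicit deadlines is schedulable iff utilization <= 1."""
    return utilization(tasks) <= 1.0

def rm_bound(n):
    """Liu-Layland sufficient utilization bound for n rate-monotonic tasks."""
    return n * (2 ** (1.0 / n) - 1)

def rm_sufficient(tasks):
    """Sufficient (not necessary) RM schedulability test."""
    return utilization(tasks) <= rm_bound(len(tasks))

# Hypothetical task set before and after instruction-set customization.
base = [(3, 10), (5, 20), (12, 40)]
custom = [(2, 10), (4, 20), (9, 40)]

for name, tasks in (("base", base), ("customized", custom)):
    print(f"{name:10s} U = {utilization(tasks):.3f}  "
          f"EDF ok: {edf_schedulable(tasks)}  RM bound ok: {rm_sufficient(tasks)}")
```

In this made-up example the base task set passes the EDF test but exceeds the RM bound, while the customized one passes both, which is the kind of headroom that customization is meant to buy.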
-
A Novel Technique to Use Scratch-pad Memory for Stack Management [p. 1478]
-
S. Park, H.-W. Park and S. Ha
Extensive work has been done for optimal management
of scratch-pad memory (SPM) all assuming that the SPM
is assigned a fixed address space. The main target objects
to be placed on the SPM have been code and global memory
since their sizes and locations are not changed dynamically.
We propose a novel idea of dynamic address
mapping of SPM with the assistance of memory management
unit (MMU). It allows us to use SPM for stack management
without architecture modification or compiler
assistance. The proposed technique is orthogonal to
previous work, so the two can be combined. Experimental
results show that the proposed technique yields an
average performance improvement of 13% and energy savings
of 12% compared to using only external
DRAM. It also gives a noticeable speedup and energy
saving over a typical cache solution for stack data.
-
Scratchpad Memories vs Locked Caches in Hard Real-Time Systems: A Quantitative
Comparison [p. 1484]
-
I. Puaut and C. Pais
We propose in this paper an algorithm for off-line selection
of the contents of on-chip memories. The algorithm supports
two types of on-chip memories, namely locked caches
and scratchpad memories. The contents of on-chip memory,
although selected off-line, are changed at run-time, for the sake
of scalability with respect to task size. Experimental results
show that the algorithm yields good ratios of on-chip memory
accesses on the worst-case execution path, with a tolerable
reload overhead, for both types of on-chip memories. Furthermore,
we highlight the circumstances under which one
type of on-chip memory is more appropriate than the other
depending on architectural parameters (cache block size) and
application characteristics (basic block size).
-
Task Scheduling for Reliable Cache Architectures of Multiprocessor Systems [p. 1490]
-
M. Sugihara, T. Ishihara and K. Murakami
This paper presents a task scheduling method for reliable
cache architectures (RCAs) of multiprocessor systems.
The RCAs dynamically switch their operation modes for reducing
the usage of vulnerable SRAMs under real-time constraints.
A mixed integer programming model has been built
for minimizing vulnerability under real-time constraints.
Experimental results have shown that our task scheduling
method achieved 47.7-99.9% less vulnerability than a conventional
approach.
Moderators: L. Daniel, Massachusetts Institute of Technology, US; L.M. Silveira, TU Lisbon, PT
-
Fast Positive-Real Balanced Truncation of Symmetric Systems Using Cross Riccati Equations [p. 1496]
-
N. Wong
We present a computationally efficient implementation
of positive-real balanced truncation (PRBT) for symmetric
multiple-input multiple-output (MIMO) systems. The solution
of a pair of algebraic Riccati equations (AREs) in conventional
PRBT, whose complexity limits practical large-scale
realization, is replaced with the solution of one cross
Riccati equation (XRE). The cross-Riccatian solution then
permits simple construction of projection matrices without
actually balancing the system. The method encompasses
passive linear networks, as commonly used in interconnect
and package modeling, due to their inherent reciprocity
and therefore symmetric transfer functions. Effectiveness of
the proposed approach is verified by numerical examples.
-
Random Sampling of Moment Graph: A Stochastic Krylov-Reduction Algorithm [p. 1502]
-
Z. Zhu and J. Phillips
In this paper we introduce a new algorithm for model order reduction
in the presence of parameter or process variation. Our analysis
is performed using a graph interpretation of the multi-parameter
moment matching approach, leading to a computational technique
based on Random Sampling of Moment Graph (RSMG). Using this
technique, we have developed a new algorithm that combines the
best aspects of recently proposed parameterized moment-matching
and approximate TBR procedures. RSMG attempts to avoid both
exponential growth of computational complexity and multiple matrix
factorizations, the primary drawbacks of existing methods, and
illustrates good ability to tailor algorithms to apply computational
effort where needed. Industry examples are used to verify our new
algorithms.
-
Statistical Model Order Reduction for Interconnect Circuits Considering Spatial Correlations [p. 1508]
-
J. Fan, N. Mi, S.X.-D. Tan, Y. Cai and X. Hong
In this paper, we propose a novel statistical model order reduction
technique, called statistical spectrum model order reduction (SSMOR)
method, which considers both intra-die and inter-die process
variations with spatial correlations. The SSMOR generates order-reduced
variational models based on given variational circuits. The
reduced model can be used for fast statistical performance analysis
of interconnect circuits with variational input sources, such
as power grid and clock networks. The SSMOR uses statistical
spectrum method to compute the variational moments and Monte
Carlo sampling method with the modified Krylov subspace reduction
method to generate the variational reduced models. To consider
spatial correlations, we apply orthogonal decomposition to
map the correlated random variables into independent and uncorrelated
variables. Experimental results show that the proposed method
can deliver about 100x speedup over the pure Monte Carlo projection-based
reduction method with about 2% error for both means and
variances in statistical transient analysis.
-
A Sparse Grid Based Spectral Stochastic Collocation Method for Variations-Aware Capacitance
Extraction of Interconnects under Nanometer Process Technology [p. 1514]
-
H. Zhu, X. Zeng, W. Cai, J. Xue and D. Zhou
In this paper, a Spectral Stochastic Collocation Method
(SSCM) is proposed for the capacitance extraction of interconnects
with stochastic geometric variations for nanometer
process technology. The proposed SSCM has several
advantages over the existing methods. Firstly, compared
with the PFA (Principal Factor Analysis) modeling of geometric
variations, the K-L (Karhunen-Loeve) expansion involved
in SSCM can be independent of the discretization of
conductors, thus significantly reducing the computation cost.
Secondly, compared with the perturbation method, the stochastic
spectral method based on Homogeneous Chaos expansion
has optimal (exponential) convergence rate, which
makes SSCM applicable to most geometric variation cases.
Furthermore, Sparse Grid combined with a MST (Minimum
Spanning Tree) representation is proposed to reduce
the number of sampling points and the computation time
for capacitance extraction at each sampling point. Numerical
experiments have demonstrated that SSCM can achieve
higher accuracy and faster convergence rate compared with
the perturbation method.
-
Simulation Methodology and Experimental Verification for the Analysis of Substrate Noise on LC-VCO's [p. 1520]
-
S. Bronckers, C. Soens, G. Van Der Plas, G. Vandersteen and Y. Rolain
This paper presents a methodology for the analysis
and prediction of the impact of wideband substrate noise on
an LC-Voltage Controlled Oscillator (LC-VCO) from DC up to
Local Frequency (LO). The impact of substrate noise is modeled
a priori in a high-ohmic 0.18μm 1P6M CMOS technology and
then verified on silicon on a 900MHz LC-VCO. Below a frequency
of 10MHz, the impact is dominated by the on-chip resistance of
the VCO ground, while above 10MHz the bond wires, parasitics
of the on-chip inductor and the PCB decoupling capacitors
determine the behavior of the perturbation.
Moderators: C. Silvano, Politecnico di Milano, IT; E. Schmidt, ChipVision Design Systems, DE
-
Accurate Temperature-Dependent Integrated Circuit Leakage Power Estimation Is Easy [p. 1526]
-
Y. Liu, R.P. Dick, L. Shang and H. Yang
It has been the conventional assumption that, due to the
superlinear dependence of leakage power consumption on temperature,
and widely varying on-chip temperature profiles, accurate leakage estimation
requires detailed knowledge of thermal profile. Leakage power
depends on integrated circuit (IC) thermal profile and circuit design
style. We show that linear models can be used to permit highly-accurate
leakage estimation over the operating temperature ranges in real ICs.
We then show that for typical IC packages and cooling structures, a
given amount of heat introduced at any position in the active layer will
have similar impact on the average temperature of the layer. These
two observations allow us to prove that, for wide ranges of design
styles and operating temperatures, extremely fast, coarse-grained thermal
models, combined with linear leakage power consumption models, permit
highly-accurate system-wide leakage power consumption estimation. The
results of our proofs are further confirmed via comparisons with
leakage estimation based on detailed, time-consuming thermal analysis
techniques. Experimental results indicate that the proposed technique
yields a 59,259x-1,790,000x speedup in leakage power estimation while
maintaining accuracy.
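The claim that a linear model can stand in for a superlinear leakage curve over a limited temperature range can be checked numerically. The sketch below assumes a simple exponential leakage-versus-temperature curve with made-up constants (it is not the authors' device model) and compares the worst-case relative error of a least-squares line over a narrow and a wide temperature range.

```python
import math

def leakage(temp_c, i0=1.0, k=0.02, t_ref=25.0):
    """Assumed exponential leakage curve in normalized units (hypothetical)."""
    return i0 * math.exp(k * (temp_c - t_ref))

def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def max_rel_error(t_lo, t_hi, steps=50):
    """Worst relative error of the fitted line over [t_lo, t_hi] deg C."""
    temps = [t_lo + (t_hi - t_lo) * i / steps for i in range(steps + 1)]
    leaks = [leakage(t) for t in temps]
    slope, intercept = fit_line(temps, leaks)
    return max(abs(slope * t + intercept - y) / y for t, y in zip(temps, leaks))

print(f"60-80 C:  max relative error = {max_rel_error(60, 80):.3%}")
print(f"25-125 C: max relative error = {max_rel_error(25, 125):.3%}")
```

Under these assumptions the line tracks the curve closely over the narrow range and poorly over the wide one, so how well a linear model works in practice depends on the actual operating range and device constants.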
-
Low-Overhead Circuit Synthesis for Temperature Adaptation Using Dynamic Voltage Scheduling [p. 1532]
-
S. Ghosh, S. Bhunia and K. Roy
Increasing power density causes die overheating due to
limited cooling capacity of the package. Conventional thermal
management techniques, e.g., logic shutdown, clock gating, frequency
scaling and simultaneous voltage-frequency tuning, increase the
design complexity and/or degrade the performance significantly. In
this paper, we propose a novel design technique, which makes a
circuit amenable to temperature adaptation using dynamic voltage
scheduling (DVS). It is accomplished by a synthesis technique that
(a) isolates and predicts the set of paths that may become critical
under variations, (b) ensures they are activated rarely, and (c)
tolerates possible delay failures (at reduced voltage) in these paths
by adaptive clock stretching. This allows us to schedule a lower
supply voltage during increased temperature without requiring
frequency tuning. Simulation results on an example pipeline show
that the proposed design yields a temperature reduction similar to the
conventional design with only 11% performance penalty and 14%
area overhead. The conventional pipeline design, on the contrary, leads
to 50% performance degradation due to reduced operating
frequency.
-
Maximum Circuit Activity Estimation Using Pseudo-Boolean Satisfiability [p. 1538]
-
H. Mangassarian, A. Veneris, S. Safarpour, F.N. Najm and M.S. Abadir
Disproportionate instantaneous power dissipation
may result in unexpected power supply voltage fluctuations and
permanent circuit damage. Therefore, estimation of maximum
instantaneous power is crucial for the reliability assessment of
VLSI chips. Circuit activity and consequently power dissipation
in CMOS circuits are highly input-pattern dependent, making the
problem of maximum power estimation computationally hard.
This work proposes a novel pseudo-boolean satisfiability based
method that reports the exact input sequence maximizing circuit
activity in combinational and sequential circuits. The method
is also extended to take multiple gate transitions into account
by integrating delay information into the pseudo-boolean optimization
problem. An extensive suite of experiments on ISCAS85
and ISCAS89 circuits confirms the efficiency and robustness of
the approach compared to simulation based techniques and encourages
further research for low-power solutions using boolean
satisfiability.
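The optimization objective can be made concrete on a toy circuit. The sketch below finds the pair of consecutive input vectors that toggles the most gate outputs, using exhaustive enumeration in place of the pseudo-boolean satisfiability solver the paper proposes. The four-gate circuit, the zero-delay model and the unit weight per toggle are all assumptions.

```python
from itertools import product

def evaluate(a, b, c):
    """Gate outputs of a small hypothetical combinational circuit."""
    g1 = a & b
    g2 = b | c
    g3 = g1 ^ g2
    g4 = 1 - g3  # inverter
    return (g1, g2, g3, g4)

def toggles(v1, v2):
    """Number of gate outputs that change when the input goes from v1 to v2."""
    return sum(x != y for x, y in zip(evaluate(*v1), evaluate(*v2)))

def max_activity():
    """Exhaustively search all ordered pairs of input vectors."""
    vectors = list(product((0, 1), repeat=3))
    pairs = ((toggles(v1, v2), v1, v2) for v1 in vectors for v2 in vectors)
    return max(pairs, key=lambda item: item[0])

count, first, second = max_activity()
print(f"max toggles = {count} for {first} -> {second}")
```

Here at most three of the four gates can toggle, because the XOR output stays fixed whenever both of its inputs change. Enumeration costs 4^n pairs for n inputs, which is why a solver-based formulation is attractive for real circuits.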
-
Efficient Computation of Discharge Current Upper Bounds for Clustered Sleep Transistor Sizing [p. 1544]
-
A. Sathanur, A. Calimera, L. Benini, A. Macii, E. Macii and M. Poncino
Sleep transistor insertion is a key step in low power design
methodologies for nanometer CMOS. In the clustered
sleep transistor approach, a single sleep transistor is shared
among a number of gates and it must be sized according to
the maximum current that can be injected onto the virtual
ground by the gates in the cluster. A conservative (upper
bound) estimate of the maximum injected current is required
in order to avoid excessive speed degradation and possible
violations of timing constraints. In this paper we propose
a scalable algorithm for tightening upper bound computation,
with a controlled and tunable computational cost.
The algorithm leverages the capabilities of state-of-the-art
commercial timing analysis engines, and it is tightly integrated
into standard industrial flow for leakage optimization.
Benchmark results demonstrate the effectiveness and
efficiency of our approach.
-
Process Tolerant Beta-Ratio Modulation for Ultra-Dynamic Voltage Scaling [p. 1550]
-
M.-E. Hwang, T. Cakici and K. Roy
Most wireless and hand-held gadgets work in burst
mode, and the performance demand varies with time. When
the performance requirement is low, the supply voltage can
be dithered and the circuit can move from the superthreshold
region to the subthreshold region (Vdd < VT). Such ultra dynamic
voltage scaling (UDVS), where the supply voltage
switches from 1.2V to 200mV (say), enables remarkable
decrease in power consumption with "acceptable" performance
penalty in the non-burst mode of operation. However,
subthreshold operation is very sensitive to process
variation (PV) due to the reduced noise margin, and may
not work properly unless corrective measures are taken. In
this paper, we model the trip voltage in both subthreshold
and superthreshold regions, and analyze the impact of PV
in UDVS. We also propose a circuit design technique such
that the same logic gate can efficiently operate in both superthreshold
and subthreshold regions under PV. We do that
by modulating the β-ratio (P-to-N ratio) of the logic gates.
By proper β-ratio modulation, we show that the proposed
methodologies can lower energy dissipation per cycle by
more than an order of magnitude (42X) in non-burst mode
with reduced impact to PVs.
Organizers: S. Prudhomme, Airbus, FR; E. Lansard, Alcatel Alenia Space, FR
Moderator: P. Aycinena, Editor, EDA Confidential, US
-
Towards Total Open Source in Aeronautics and Space? [p. 1556]
-
Panelists: E. Bantegnie, G. Ladier, R. Mueller, F. Gasperoni and A. Wilson
Aeronautics and space are extraordinarily technical fields of engineering and science that reside within
a niche characterized by unique end-product requirements. The severe operating conditions in flight or
in space, in combination with the need for mission-critical reliability, create a difficult and challenging
level of expectation for those who develop the hardware and software that goes into systems for
aeronautics and space.
Moderators: C. Grassmann, Infineon Technologies, DE; O. Deprez, Texas Instruments, FR
-
A Tiny and Efficient Wireless Ad-hoc Protocol for Low-cost Sensor Networks [p. 1557]
-
P. Gburzynski, B. Kaminska and W. Olesinski
We introduce a simple ad-hoc routing scheme that operates
in the true spirit of ad-hoc networking, i.e., in a modeless
fashion, without neighborhood discovery or explicit
point-to-point forwarding, while offering a high (and tunable)
degree of reliability, fault-tolerance and robustness.
Being aimed at truly tiny devices (e.g., with 1KB of RAM),
our scheme can automatically take advantage of extra memory
resources to improve the quality of routes for critical
nodes. In contrast to some popular low-cost solutions, like
ZigBee™, our approach involves a single node type and exhibits
lower resource requirements. The presented scheme
has been verified in an industrial deployment with stringent
quality of service requirements.
-
Scalable Reconfigurable Channel Decoder Architecture for Future Wireless Handsets [p. 1563]
-
G. Krishnaiah, N. Engin and S. Sawitzki
The current trend in the consumer devices and communication
service provider market is the integration of different
communication standards within a single device (e.g.
GSM phone with Bluetooth, WLAN and infrared interface)
requiring tight integration of mobile broadcast, networking
and cellular technologies within one product. The channel decoder
is traditionally one of the most computationally intensive
building blocks within digital receivers. The aim of
this paper is to investigate the feasibility of a programmable
channel decoder that can be dynamically reconfigured for
decoding turbo and convolutionally encoded streams from
various wireless standards. The architecture options are
presented and the area costs and flexibility compared between
the options. The resulting decoder architecture supports
hardware resource sharing and reconfiguration between
different standards and decoders and is more efficient
in terms of silicon area than independent implementation of
every decoder on the same IC.
-
A New Pipelined Implementation for Minimum Norm Sorting Used in Square Root
Algorithm for MIMO-VBLAST Systems [p. 1569]
-
Z. Khan, T. Arslan, J.S. Thompson, A.T. Erdogan
Multiple Input - Multiple Output (MIMO) wireless
technology involves highly complex vectors and matrix
computations which are directly related to increased
power and area consumption. This paper proposes an
area and power efficient VLSI architecture that can
serve the dual purpose of minimum norm sorting of
rows as well as upper/lower block triangularization of
matrices. The resources inside the architecture are
shared among both operations and only primitive
computations are used. Results indicate savings in
silicon real estate as well as power consumption
compared to a previous architecture without degrading
performance.
-
Optimization of the "FOCUS" Inband-FEC Architecture for 10-Gbps SDH/SONET Optical
Communication Channels [p. 1575]
-
A. Tychopoulos and O. Koufopavlou
Forward-Error Correction (FEC) is of key importance
to the robustness of optical communication networks. In
particular, Inband-FEC is an attractive option, because it
improves channel-performance without requiring an
increase of the transmission bandwidth. We have devised
and implemented a novel inband FEC method, dubbed
FOCUS, for the electronic-mitigation of physical
impairments in SDH/SONET optical networks. It is an
inherently low-cost approach for both the metro and
backbone network regions, scalable to any SDH/SONET
rate and capable of significantly increasing optical channel
performance. This paper analyzes the most sophisticated
of the many optimizations that were
employed to minimize the architectural complexity of
FOCUS, which fall into three categories: a) Arithmetic operator design, b)
Resource sharing and c) Redundant logic elimination.
These optimizations were necessary to obtain a prototype,
which eventually permitted the first fully successful
laboratory evaluation of the FOCUS Inband-FEC method.
Moderators: C. Bolchini, Politecnico di Milano, IT; S. Bocchio, STMicroelectronics, IT
-
A Framework for System Reliability Analysis Considering Both System Error Tolerance and
Component Test Quality [p. 1581]
-
S.-J. Pan and K.-T. Cheng
The failure rate, the sources of failures and the test costs
for nanometer devices are all increasing. Therefore, to create
a reliable system-on-a-chip device requires designers to
implement fault tolerance. However, while system-level fault
tolerance could significantly relax the quality requirements of
the system's building blocks, every fault-tolerant scheme only
works under certain failure mechanisms and within a certain
range of error probabilities. Also, designing a system with a
high failure-rate component could be very expensive because
the growth rate of the design complexity and the system overhead
for fault tolerance could be significantly greater than
the component failure rate. Therefore, it is desirable to understand
the trade-offs between component test quality and
system fault-tolerant capability for achieving the desired reliability
under cost constraints. In this paper, we propose an
analysis framework for system reliability considering (a) the
test quality achieved by manufacturing testing, on-line self-checking,
and off-line built-in self-test; (b) the fault-tolerant
and spare schemes; and (c) the component defect and error
probabilities. We demonstrate that, through proper redundancy
configurations and low-cost testing to ensure a certain
degree of component test quality, a low-redundant system
might achieve equal or higher reliability than a high-redundant
system.
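
The closing claim can be illustrated with a textbook model rather than the authors' framework. The sketch below assumes independent module failures and a perfect voter, and compares a single well-tested module against triple modular redundancy (TMR) built from poorly tested modules. The function name and the reliability values 0.99 and 0.90 are illustrative assumptions, not figures from the paper.

```python
def tmr_reliability(r: float) -> float:
    """Reliability of a 2-out-of-3 majority system.

    Assumes independent module failures and a perfect voter
    (both are simplifying assumptions of this sketch).
    """
    # The system works if all three modules work,
    # or if exactly two of the three work.
    return r**3 + 3 * r**2 * (1 - r)


if __name__ == "__main__":
    well_tested = 0.99      # hypothetical module reliability after good testing
    poorly_tested = 0.90    # hypothetical module reliability after weak testing
    print(f"single well-tested module: {well_tested:.3f}")
    print(f"TMR of poorly tested ones: {tmr_reliability(poorly_tested):.3f}")
```

Under these assumed numbers the non-redundant, well-tested module is more reliable than the TMR arrangement, which is the kind of trade-off the abstract describes.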
-
Experimental Evaluation of Protections against Laser-induced Faults and Consequences
on Fault Modeling [p. 1587]
-
R. Leveugle, A. Ammari, V. Maingot, E. Teyssou, P. Moitrel, C. Mourtel, N. Feyt, J.-B. Rigaud
and A. Tria
Lasers can be used by hackers to inject
faults in circuits and induce security flaws. On-line
detection mechanisms are classically proposed to counter
such attacks, and are often based on error detecting
codes. However, the efficiency of such schemes has not
been precisely validated against real attack conditions.
This paper presents results showing that, with a given type
of laser, a classical protection technique can leave open
doors to an attacker. The results also give insights into the
fault models to be taken into account when designing a
secured circuit.
-
Evaluation of Design for Reliability Techniques in Embedded Flash Memories [p. 1593]
-
B. Godard, J.-M. Daga, L. Torres and G. Sassatelli
Non-volatile Flash memories are becoming more and
more popular in Systems-on-Chip (SoC). Embedded Flash
(eFlash) memories are based on the well-known floating-gate
transistor concept. The reliability of this type of
technology is a growing issue for embedded systems;
endurance and retention are of course the main features to
analyze. To enhance memory reliability, current eFlash
memory designs use techniques such as Error Correction
Code (ECC), Redundancy or Threshold Voltage (VT)
Analysis. In this paper, a memory model to evaluate the
reliability of eFlash memory arrays under distinct
enhancement schemes is developed.
-
Reduction of Detected Acceptable Faults for Yield Improvement via Error-Tolerance [p. 1599]
-
T.-Y. Hsieh, K.-J. Lee and M.A. Breuer
Error-tolerance is an innovative way to enhance the
effective yield of IC products. Previously a test
methodology based on error-rate estimation to support
error-tolerance was proposed. Without violating the system
error-rate constraint specified by the user, this
methodology identifies a set of faults that can be ignored
during testing, thereby leading to a significant
improvement in yield. However, usually the patterns
detecting all of the unacceptable faults also detect a large
number of acceptable faults, resulting in a degradation in
achievable yield improvement. In this paper, we first
provide a probabilistic analysis of this problem and show
that a conventional ATPG procedure cannot adequately
address this problem. We then present a novel test pattern
selection procedure and an output masking technique to
deal with this problem. The selection process generates a
test set aimed to detect all unacceptable faults but as few
acceptable faults as possible. The masking technique then
examines the generated test patterns and identifies a list of
output lines that can be masked (not observed) during
testing so as to further avoid the detection of acceptable
faults. Experimental results show that by employing the
proposed techniques, only a small number of acceptable
faults are still detected. In many cases the actual yield
improvement approaches the optimal value that can be
achieved.
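
As a rough illustration of the selection goal (detect every unacceptable fault while detecting as few acceptable faults as possible), the following is a minimal greedy sketch. It is not the authors' procedure: the data layout, the function name `select_patterns`, and the cost ratio used to rank patterns are all assumptions made for this example.

```python
def select_patterns(detects, unacceptable):
    """Greedy test pattern selection (illustrative heuristic only).

    detects:      dict mapping pattern name -> set of fault ids it detects
    unacceptable: set of fault ids that must be detected
    Returns (chosen patterns in order, acceptable faults detected anyway).
    """
    remaining = set(unacceptable)
    chosen = []
    collateral = set()  # acceptable faults detected by the chosen patterns
    while remaining:
        best, best_key = None, None
        for pattern, faults in detects.items():
            gain = len(faults & remaining)
            if pattern in chosen or gain == 0:
                continue
            new_acceptable = len(faults - unacceptable - collateral)
            # Fewest newly detected acceptable faults per newly covered
            # unacceptable fault; ties go to larger gain, then to name.
            key = (new_acceptable / gain, -gain, pattern)
            if best_key is None or key < best_key:
                best, best_key = pattern, key
        if best is None:
            raise ValueError("some unacceptable faults are undetectable")
        chosen.append(best)
        remaining -= detects[best]
        collateral |= detects[best] - unacceptable
    return chosen, collateral


if __name__ == "__main__":
    detects = {
        "t1": {"u1", "u2", "a1", "a2", "a3"},
        "t2": {"u1", "a1"},
        "t3": {"u2"},
        "t4": {"u1", "u2", "u3", "a1", "a2", "a3", "a4"},
        "t5": {"u3"},
    }
    print(select_patterns(detects, {"u1", "u2", "u3"}))
```

In this toy input the single pattern `t4` would cover all unacceptable faults but also four acceptable ones, so the heuristic prefers three narrower patterns.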
Moderators: M. Berkelaar, Magma Design Automation, NL; J. Cortadella, UP Catalunya, ES
-
Use of Statistical Timing Analysis on Real Designs [p. 1605]
-
A. Nardi, E. Tuncer, S. Naidu, A. Antonau, S. Gradinaru, T. Lin and J. Song
A vast literature has been published on Statistical Static
Timing Analysis (SSTA), its motivations, its different implementations
and their runtime/accuracy trade-offs. However,
very limited literature exists ([1]) on the applicability and
the usage models of this new technology on real designs.
This work focuses on the use of SSTA in real designs and
its practical benefits and limitations over the traditional design
flow. We introduce two new metrics to drive the optimization:
skew criticality and aggregate sensitivity.
Practical benefits of SSTA are demonstrated for clock
tree analysis, and correct modeling of on-chip-variations.
The use of SSTA to cover the traditional corner analysis and
to drive optimization is also discussed. Results are reported
on three designs implemented on a 90nm technology.
-
A Novel Criticality Computation Method in Statistical Timing Analysis [p. 1611]
-
F. Wang, Y. Xie and H. Ju
The impact of process variations increases as technology
scales into the nanometer region. Under large process variations,
the path and arc/node criticality [18] provide effective
metrics in guiding circuit optimization. To facilitate the criticality
computation considering the correlation, we define the critical
region for the path and arc/node in a timing graph, and
propose an efficient method to compute the criticality for paths
and arcs/nodes simultaneously by a single breadth-first graph
traversal during the backward propagation. Instead of choosing
a set of paths for analysis prematurely, we develop a new
property of the path criticality to prune those paths with low
criticality at very early stages, so that our path criticality
computation method has linear complexity with respect to the
timing edges in a timing graph. To improve the computation
accuracy, cutset and path criticality properties are exploited to
calibrate the computation results. The experimental results on
ISCAS benchmark circuits show that our criticality computation
method can achieve high accuracy with fast speed.
-
Efficient Computation of the Worst-Delay Corner [p. 1617]
-
L. Guerra e Silva, L.M. Silveira and J.R. Phillips
Timing analysis and verification is a critical stage in digital
integrated circuit design. As feature sizes decrease to
nanometer scale, the impact of process parameter variations
on circuit performance becomes extremely relevant.
Even though several statistical timing analysis techniques
have recently been proposed, as a form of incorporating
variability effects in traditional static timing analysis, corner
analysis still is the current timing signoff methodology
for any industrial design. Since it is impossible to analyze
a design for all the process corners, due to the exponential
size of the corner space, the design is usually analyzed for
a set of carefully chosen corners, that are expected to cover
all the worst-case scenarios. However, there is no established
systematic methodology for picking the right worst-case
corners, and this task usually relies on the experience
of design and process engineers, often leading to
over-design. This paper proposes an efficient automated methodology
for computing the worst-delay process corners of a
digital integrated circuit, given a linear parametric characterization
of the gate and interconnect delays.
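
To make the linear parametric setting concrete, here is a small sketch that is not the authors' algorithm. It assumes each path delay has the form d = d0 + sum(s_i * p_i), with every parameter p_i confined to an interval [lo_i, hi_i]. For one path the worst corner follows from the signs of the sensitivities, and the circuit-level worst corner is the worst over the paths. Explicit path enumeration, as done here, is only practical for tiny examples; the function name and data layout are assumptions.

```python
def worst_delay_corner(paths, bounds):
    """Worst-delay corner for linear path delays (illustrative sketch).

    paths:  list of (nominal_delay, sensitivities), one entry per path
    bounds: list of (lo, hi) intervals, one per process parameter
    Returns (worst_delay, corner) where corner is a list of parameter values.
    """
    worst_delay, worst_corner = float("-inf"), None
    for nominal, sens in paths:
        # A linear function over a box is maximized at a vertex:
        # take the upper bound where the sensitivity is positive,
        # the lower bound otherwise.
        corner = [hi if s > 0 else lo for s, (lo, hi) in zip(sens, bounds)]
        delay = nominal + sum(s * p for s, p in zip(sens, corner))
        if delay > worst_delay:
            worst_delay, worst_corner = delay, corner
    return worst_delay, worst_corner


if __name__ == "__main__":
    paths = [(10.0, [1.0, -2.0]), (11.0, [0.5, 0.5])]
    bounds = [(-1.0, 1.0), (-1.0, 1.0)]
    print(worst_delay_corner(paths, bounds))
```

Note that the path with the larger nominal delay is not necessarily the one that defines the worst corner, which is why hand-picked corners can miss the worst case.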
Moderators: I. Puaut, Rennes U/IRISA, FR; S. Baruah, North Carolina U, US
-
Accounting for Cache-Related Preemption Delay in Dynamic Priority Schedulability Analysis [p. 1623]
-
L. Ju, S. Chakraborty and A. Roychoudhury
Recently there has been considerable interest in incorporating
timing effects of microarchitectural features of
processors (e.g. caches and pipelines) into the schedulability
analysis of tasks running on them. Following this line
of work, in this paper we show how to account for the effects
of cache-related preemption delay (CRPD) in the standard
schedulability tests for dynamic priority schedulers
like EDF. Even if the memory space of tasks is disjoint, their
memory blocks usually map into a shared cache. As a result,
task preemption may introduce additional cache misses
which are encountered when the preempted task resumes execution;
the delay due to these additional misses is called
CRPD. Previous work on accounting for CRPD was restricted
to only static priority schedulers and periodic task
models. Our work extends these results to dynamic priority
schedulers and more general task models (e.g. sporadic,
generalized multiframe and recurring real-time). We show
that our schedulability tests are useful through extensive experiments
using synthetic task sets, as well as through a detailed
case study.
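
A much coarser idea than the tests developed in the paper can be sketched as follows: inflate each task's worst-case execution time by a CRPD bound and apply the classical EDF utilization test. The sketch assumes implicit-deadline periodic tasks on one processor and assumes that a safe per-job CRPD bound is supplied for every task; how to derive such a bound is outside the sketch. The function name and tuple layout are illustrative.

```python
def edf_utilization_with_crpd(tasks):
    """Coarse EDF schedulability check with CRPD-inflated execution times.

    tasks: list of (wcet, period, crpd_bound) with deadline == period.
    crpd_bound is an ASSUMED safe bound on the cache-related preemption
    delay charged to each job of the task.
    Returns (schedulable, inflated_utilization).
    """
    utilization = sum((wcet + crpd) / period for wcet, period, crpd in tasks)
    # Classical EDF bound for implicit-deadline tasks on one processor.
    return utilization <= 1.0, utilization


if __name__ == "__main__":
    print(edf_utilization_with_crpd([(1, 4, 0.5), (2, 8, 0.5), (3, 12, 0)]))
    print(edf_utilization_with_crpd([(1, 4, 1), (2, 8, 1), (3, 12, 0)]))
```

The second call uses larger assumed CRPD bounds on the same task set and no longer passes, showing how ignoring CRPD can make a task set look schedulable when it is not.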
-
Energy-Efficient Real-Time Task Scheduling with Task Rejection [p. 1629]
-
J.-J. Chen, T.-W. Kuo, C.-L. Yang and K.-J. King
In the past decade, energy-efficiency has been an important system
design issue in both hardware and software management. For mobile
applications with critical missions, both energy consumption
reduction and timing guarantee have to be provided by system engineers
to extend operation duration and maintain system stability.
This research explores real-time systems composed of homogeneous
multiple processors with the capability of dynamic voltage scaling
(DVS), in which a given task can be rejected with a specified value
of rejection penalty. The objective is to minimize the summation of
the total rejection penalty for the tasks that are not completed in
time and the energy consumption of the system. This study provides
analysis to show that there does not exist any polynomial-time approximation
algorithm for the studied problem, unless P = NP.
Moreover, we propose algorithms for systems with ideal and non-ideal
DVS processors. The capability of the proposed algorithms is
demonstrated through extensive evaluations. The evaluation results reveal
that our proposed algorithms could derive effective solutions of the
energy-efficient scheduling problem with task rejection considerations.
Keywords: Energy-Efficient Scheduling, Task Rejection,
Real-Time Task Scheduling.
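
The objective (energy plus total rejection penalty) can be made concrete with a deliberately small sketch that differs from the paper's setting in several ways, all of them assumptions of this example: a single processor, a common deadline for all tasks, ideal DVS with power equal to speed cubed, and exhaustive search over accept/reject decisions, which is only feasible for a handful of tasks.

```python
from itertools import combinations


def best_acceptance(tasks, deadline):
    """Exhaustively choose which tasks to accept (illustrative sketch).

    tasks:    list of (cycles, rejection_penalty)
    deadline: common deadline shared by all tasks (an assumption here)
    Assumed power model: power = speed**3, so running W cycles at the
    constant speed W/deadline costs W**3 / deadline**2 energy.
    Returns (minimum cost, tuple of accepted task indices).
    """
    n = len(tasks)
    best_cost, best_set = float("inf"), ()
    for k in range(n + 1):
        for accepted in combinations(range(n), k):
            cycles = sum(tasks[i][0] for i in accepted)
            energy = cycles**3 / deadline**2
            penalty = sum(tasks[i][1] for i in range(n) if i not in accepted)
            if energy + penalty < best_cost:
                best_cost, best_set = energy + penalty, accepted
    return best_cost, best_set


if __name__ == "__main__":
    tasks = [(1, 10.0), (1, 0.5), (2, 30.0)]
    print(best_acceptance(tasks, deadline=1.0))
```

Because energy grows cubically with the accepted workload under this model, rejecting a task with a moderate penalty can be cheaper than speeding up to fit it.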
-
Feasibility Intervals for Multiprocessor Fixed-Priority Scheduling of Arbitrary Deadline
Periodic Systems [p. 1635]
-
L. Cucu and J. Goossens
In this paper we study the global scheduling of periodic
task systems with arbitrary deadlines upon identical multiprocessor
platforms. We first show two very general properties
which are well-known for uniprocessor platforms and
which remain valid for multiprocessor platforms: (i) under a few,
not very restrictive assumptions, we show that any feasible
schedule of arbitrary deadline periodic task systems
is periodic from some point and (ii) for the specific case of
synchronous periodic task systems, we show that the schedule
repeats from the origin. We then present our main result:
any feasible schedule of asynchronous periodic task
sets using a fixed-priority scheduler is periodic from a specific
point. Moreover, we characterize that point and we
provide a feasibility interval for those systems.
-
Energy Minimization with Soft Real-time and DVS for Uniprocessor and Multiprocessor
Embedded Systems [p. 1641]
-
M. Qiu, C. Xue, Z. Shao and E.H.-M. Sha
Energy-saving is extremely important in real-time embedded
systems. Dynamic Voltage Scaling (DVS) is one of
the prime techniques used to achieve energy-saving. Due
to the uncertainties in execution times of some tasks of systems,
this paper models each varying execution time as a random
variable. By using a probabilistic approach, we propose
two optimal algorithms, one for uniprocessor and one for
multiprocessor to explore soft real-time embedded systems
and avoid over-designing them. Our goal is to minimize
the expected total energy consumption while satisfying the
timing constraint with a guaranteed confidence probability.
The solutions can be applied to both hard and soft real-time
systems. The experimental results show that our approach
achieves significantly greater energy savings than previous work.
Moderators: R. Marculescu, Carnegie Mellon U, US; D. Atienza, DACYA, Madrid Complutense U, ES
-
Joint Consideration of Fault-Tolerance, Energy-Efficiency and Performance in On-Chip Networks [p. 1647]
-
A. Ejlali, B.M. Al-Hashimi, P. Rosinger and S.G. Miremadi
High reliability against noise, low energy consumption
and high performance are key objectives in the design of
on-chip networks. Recently some researchers have
considered the various trade-offs between two of these
objectives. However, as we will argue later, the three
design objectives should be considered jointly and
simultaneously. The first aim of this paper is to analyze
the impact of various error-control schemes on the
simultaneous trade-off between reliability, performance
and energy when voltage swing varies. We provide a
detailed comparative analysis of the error-control schemes
using analytical models and SPICE simulations. The
second aim of this paper is to analyze the impact of noise
power and time constraint on the effectiveness of error-control
schemes, which has not been addressed in
previous studies.
-
Impact of Process Variations on Multicore Performance Symmetry [p. 1653]
-
E.B. Humenay, D. Tarjan and K. Skadron
Multi-core architectures introduce a new granularity
at which process variations may occur, yielding asymmetry
among cores that were designed - and that software
expects - to be symmetric in performance. The chief sources
of this phenomenon are highly correlated, "systematic"
within-die variations such as optical imperfections yielding
variations across the exposure field. Per-core voltages can
be used to bring all cores to the same performance level, but
this compensation strategy also affects power, chiefly due to
leakage power. Boosting a core's frequency may therefore
boost its leakage sufficiently to engage thermal throttling.
This sets up a tradeoff between static performance asymmetry
due to frequency variation versus dynamic performance
asymmetry due to thermal throttling. This paper explores
the potential magnitude of these effects.
-
Temperature Aware Task Scheduling in MPSoCs [p. 1659]
-
A. Kivilcim Coskun, T. Simunic Rosing and K. Whisnant
In deep submicron circuits, elevation in temperatures has
brought new challenges in reliability, timing, performance,
cooling costs and leakage power. Conventional thermal management
techniques sacrifice performance to control the thermal
behavior by slowing down or turning off the processors
when a critical temperature threshold is exceeded. Moreover,
studies have shown that in addition to high temperatures, temporal
and spatial variations in temperature impact system reliability.
In this work, we explore the benefits of thermally
aware task scheduling for multiprocessor systems-on-a-chip
(MPSoC). We design and evaluate OS-level dynamic scheduling
policies with negligible performance overhead. We show
that, using simple-to-implement policies that make decisions
based on temperature measurements, better temporal and spatial
thermal profiles can be achieved in comparison to state-of-the-art
schedulers. We also enhance reactive strategies such as
dynamic thread migration with our scheduling policies. This
way, hot spots and temperature variations are decreased, and
the performance cost is significantly reduced.
Moderators: R. Zafalon, STMicroelectronics, IT; J. Haid, Infineon Technologies, DE
-
Architectural Leakage-Aware Management of Partitioned Scratchpad Memories [p. 1665]
-
O. Golubeva, M. Loghi, M. Poncino and E. Macii
Partitioning a memory into multiple blocks that can be independently
accessed is a widely used technique to reduce its dynamic
power. For embedded systems, its benefits can be even pushed further
by properly matching the partition to the memory access patterns.
When leakage energy comes into play, however, idle memory
blocks must be put into a proper low-leakage sleep state to actually
save energy when not accessed. In this case, the matching becomes
an instance of the power management problem, because moving to and
from this sleep state requires additional energy.
In this work, we propose an explorative solution to the problem of
leakage-aware partitioning of a memory into disjoint sub-blocks.
In particular, we target scratchpad memories, which are commonly
used in some embedded systems as a replacement of caches.
We show that the total energy (dynamic and static) cost function
yields a non-convex partitioning space, making smart exploration
the only viable option; we propose an effective randomized search
in the solution space that closely matches the results of
exhaustive exploration, when this is feasible.
Experiments on different sets of embedded applications have
shown that total energy savings larger than 60% on average can
be obtained, with a marginal overhead in execution time, thanks to
an effective implementation of the low-leakage sleep state.
-
Memory Bank Aware Dynamic Loop Scheduling [p. 1671]
-
M. Kandemir, T. Yemliha, S.W. Son and O. Özturk
In a parallel system with multiple CPUs, one of the key problems
is to assign loop iterations to processors. This problem,
known as the loop scheduling problem, has been studied in the
past, and several schemes, both static and dynamic, have been
proposed. One of the attractive features of dynamic schemes, as
compared to their static counterparts, is their ability to exploit the
latency variations across the execution times of the different loop
iterations. In all the dynamic loop scheduling techniques proposed
in the literature so far, performance has been the primary metric of
interest. In a battery-operated embedded execution environment,
however, power consumption is another metric to consider during
iteration-to-processor assignment. In particular, in a banked
memory system, this assignment can have an important impact on
memory power consumption, which can be a significant portion
of the overall energy consumption, especially for data-intensive
embedded applications such as those from the domain of image
data processing. This paper presents a bank aware dynamic loop
scheduling scheme for array-intensive embedded media applications.
The goal behind this new scheduling scheme is to minimize
the number of memory banks that need to be used for executing
the current working set (group of loop iterations) when all processors
are considered together. That is, during the loop
iteration-to-processor assignment, our approach considers the bank access
patterns of loop iterations and carefully selects the set of iterations
to assign to an idle processor so that, if possible, the number
of memory banks that are used at the current state is not increased.
Our experimental results show that the proposed scheduling
scheme leads to much better energy results when compared to
prior loop scheduling techniques and it is also competitive with the
scheduler that generates the best performance. To our knowledge,
this is the first dynamic loop scheduling scheme that is memory
bank aware.
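
The selection rule described above (give an idle processor the iterations that add the fewest memory banks to those already in use) can be sketched as below. This is a simplified reading of the abstract and not the authors' scheduler: iterations are pre-grouped into chunks, each chunk's bank set is assumed to be known in advance, and the names `pick_chunk`, `pending`, and `running` are hypothetical.

```python
def pick_chunk(pending, running):
    """Pick the next chunk of loop iterations for an idle processor.

    pending: dict mapping chunk id -> set of memory banks the chunk accesses
    running: dict mapping processor id -> set of banks used by its chunk
    Returns the id of the pending chunk that activates the fewest new
    banks (ties broken by smallest chunk id).
    """
    # Banks already active because of chunks running on other processors.
    active = set().union(*running.values())
    return min(pending, key=lambda c: (len(pending[c] - active), c))


if __name__ == "__main__":
    pending = {0: {0, 1}, 1: {2, 3}, 2: {1}, 3: {0, 2}}
    running = {"cpu0": {1}}
    chunk = pick_chunk(pending, running)
    print(chunk, pending[chunk])
```

A purely performance-driven dynamic scheduler would hand out chunk 0 next; the bank-aware rule prefers chunk 2 here because it touches only a bank that is already powered.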
-
System Level Clock Tree Synthesis for Power Optimization [p. 1677]
-
S.A. Butt, S. Schmermbeck, J. Rosenthal, A. Pratsch and E. Schmidt
The clock tree is the interconnect net on Systems-on-Chip
(SoCs) with the heaviest load and consumes up to 40% of
the overall power budget. Substantial savings of the overall
power dissipations are possible by optimizing the clock
tree. Although these savings are already relevant at the system level,
little effort has been made to consider the clock
tree at higher levels of abstraction. This paper shows how
the clock-tree can be integrated into system-level power estimation
and optimization. A clock tree routing algorithm is
chosen, adapted to the system-level and then integrated into
an algorithmic-level power optimization tool. Experimental
results demonstrate the importance of the clock tree for
system-level power optimization.