DATE 2009 ABSTRACTS
-
Has Anything Changed in Electronic Design Since 1983? [p. 1]
-
M. Muller
In this talk Mike answers the question: has anything changed in electronic design since 1983? That was the year in
which work began on the design of the first ARM microprocessor chipset for a home computer. Leading-edge
semiconductor manufacturing was at 3000 nanometers (nm) and COBOL was the world's most popular
programming language. 25 years later, has system design really changed that much when today's Hi-Fi remote
control has the same architecture as the original home computers? Is designing for a 32 nm process any different?
Does programming in Java and JavaScript change anything? Recent changes in the power/performance scaling of
semiconductor processes and the increase in variability fundamentally challenge the design assumptions we have
been comfortable with for so many years and require new approaches to system architecture, microarchitecture and
device architecture. Changing consumer expectations also require product manufacturers to increasingly provide
services to complete their offerings, dramatically changing the importance that software plays in the design process.
-
Embedded Systems Design - Scientific Challenges and Work Directions [p. 2]
-
J. Sifakis
The development of a satisfactory Embedded Systems Design Science provides a timely challenge and opportunity
for reinvigorating Computer Science.
Embedded systems are components integrating software and hardware jointly and specifically designed to provide
given functionalities, which are often critical. They are used in many application areas including transport,
consumer electronics and electrical appliances, energy distribution, manufacturing systems etc.
Embedded systems design requires techniques taking into account extra-functional requirements regarding optimal
use of resources such as time, memory and energy while ensuring autonomy, reactivity and robustness.
Jointly taking into account these requirements raises a grand scientific and technical challenge extending Computer
Science with paradigms and methods from Control Theory and Electrical Engineering. Computer Science is based
on discrete computation models not encompassing physical time and resources which are by their nature very
different from analytic models used by other engineering disciplines.
We summarise some current trends in embedded systems design and point out some of their characteristics, such as
the chasm between analytical and computational models and the gap between safety critical and best-effort
engineering practices. We call for a coherent scientific foundation for embedded systems design, and we discuss a
few key demands on such a foundation: the need for encompassing several manifestations of heterogeneity, and the
need for design paradigms ensuring constructivity and adaptivity. We discuss main aspects of this challenge and
associated research directions for different areas such as modelling, programming, compilers, operating systems and
networks.
Moderators: P. Paulin, STMicroelectronics, FR; G. Nicolescu, Polytechnique Montreal, CA
-
A Low-Power Fat Tree-Based Optical Network-on-Chip for Multiprocessor System-on-Chip
[p. 3]
-
H. Gu, J. Xu and W. Zhang
Multiprocessor system-on-chip (MPSoC) is an
attractive platform for high-performance applications.
Networks-on-Chip (NoCs) can improve the on-chip communication
bandwidth of MPSoCs. However, traditional metallic
interconnects consume a significant amount of power to deliver
even higher communication bandwidth required in the near
future. Optical NoCs are based on CMOS-compatible optical
waveguides and microresonators, and promise significant
bandwidth and power advantages. This paper proposes a fat
tree-based optical NoC (FONoC) including its topology, floorplan,
protocols, and a low-power and low-cost optical router, optical
turnaround router (OTAR). Different from other optical NoCs,
FONoC does not require building a separate electronic NoC for
network control. It carries both payload data and network
control data on the same optical network, while using circuit
switching for the former and packet switching for the latter. The
FONoC protocols are designed to minimize network control data
and the related power consumption. An optimized turnaround
routing algorithm is designed to utilize the low-power feature of
OTAR, which can passively route packets without powering on
any microresonator in 40% of all cases. Compared with other
optical routers, OTAR has the lowest optical power loss and uses
the lowest number of microresonators. An analytical model is
developed to characterize the power consumption of FONoC. We
compare the power consumption of FONoC with a matched
electronic NoC in 45 nm, and show that FONoC can save 87%
power compared with the electronic NoC on a 64-core MPSoC.
We simulate the FONoC for the 64-core MPSoC and show the
end-to-end delay and network throughput under different
offered loads and packet sizes.
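The general idea of turnaround routing in a fat tree can be sketched as follows: a packet climbs until it reaches a switch that is a common ancestor of source and destination, then turns around and descends. This is only an illustration under assumptions, not the OTAR algorithm or the FONoC protocol from the paper. The binary (two-children-per-switch) tree, the leaf numbering, and the name `turnaround_hops` are hypothetical.

```python
def turnaround_hops(src: int, dst: int, levels: int) -> tuple[int, int]:
    """Return (up_hops, down_hops) between two leaves of a binary fat tree.

    Assumption (not from the paper): leaves are numbered 0 .. 2**levels - 1,
    and all leaves below one switch at height h share the address bits
    above bit h.
    """
    if not (0 <= src < 2 ** levels and 0 <= dst < 2 ** levels):
        raise ValueError("leaf index out of range")
    if src == dst:
        return (0, 0)
    # The lowest common ancestor sits at the height given by the
    # highest bit in which the two leaf addresses differ.
    height = (src ^ dst).bit_length()
    return (height, height)


# 64 leaves -> 6 levels in this assumed binary tree.
print(turnaround_hops(0, 1, 6))    # neighbours share their first-level switch
print(turnaround_hops(0, 63, 6))   # opposite ends must climb to the root
```

Traffic between nearby leaves turns around low in the tree, so it touches fewer switches than traffic that has to reach the root.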
-
SunFloor 3D: A Tool for Networks on Chip Topology Synthesis for 3D Systems on Chips
[p. 9]
-
C. Seiculescu, S. Murali, L. Benini and G. De Micheli
Three-dimensional integrated circuits are a promising approach to
address the integration challenges faced by current Systems on Chips
(SoCs). Designing an efficient Network on Chip (NoC) interconnect
for a 3D SoC that not only meets the application performance
constraints, but also the constraints imposed by the 3D technology,
is a significant challenge. In this work we present a design tool,
SunFloor 3D, to synthesize application-specific 3D NoCs. The proposed
tool determines the best NoC topology for the application,
finds paths for the communication flows, assigns the network components
on to the 3D layers and performs a placement of them in
each layer. We perform experiments on several SoC benchmarks
and present a comparative study between 3D and 2D NoC designs.
Our studies show large improvements in interconnect power consumption
(average of 38%) and delay (average of 13%) for the
3D NoC when compared to the corresponding 2D implementation.
Our studies also show that the synthesized topologies result in large
power (average of 54%) and delay savings (average of 21%) when
compared to standard topologies.
Keywords: 3D ICs, Networks on chip (NoC), synthesis, topology, placement
-
User-Centric Design Space Exploration for Heterogeneous Network-on-Chip Platforms
[p. 15]
-
C.-L. Chou and R. Marculescu
In this paper, we present a design methodology
for automatic platform generation of future heterogeneous
systems where communication happens via the Network-on-Chip
(NoC) approach. As a novel contribution, we explicitly
incorporate information about the user experience into a
design flow which aims at minimizing the workload variance;
this allows the system to better adapt to different types of user
needs and workload variations. More specifically, we first collect
various user traces from various applications and generate
specific clusters using machine learning techniques. For
each cluster of such user traces, depending on the architectural
parameters extracted from high-level specifications, we
propose an optimization method to generate the NoC system
architecture. Finally, we validate the user-centric design space
exploration using realistic traces and compare it to the traditional
NoC design methodology.
-
A Highly Resilient Routing Algorithm for Fault-Tolerant NoCs
[p. 21]
-
D. Fick, A. DeOrio, G. Chen, V. Bertacco, D. Sylvester and D. Blaauw
Current trends in technology scaling foreshadow worsening transistor
reliability as well as greater numbers of transistors in each
system. The combination of these factors will soon make long-term
product reliability extremely difficult in complex modern systems
such as systems on a chip (SoC) and chip multiprocessor (CMP)
designs, where even a single device failure can cause fatal system
errors. Resiliency to device failure will be a necessary condition at
future technology nodes. In this work, we present a network-onchip
(NoC) routing algorithm to boost the robustness in interconnect
networks, by reconfiguring them to avoid faulty components
while maintaining connectivity and correct operation. This distributed
algorithm can be implemented in hardware with less than
300 gates per network router. Experimental results over a broad
range of 2D-mesh and 2D-torus networks demonstrate 99.99% reliability
on average when 10% of the interconnect links have failed.
Moderators: G. Sassatelli, LIRMM, FR; M. Huebner, Karlsruhe U, DE
-
Mapping of a Film Grain Removal Algorithm to a Heterogeneous Reconfigurable Architecture
[p. 27]
-
S. Whitty, H. Sahlbach, R. Ernst and W. Putzke-Roeming
Despite recent advances in FPGA, GPU, and general-purpose
processor technologies, the challenges posed by real-time
digital image processing at high resolutions cannot be fully
overcome due to insufficient processing capability, inadequate
data transport and control mechanisms, and often prohibitively
high costs. To address these issues, we proposed a two-phase
solution for a real-time film grain noise reduction application.
The first phase is based on a state-of-the-art FPGA platform
used as a reference design. The second phase is based on a novel
heterogeneous reconfigurable computing platform that offers
flexibility not available from other computing paradigms. This
paper introduces the heterogeneous platform and briefly reviews
our previous work with the application in question, as well as its
implementation on the FPGA demonstration board during the
first phase. Then we present a decomposition of the application,
which allows an efficient mapping to the new heterogeneous
computing platform through the use of its diverse reconfigurable
computing units and run-time reconfiguration.
-
An ILP Formulation for Task Mapping and Scheduling on Multi-Core Architectures
[p. 33]
-
Y. Yi, W. Han, X. Zhao, A.T. Erdogan and T. Arslan
Multi-core architectures are increasingly being adopted
in the design of emerging complex embedded systems. Key issues
of designing such systems are on-chip interconnects, memory
architecture, and task mapping and scheduling. This paper
presents an integer linear programming formulation for the task
mapping and scheduling problem. The technique incorporates
profiling-driven loop level task partitioning, task transformations,
functional pipelining, and memory architecture aware data
mapping to reduce system execution time. Experiments are
conducted to evaluate the technique by implementing a series of
DSP applications on several multi-core architectures based on
dynamically reconfigurable processor cores. The results
demonstrate that the proposed technique is able to generate
high-quality mappings of realistic applications on the target
multi-core architecture, achieving up to 1.3x parallel efficiency by
employing only two dynamically reconfigurable processor cores.
-
DPR in High Energy Physics
[p. 39]
-
W. Gao, A. Kugel, R. Maenner, N. Abel, N. Meier and U. Kebschull
The Active Buffer project is part of the CBM (compressed
baryonic matter) experiment and takes advantage
of the DPR (dynamic partial reconfiguration) technology,
in which a dynamic module can be reconfigured while the
static part and other dynamic modules keep running untouched.
Due to DPR, design flexibility and simplicity are
achieved at the same time. The correctness and the performance
have been verified by multiple tests.
-
A Flexible Layered Architecture for Accurate Digital Baseband Algorithm Development
and Verification
[p. 45]
-
A. Alimohammad, S. Fouladi Fard and B.F. Cockburn
Many emerging communication technologies significantly
increase the complexity of the physical layer and have
dramatically increased the number of operating configurations.
To ensure maximum performance, designers have to optimize
their algorithm implementations, which requires comprehensive
performance testing in all possible operating modes under
various channel conditions. This paper presents a flexible and
affordable framework for baseband algorithm development and
performance verification for digital communication systems with
an arbitrary number of modules, each operating at a possibly
different sampling rate with various latencies. The proposed
architecture is scalable to support complex scenarios, such
as multiple antenna systems, and is compact enough to be
implemented within a single field-programmable gate array.
Moderators: J. Teich, University of Erlangen-Nuremberg, DE; P. Marwedel, TU Dortmund, DE
-
Lifetime Reliability-Aware Task Allocation and Scheduling for MPSoC Platforms
[p. 51]
-
L. Huang, F. Yuan and Q. Xu
With the relentless scaling of semiconductor technology, the lifetime
reliability of embedded multiprocessor platforms has become one of
the major concerns for the industry. If this is not taken into consideration
during the task allocation and scheduling process, some
processors might age much faster than the others and become the
reliability bottleneck for the system, thus significantly reducing the
system's service life. To tackle this problem, in this paper, we propose
an analytical model to estimate the lifetime reliability of multiprocessor
platforms when executing periodical tasks, and we present a
novel lifetime reliability-aware task allocation and scheduling algorithm
based on the simulated annealing technique. In addition, to speed
up the annealing process, several techniques are proposed to simplify
the design space exploration process with satisfactory solution quality.
Experimental results on various multiprocessor platforms and
task graphs demonstrate the efficacy of the proposed approach.
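A minimal sketch of simulated annealing applied to task allocation is shown below. It is not the authors' algorithm: the abstract does not give their reliability model, so the cost used here (the load of the busiest processor, as a crude stand-in for "the processor that ages fastest") is an assumption, as are the function name `anneal_allocation` and all parameter values.

```python
import math
import random


def anneal_allocation(task_load, n_procs, steps=5000, t0=1.0, alpha=0.999, seed=0):
    """Assign tasks to processors by simulated annealing.

    Assumed cost: the largest per-processor load (a placeholder for a real
    lifetime-reliability model, which the abstract does not specify).
    """
    rng = random.Random(seed)
    assign = [rng.randrange(n_procs) for _ in task_load]

    def cost(a):
        loads = [0.0] * n_procs
        for task, proc in enumerate(a):
            loads[proc] += task_load[task]
        return max(loads)

    cur = cost(assign)
    best, best_cost = assign[:], cur
    temp = t0
    for _ in range(steps):
        # Neighbour move: reassign one randomly chosen task.
        task = rng.randrange(len(task_load))
        old = assign[task]
        assign[task] = rng.randrange(n_procs)
        new = cost(assign)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if new <= cur or rng.random() < math.exp((cur - new) / temp):
            cur = new
            if cur < best_cost:
                best, best_cost = assign[:], cur
        else:
            assign[task] = old  # reject: undo the move
        temp *= alpha
    return best, best_cost


allocation, worst_load = anneal_allocation([4, 3, 3, 2, 2, 2], n_procs=2)
print(allocation, worst_load)
```

Swapping in a different cost function changes what the search optimizes without touching the annealing loop, which is the usual reason to structure the search this way.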
-
Integrated Scheduling and Synthesis of Control Applications on Distributed Embedded Systems
[p. 57]
-
S. Samii, P. Eles, Z. Peng and A. Cervin
Many embedded control systems comprise several control loops
that are closed over a network of computation nodes. In such
systems, complex timing behavior and communication lead to
delay and jitter, which both degrade the performance of each
control loop and must be considered during the controller synthesis.
Also, the control performance should be taken into
account during system scheduling. The contribution of this
paper is a control-scheduling co-design method that integrates
controller design with both static and priority-based scheduling
of the tasks and messages, and in which the overall control
performance is optimized.
-
Towards No-Cost Adaptive MPSoC Static Schedules through Exploitation of Logical-to-Physical
Core Mapping Latitude
[p. 63]
-
C. Yang and A. Orailoglu
The computing engines of many current applications
are powered by MPSoC platforms, which promise significant
speedup but induce increased reliability problems as a result
of ever growing integration density and chip size. While static
MPSoC execution schedules deliver predictable worst-case performance,
the absence of dynamic variability unfortunately constrains
their usefulness in such an unreliable execution environment.
Adaptive static schedules with predictable responses to runtime
resource variations have consequently been proposed, yet the
extra constraints imposed by adaptivity on task assignment have
resulted in schedule length increases. We propose to eradicate
the associated performance degradation of such techniques while
retaining all the concomitant benefits, by exploiting an inherent
degree of freedom in task assignment regarding the logical to
physical core mapping. The proposed technique relies on the
use of core reordering and rotation through utilizing a graph
representation model, which enables a direct translation of
inter-core communication paths into order requirements between
cores. The algorithmic implementation results confirm that the
proposed technique can drastically reduce the schedule length
overhead of both pre- and post-reconfiguration schedules.
-
Pipelined Data Parallel Task Mapping/Scheduling Technique for MPSoC
[p. 69]
-
H. Yang and S. Ha
In this paper, we propose a multi-task
mapping/scheduling technique for heterogeneous and scalable
MPSoC. To utilize the large number of cores embedded in
MPSoC, the proposed technique considers temporal and data
parallelisms as well as task parallelism. We define a multi-task
mapping/scheduling problem with all these parallelisms and
propose a QEA (quantum-inspired evolutionary algorithm)-based
heuristic. Compared with an ILP (Integer Linear Programming)
approach, experiments with real-life examples show the
feasibility and the efficiency of the proposed technique.
Moderators: B. Becker, Freiburg U, DE; M. Psarakis, Piraeus U, GR
-
Joint Logic Restructuring and Pin Reordering against NBTI-Induced Performance Degradation
[p. 75]
-
K.-C. Wu and D. Marculescu
Negative Bias Temperature Instability (NBTI), a PMOS aging
phenomenon causing significant loss on circuit performance and
lifetime, has become a critical challenge for temporal reliability
concerns in nanoscale designs. Aggressive technology scaling
trends, such as thinner gate oxide without proportional
downscaling of supply voltage, necessitate a design optimization
flow considering NBTI effects at the early stages. In this paper,
we present a novel framework using joint logic restructuring and
pin reordering to mitigate NBTI-induced performance
degradation. Based on detecting functional symmetries and
transistor stacking effects, the proposed methodology involves
only wire perturbation and introduces no gate area overhead at
all. Experimental results reveal that, by using this approach, on
average 56% of performance loss due to NBTI can be recovered.
Moreover, our methodology reduces the number of critical
transistors remaining under severe NBTI and thus, transistor
resizing can be applied to further mitigate NBTI effects with low
area overhead.
-
A Self-Adaptive System Architecture to Address Transistor Aging
[p. 81]
-
O. Khan and S. Kundu
As semiconductor manufacturing enters the advanced
nanometer design paradigm, aging and device wear-out related
degradation is becoming a major concern. Negative Bias
Temperature Instability (NBTI) is one of the main sources of
device lifetime degradation. The severity of such degradation
depends on the operation history of a chip in the field, including
such characteristics as temperature and workloads. In this
paper, we propose a system level reliability management scheme
where a chip dynamically adjusts its own operating frequency
and supply voltage over time as the device ages. Major benefits
of the proposed approach are (i) increased performance due to
reduced frequency guard banding in the factory and (ii)
continuous field adjustments that take environmental operating
conditions such as actual room temperature and the power
supply tolerance into account. The greatest challenge in
implementing such a scheme is to perform calibration without a
tester. Much of this work is performed by a hypervisor-like
software with very little hardware assistance. This keeps both
the hardware overhead and the system complexity low. This
paper describes the entire system architecture including
hardware and software components. Our simulation data
indicates that under aggressive wear-out conditions, a scheduling
interval of days or weeks is sufficient to reconfigure and keep
the system operational, thus the run time overhead for such
adjustments is of no consequence at all.
-
Masking Timing Errors on Speed-Paths In Logic Circuits
[p. 87]
-
M.R. Choudhury and K. Mohanram
There is a growing concern about timing errors resulting from design
marginalities and the effects of circuit aging on speed-paths
in logic circuits. This paper presents a low overhead solution for
masking timing errors on speed-paths in logic circuits. Error masking
at the outputs of a logic circuit is achieved by synthesis of a non-intrusive
error-masking circuit that has at least 20% timing slack
over the original logic circuit. The error-masking circuit can also be
used to collect runtime information when the speed-paths are exercised
to (i) predict the onset of wearout and (ii) assist in in-system
silicon debug. Simulation results for several benchmark circuits
and modules from the OpenSPARC T1 processor are presented to
illustrate the effectiveness of the proposed solution. 100% masking
of timing errors on all speed-paths within 10% of the critical
path delay is achieved for all circuits with an average area (power)
overhead of 16% (18%).
Moderators: R. Dick, Northwestern U, US; R. Leupers, RWTH Aachen U, DE
-
WCRT Algebra and Interfaces for Esterel-Style Synchronous Processing
[p. 93]
-
M. Mendler, R. von Hanxleden and C. Traulsen
The synchronous model of computation together
with a suitable execution platform facilitates system-level timing
predictability. This paper introduces an algebraic framework for
precisely capturing worst case reaction time (WCRT) characteristics
for Esterel-style reactive processors with hardware-supported
multithreading. This framework provides a formal grounding
for the WCRT problem, and allows improving upon earlier
heuristics by accurately and modularly characterizing timing
interfaces.
-
Reliable Mode Changes in Real-Time Systems with Fixed Priority or EDF Scheduling
[p. 99]
-
N. Stoimenov, S. Perathoner and L. Thiele
Many application domains require adaptive real-time
embedded systems that can change their functionality over
time. In such systems it is not only necessary to guarantee
timing constraints in every operating mode, but also during
the transition between different modes. Known approaches that
address the problem of timing analysis over mode changes are
restricted to fixed priority scheduling policies. In addition, most
of them are also limited to simple periodic event stream models
and therefore, they cannot faithfully abstract the bursty timing
behavior which can be observed in embedded systems. In this
paper, we propose a new method for the design and analysis
of adaptive multi-mode systems that supports any event stream
model and can handle earliest deadline first (EDF) as well as fixed
priority (FP) scheduling of tasks. We embed the analysis method
into a well-established modular performance analysis framework
based on Real-Time Calculus and prove its applicability by
analyzing a case study.
-
Improved Worst-Case Response-Time Calculations by Upper-Bound Conditions
[p. 105]
-
V. Pollex, S. Kollman, K. Albers and F. Slomka
Fast real-time feasibility tests and analysis algorithms
are necessary for a high acceptance of the formal
techniques by industrial software engineers. This
paper presents a way to considerably reduce the computation
time required to calculate the worst-case response time
of a task in a fixed-priority task set with jitter. The correctness of
the approach is proven analytically and experimental
comparisons with the currently fastest known tests
show the improvement of the new method.
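For orientation, the classical fixed-point iteration for worst-case response time with release jitter can be sketched as below. This is the standard textbook recurrence that faster tests are usually measured against, not the upper-bound method of the paper. The tuple layout `(C, T, J)`, the function name `response_time`, and the `limit` cut-off are assumptions made for this example.

```python
def response_time(tasks, i, limit=10**6):
    """Worst-case response time of task i under fixed-priority scheduling.

    tasks: list of (C, T, J) integer tuples sorted by decreasing priority
           (C = execution time, T = period, J = release jitter).
    Iterates  w = C_i + sum_{j < i} ceil((w + J_j) / T_j) * C_j
    until it stops changing, then returns w + J_i.
    Returns None if w exceeds `limit` (treated here as "no bound found").
    """
    c_i, _, j_i = tasks[i]
    w = c_i
    while w <= limit:
        # -(-a // b) is integer ceiling division.
        interference = sum(-(-(w + j) // t) * c for c, t, j in tasks[:i])
        w_next = c_i + interference
        if w_next == w:
            return w + j_i
        w = w_next
    return None


task_set = [(1, 4, 0), (2, 6, 0), (3, 12, 0)]
print([response_time(task_set, k) for k in range(len(task_set))])  # [1, 3, 10]
```

Each iteration re-evaluates the interference of all higher-priority tasks, which is the cost that faster feasibility tests try to cut down.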
-
A Generalized Scheduling Approach for Dynamic Dataflow Applications
[p. 111]
-
W. Plishker, N. Sane and S.S. Bhattacharyya
For a number of years, dataflow concepts have
provided designers of digital signal processing systems with
environments capable of expressing high-level software architectures
as well as low-level, performance-oriented kernels. But
analysis of system-level trade-offs has been inhibited by the
diversity of models and the dynamic nature of modern dataflow
applications. To facilitate design space exploration for software
implementations of heterogeneous dataflow applications, developers
need tools capable of deeply analyzing and optimizing
the application. To this end, we present a new scheduling
approach that leverages a recently proposed general model
of dynamic dataflow called core functional dataflow (CFDF).
CFDF supports high-level application descriptions with multiple
models of dataflow by structuring actors with sets of modes
that represent fixed behaviors. In this work we show that by
decomposing a dynamic dataflow graph as directed by its modes,
we can derive a set of static dataflow graphs that interact
dynamically. This enables designers to readily apply
existing dataflow-model-specific scheduling techniques to all or
some parts of the application while applying custom schedulers
to others. We demonstrate this generalized dataflow scheduling
method on dynamic mixed-model applications and show that
run-time and buffer sizes significantly improve compared to a
baseline dynamic dataflow scheduler and simulator.
Moderators: P. Pop, TU Denmark, DK; R. Woods, Queens U Belfast, IE
-
Optimizing Data Flow Graphs to Minimize Hardware Implementation
[p. 117]
-
D. Gomez-Prado, Q. Ren, M. Ciesielski, J. Guillot and E. Boutillon
This paper describes an efficient graph-based
method to optimize data-flow expressions for best hardware
implementation. The method is based on factorization, common
subexpression elimination (CSE) and decomposition of algebraic
expressions performed on a canonical representation, Taylor Expansion
Diagram. The method is generic, applicable to arbitrary
algebraic expressions and does not require specific knowledge of
the application domain. Experimental results show that the DFGs
generated from such optimized expressions are better suited for
high level synthesis, and the final, scheduled implementations
are characterized, on average, by 15.5% lower latency and
7.6% better area than those obtained using traditional CSE and
algebraic decomposition.
-
Multi-Clock SOC Design Using Protocol Conversion
[p. 123]
-
R. Sinha, P.S. Roop, Z. Salcic and S. Basu
The automated design of SoCs from pre-selected IPs that may require different clocks
is challenging because of the following issues. Firstly, protocol mismatches between
IPs need to be resolved automatically before IPs are integrated. Secondly, the
presence of multiple clocks makes the protocol conversion even more difficult.
Thirdly, it is desirable that the resulting integration is correct-by-construction,
i.e., the resulting SoC satisfies given system-level specifications. All of these
issues have been studied extensively, although not in a unifying manner. In this paper
we propose a framework based on protocol conversion that addresses all these issues.
We have extensively studied many SoC design problems and show that the proposed
methodology is capable of handling them better than other known approaches. A
significant contribution of the proposed approach is that it nicely generalizes many
existing techniques for formal SoC design and integrates them into a single approach.
-
A Formal Approach to Design Space Exploration of Protocol Converters
[p. 129]
-
K. Avnit and A. Sowmya
In the field of chip design, hardware module reuse is a
standard solution to the increasing complexity of chip architecture
and the pressures to reduce time to market. In the
absence of a single module interface standard, integration
of pre-designed modules often requires the use of protocol
converters. For an arbitrary pair of incompatible protocols
it is likely that there exists more than one possible converter.
However, existing approaches to automatic synthesis of protocol
converters either produce a single suggested converter
or provide a general nondeterministic solution, out of which
a designer is required to extract a deterministic converter.
In this work we present a novel approach for design space
exploration of FSM based protocol converters. We present
algorithms for extraction of minimal converters for a given
pair of incompatible protocols. We demonstrate the process
through a simple example, and report on results of experiments
with converters for commercial protocols AMBA
ASB, APB and the Open Core Protocol (OCP). The experiments
show a reduction in the number of states in the converter
of as much as 62% (with an average reduction of
42%) and a reduction in the number of transitions of as
much as 85% (with an average reduction of 61%), demonstrating
the benefits of design space exploration.
-
Model-Based Synthesis and Optimization of Static Multi-Rate Image Processing Algorithms
[p. 135]
-
J. Keinert, H. Dutta, F. Hannig, C. Haubelt and J. Teich
High computational effort in modern image processing
applications like medical imaging or high-resolution video
processing often demands massively parallel special-purpose
architectures in the form of FPGAs or ASICs. However, their efficient
implementation is still a challenge, as the design complexity
causes exploding development times and costs. This paper
presents a new design flow which makes it possible to specify, analyze, and
synthesize complex image processing algorithms. A novel buffer
requirement analysis allows exploiting possible tradeoffs between
required communication memory and computational logic for
multi-rate applications. The derived schedule and buffer results
are taken into account for resource optimized synthesis of the
required hardware accelerators. Application to a multi-resolution
filter shows that buffer analysis is possible in less than one
second and that scheduling alternatives influence the required
communication memory by up to 24% and the computational
resources by up to 16%.
Organizer: M. Casale-Rossi, Synopsys, IT
Moderator: G. De Micheli, EPFL, CH
Panelists: A. Domic, M. Montalti, M. Muller, J. Sawicki
-
-
"Othello, The Moor of Venice" is undoubtedly one of the most famous plays by
William Shakespeare, and because of its themes - love, jealousy and betrayal -
it remains relevant to the present day... and to the electronic industry!
Like Othello, EDA vendors confront their Desdemona IC vendor customers, accusing
them of adultery whenever they seem to look for somebody else's - Cassio - EDA
technology, pretending they consolidate on a single-vendor EDA solution.
According to these high-tech Othellos, a cooperative approach and a faithful
partner are indeed essential ingredients towards the allocation of the huge
financial and R&D resources that are required to improve existing EDA
technology and develop the new EDA technology that is badly needed as the
challenges of nanometer IC designs become more complex. Adultery, besides being a
serious relationship issue, harms the EDA innovations' stream, which is so
necessary to fuel the electronic industry's growth.
Are these modern Othellos visionary heroes, trying to spare their Desdemona the
perils of slipshod solutions, which may harm silicon success at 45 nanometers and
thereafter, or are they egoistic fools, only interested in making as much revenue
as possible by enforcing groundless franchises? Is Desdemona sincerely interested
in advancing design technology, or just playing the "one vendor against the other"
game to save extra money? And finally, who's playing the role of Iago - the
sinister villain who sets out to foment disunity among the other players? In this
panel, Desdemona will explain the reasons why she's given Cassio her handkerchief,
while Othello will try to convince Desdemona that she should be loyal and
faithful to him, truly partnering to fight the challenges of nanometer IC designs.
Is there a happy end for Othello and Desdemona?
Moderators: M. Miranda, IMEC, BE; W. Dehaene, KU Leuven, BE
-
Variation Resilient Adaptive Controller for Subthreshold Circuits
[p. 142]
-
B. Mishra, B.M. Al-Hashimi and M. Zwolinski
Subthreshold logic is showing good promise as
a viable ultra-low-power circuit design technique for power-limited
applications. For this design technique to gain widespread
adoption, one of the most pressing concerns is how to improve
the robustness of subthreshold logic to process and temperature
variations. We propose a variation resilient adaptive controller
for subthreshold circuits with the following novel features: a new
sensor based on a time-to-digital converter for capturing the
variations accurately as digital signatures, and an all-digital DC-DC
converter incorporating the sensor, capable of generating an
operating Vdd from 0V to 1.2V with a resolution of
18.75mV, suitable for subthreshold circuit operation. The benefit
of the proposed controller is reflected in an energy improvement
of up to 55% compared to when no controller is employed. The
detailed implementation and validation of the proposed controller
are discussed.
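The stated output range and resolution fix the number of supply steps. The following is a quick arithmetic check only; the helper name `code_to_vdd_mv` and the linear code-to-voltage mapping are assumptions made for illustration and are not taken from the paper.

```python
# Work in microvolts so the arithmetic stays exact (no floating-point rounding).
VDD_MAX_UV = 1_200_000   # 1.2 V upper bound, from the abstract
STEP_UV = 18_750         # 18.75 mV resolution, from the abstract

# Number of equal steps between 0 V and 1.2 V.
steps = VDD_MAX_UV // STEP_UV
assert VDD_MAX_UV % STEP_UV == 0  # the range is a whole number of steps


def code_to_vdd_mv(code: int) -> float:
    """Hypothetical linear mapping from a digital control code to Vdd in mV."""
    if not 0 <= code <= steps:
        raise ValueError("code out of range")
    return code * STEP_UV / 1000


print(steps)               # 64
print(code_to_vdd_mv(16))  # 300.0
```

The range therefore divides into exactly 64 steps of 18.75 mV; under the assumed mapping, code 16 corresponds to 300 mV.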
-
Minimization of NBTI Performance Degradation Using Internal Node Control
[p. 148]
-
D.R. Bild, G.E. Bok and R.P. Dick
Negative Bias Temperature Instability (NBTI) is
a significant reliability concern for nanoscale CMOS circuits.
Its effects on circuit timing can be especially pronounced for
circuits with standby-mode equipped functional units because
these units can be subjected to static NBTI stress for extended
periods of time. This paper proposes internal node control, in
which the inputs to individual gates are directly manipulated to
prevent this static NBTI fatigue. We give a mixed integer linear
program formulation for an optimal solution to this problem.
The optimal placement of internal node control yields an average
26.7% reduction in NBTI-induced delay over a ten-year period
for the ISCAS85 benchmarks. We find that the problem is
NP-complete and present a linear-time heuristic that can be used
to quickly find near-optimal solutions. The heuristic solutions are,
on average, within 0.17% of optimal and all were within 0.60%
of optimal.
-
Physically Clustered Forward Body Biasing for Variability Compensation in Nano-Meter CMOS Design
[p. 154]
-
A. Sathanur, A. Pullini, G. De Micheli, L. Benini and E. Macii
Nanometer CMOS scaling has resulted in greatly increased circuit
variability, with extremely adverse consequences on design predictability
and yield. A number of recent works have focused on
adaptive post-fabrication tuning approaches to mitigate this problem.
Adaptive Body Bias (ABB) is one of the most successful
tuning "knobs" in use today in high-performance custom design.
Through forward body bias (FBB), the threshold voltage of the
CMOS devices can be reduced after fabrication to bring the slow
dies back to within the range of acceptable specs. FBB is usually
applied with a very coarse core-level granularity at the price of a
significantly increased leakage power. In this paper, we propose
a novel, physically clustered FBB scheme for the row-based standard-cell
layout style that enables selective forward body biasing of only
the rows that contain the most timing-critical gates, thereby reducing
leakage power overhead. We propose exact and heuristic algorithms
to partition the design and allocate optimal body bias voltages
to achieve minimum leakage power overhead. This style is
fully compatible with state-of-the-art commercial physical design
flows and imposes minimal area blowup. Benchmark results show
large leakage power savings with a maximum savings of 30% in
case of 5% compensation and 47.6% in case of 10% compensation
with respect to block-level FBB and minimal implementation area
overhead.
-
An Event-Guided Approach to Reducing Voltage Noise in Processors
[p. 160]
-
M.S. Gupta, V.J. Reddi, G. Holloway, G.-Y. Wei and D. Brooks
Supply voltage fluctuations that result from inductive
noise are increasingly troublesome in modern microprocessors.
A voltage "emergency", i.e., a swing beyond tolerable
operating margins, jeopardizes the safe and correct operation
of the processor. Techniques aimed at reducing power consumption,
e.g., by clock gating or by reducing nominal supply
voltage, exacerbate this noise problem, requiring ever-wider
operating margins. We propose an event-guided, adaptive method
for avoiding voltage emergencies, which exploits the fact that
most emergencies are correlated with unique microarchitectural
events, such as cache misses or the pipeline flushes that follow
branch mispredictions. Using checkpoint and rollback to handle
unavoidable emergencies, our method adapts dynamically by
learning to trigger avoidance mechanisms when emergency-prone
events recur. After tightening supply voltage margins to increase
clock frequency and accounting for all costs, the net result is a
performance improvement of 8% across a suite of fifteen SPEC
CPU2000 benchmarks.
Moderators: R. Cottrell, Altera European Technology Centre; C. Heer, Infineon Technologies, DE
-
Design and Implementation of a Database Filter for BLAST Acceleration
[p. 166]
-
P. Afratis, C. Galanakis, E. Sotiriades, G.-G. Mplemenos, G. Chrysos, I. Papaefstathiou and D. Pnevmatikatos
BLAST is a very popular Computational
Biology algorithm. Since it is computationally expensive, it
is a natural target for acceleration research, and many
reconfigurable architectures have been proposed offering
significant improvements.
In this paper we take a different approach to the same
problem: we propose a BLAST algorithm
preprocessor that efficiently identifies the portions of the
database that must be processed by the full algorithm in
order to find the complete set of desired results. We show
that this preprocessing is feasible and quick, and requires
minimal FPGA resources, while achieving a significant
reduction in the size of the database that needs to be
processed by BLAST. We also determine the parameters
under which prefiltering is guaranteed to identify the same
set of solutions as the original NCBI software.
We model our preprocessor in VHDL and implement it
on a reconfigurable architecture. To evaluate the
performance, we use a large collection of datasets and compare
against the original (NCBI) software. Prefiltering is able to
determine that between 80 and 99.9% of the database will
not produce matches and can be safely ignored. Processing
only the remaining portions using software such as NCBI-BLAST
improves the system performance (reduces
execution time) by 3 to 15 times. Since our prefiltering
technique is generic, it can be combined with any other
software or reconfigurable acceleration technique.
-
A Software-Supported Methodology for Exploring Interconnection Architectures Targeting
3-D FPGAs
[p. 172]
-
K. Siozios, V.F. Pavlidis and D. Soudris
Interconnect structures significantly contribute to the
delay, power consumption, and silicon area of modern
reconfigurable architectures. The demand for higher clock
frequencies and logic densities is also important for the Field-Programmable
Gate Array (FPGA) paradigm. Three-dimensional
(3-D) integration can alleviate such performance
limitations by accommodating a number of additional silicon
layers. However, the benefits of 3-D integration have yet to be
sufficiently investigated. In this paper, we propose a software-supported
methodology to explore and evaluate 3-D FPGAs
fabricated with alternative technologies. Based on the evaluation
results, the proposed FPGA device improves speed and energy
dissipation by approximately 38% and 26%, respectively, as
compared to 2-D FPGAs. Furthermore, these gains are achieved
in addition to reducing the interlayer connections, as compared
to existing design approaches, leading to cheaper and more
reliable architectures.
Keywords: FPGA; 3-D integration; interconnection architectures;
CAD tools
-
Priority-Based Packet Communication on a Bus-Shaped Structure for FPGA-Systems
[p. 178]
-
O. Sander, B. Glas, C. Roth, J. Becker and K.D. Mueller-Glaser
We present an application-tailored, packet-based
SoC communication system with one-hop communication between
all entities, priority-based arbitration, and broadcast and
multicast support on a bus-shaped basis. It is positioned as a
hybrid between NoC and bus approaches, closing the gap for
mostly streaming-based systems that need highly flexible
communication patterns and multicast messages below
a certain size. The system is implemented and evaluated on an
FPGA within a car-to-car communication gateway application.
-
Exploration of Power Reduction and Performance Enhancement in LEON3 Processor with ESL
Reprogrammable eFPGA in Processor Pipeline and as a Co-Processor
[p. 184]
-
S.Z. Ahmed, J. Eydoux, L. Rouge, J.-P. Cuelle, G. Sassatelli and L. Torres
We will explore how the processing power of the LEON3 processor
can be enhanced by connecting a small, commercially available
embedded FPGA (eFPGA) IP to the processor. We will analyze the
integration of the eFPGA with LEON3 in two ways: inside the
processor pipeline and as a co-processor. The enhanced
processing power helps to reduce dynamic power consumption through
Dynamic Frequency Scaling. More computational power at a lower
frequency allows the chip to be fabricated in an LP (Low Power) process
rather than a GP (General Purpose) process, which significantly
reduces static power, a crucial issue at and
beyond 90nm technologies.
The use of a reconfigurable accelerator raises the questions of
programming complexity, HW/SW partitioning and silicon
overhead. We will show that the silicon overhead of the eFPGA is small
compared to the benefits that can be obtained with it. We will
present a profiling tool which we created for our experiments. To
analyze the issue of programming complexity, we have explored
the state-of-the-art Catapult™ ESL tool from Mentor Graphics®.
Organiser/Moderator: W. Mueller, Paderborn U, DE
-
Functional Qualification of TLM Verification
[p. 190]
-
N. Bombieri, F. Fummi, G. Pravadelli, M. Hampton and F. Letombe
The topic will cover the use of functional qualification
for measuring the quality of functional verification of
TLM models. Functional qualification is based on the theory of
mutation analysis but considers a mutation to have been killed
only if a testcase fails. A mutation model of TLM behaviors is
proposed to qualify a verification environment based on both
testcases and assertions. The presentation first describes the
theoretical aspects of this topic and then focuses on its application
to real cases using actual EDA tools, showing the advantages
and limitations of applying mutation analysis to TLM.
-
Solver Technology for System-level to RTL Equivalence Checking
[p. 196]
-
A. Koelbl, R. Jacoby, H. Jain and C. Pixley
Checking the equivalence of a system-level model
against an RTL design is a major challenge. The reason is
that usually the system-level model is written by a system
architect, whereas the RTL implementation is created by a
hardware designer. This approach leads to two models that
are significantly different. Checking the equivalence of real-life
designs requires strong solver technology. The challenges can
only be overcome with a combination of bit-level and word-level
reasoning techniques, combined with the right orchestration. In
this paper, we discuss solver technology that has been shown to be
effective on many real-life equivalence checking problems.
Moderators: F. Novak, Josef Stefan Institute, SI; V. Singh, Indian Institute of Science, IN
-
A High-Level Debug Environment for Communication-Centric Debug
[p. 202]
-
K. Goossens, B. Vermeulen and A.B. Nejad
A large part of a modern SOC's debug complexity
resides in the interaction between the main system components.
Transaction-level debug moves the abstraction level of the debug
process up from the bit and cycle level to the transactions
between IP blocks. In this paper we raise the debug abstraction
level further, by utilising structural and temporal abstraction
techniques, combined with debug data interpretation and logical
communication views. The combination of these techniques and
views allows us, among other things, to single-step and observe the
operation of the network on a per-connection basis. As an
example, we show how these higher-level abstractions have been
implemented in the debug environment for the Æthereal NOC
architecture and present a generic debug API, which can be used
to visualise an SOC's state at the logical communication level.
-
Cache Aware Compression for Processor Debug Support
[p. 208]
-
A. Vishnoi, P.R. Panda and M. Balakrishnan
During post-silicon processor debugging, we need
to frequently capture and dump out the internal state of the
processor. Since the internal state comprises all memory elements,
the bulk of which is composed of cache, the problem is essentially
that of transferring cache contents off-chip, to a logic analyzer.
In order to reduce the transfer time and save expensive logic
analyzer memory, we propose to compress the cache contents
on their way out. We present a hardware compression engine
for cache data using a Cache Aware Compression strategy that
exploits knowledge of the cache fields and their behavior to
achieve an effective compression. Experimental results indicate
that the technique results in 7-31% better compression than
one that treats the data as just one long bit stream. We also
describe and evaluate a parallel compression architecture that
uses multiple compression engines, resulting in a 54% reduction
in transfer time.
-
Fault Insertion Testing of a Novel CPLD-Based Fail-Safe System
[p. 214]
-
G. Griessnig, R. Mader, C. Steger and R. Weiss
According to the IEC 61508 standard, fault insertion
testing is required for the verification of fail-safe systems. Usually
these systems are realized with microcontrollers. Fail-safe
systems based on a novel CPLD-based architecture require a
different method to perform fault insertion testing than
microcontroller-based systems. This paper describes a method to
accomplish fault insertion testing of a system based on the novel
CPLD-based architecture using the original system hardware.
The goal is to verify the realized safety integrity measures of the
system by inserting faults and observing the behavior of the
system. The described method exploits the fact that the system
contains two channels, each of which contains a CPLD.
During a test one CPLD is configured using a modified
programming file. This file is available after the compilation of a
VHDL-description, which was modified using saboteurs or
mutants. This allows injecting a fault into this CPLD. The other
CPLD is configured as a fault-free device. The entire system has to
detect the injected fault using its safety integrity measures.
Consequently, it has to enter and/or maintain a safe state.
Keywords: IEC 61508; fail-safe system; safety integrity; fault
insertion testing; fault injection; CPLD; VHDL
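The two-channel test idea can be sketched in software. This is a toy model only: the channel logic (a simple AND of two inputs), the stuck-at saboteur, and the comparator-based safety measure are all hypothetical stand-ins, since the abstract does not describe the actual CPLD logic or its safety integrity measures.

```python
def channel_logic(a, b, stuck_at=None):
    """Hypothetical channel logic: enable the output only if both inputs agree.

    `stuck_at` models a saboteur that forces the internal signal to a fixed
    value, as a modified programming file might do in one CPLD.
    """
    enable = a and b
    if stuck_at is not None:
        enable = stuck_at          # saboteur overrides the computed signal
    return enable


def system_step(a, b, fault_in_channel_1=None):
    """Run both channels and cross-compare them (assumed safety measure)."""
    out1 = channel_logic(a, b, stuck_at=fault_in_channel_1)  # faulty device
    out2 = channel_logic(a, b)                               # fault-free device
    if out1 != out2:
        return "SAFE_STATE"        # mismatch detected: fail safe
    return "ENABLED" if out1 else "DISABLED"


print(system_step(True, True))                            # ENABLED
print(system_step(True, False, fault_in_channel_1=True))  # SAFE_STATE
print(system_step(True, True, fault_in_channel_1=True))   # ENABLED
```

The last call shows a stuck-at-1 fault that stays latent for this input: the comparison only exposes the fault when the stimulus drives the fault-free channel to the opposite value, so the choice of test inputs matters.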
-
Test Architecture Design and Optimization for Three-Dimensional SoCs
[p. 220]
-
L. Jiang, L. Huang and Q. Xu
Core-based systems-on-chip (SoCs) fabricated on three-dimensional
(3D) technology are emerging for better integration
capabilities. Effective test architecture design and
optimization techniques are essential to minimize the manufacturing
cost for such giga-scale integrated circuits. In this
paper, we propose novel test solutions for 3D SoCs manufactured
with die-to-wafer and die-to-die bonding techniques.
Both testing time and routing cost associated with
the test access mechanisms in 3D SoCs are considered in
our simulated annealing-based technique. Experimental results
on ITC'02 SoC benchmark circuits are compared to
those obtained with two baseline solutions, which show the
effectiveness of the proposed technique.
Moderators: P. Mosterman, The MathWorks, US; E. Villar, Cantabria U, ES
-
A Co-Design Approach for Embedded System Modeling and Code Generation with UML and MARTE
[p. 226]
-
J. Vidal, F. de Lamotte, G. Gogniat, P. Soulard and J.-P. Diguet
In this paper we propose a UML/MDA approach,
called MoPCoM methodology, to design high quality real-time
embedded systems. We have defined a set of rules to build
UML models for embedded systems, from which VHDL code
is automatically generated by means of MDA techniques. We use
the MARTE profile as a UML extension to describe real-time
properties and perform platform modeling.
The MoPCoM methodology defines three abstraction levels:
abstract, execution and detailed modeling levels (AML, EML
and DML, respectively). We detail the design rules of the lowest
MoPCoM level, DML, which enable automatic VHDL
code generation. A Viterbi coder has been used as a first case
study.
-
Componentizing Hardware/Software Interface Design
[p. 232]
-
K. Hao and F. Xie
Building highly optimized embedded systems demands
hardware/software (HW/SW) co-design. A key challenge
in co-design is the design of HW/SW interfaces, which is often
a design bottleneck. We propose a novel approach to HW/SW
interface design based on the concept of bridge component.
Bridge components fill the HW/SW semantic gap by propagating
events across the HW/SW boundary and raise the abstraction
level for designing HW/SW interfaces by abstracting processors,
buses, embedded OS, etc. of embedded system platforms. Bridge
components are specified in platform-specific Bridge Specification
Languages (BSLs) and compiled by the BSL compilers for
simulation and deployment. We have applied our approach to two
different embedded system platforms. Case studies have shown
that bridge components greatly simplify component-based co-design
of embedded systems and that system simulation speed can
be improved by three orders of magnitude by simulating bridge
components on the transaction level.
-
A UML Frontend for IP-XACT-Based IP Management
[p. 238]
-
T. Schattkowsky, T. Xie and W. Mueller
IP-XACT is a well-accepted standard for the exchange of
IP components at Electronic System and Register Transfer Level.
Still, the creation and manipulation of these descriptions at the XML
level can be time-consuming and error-prone. In this paper, we show
that the UML can be consistently applied as an efficient and
comprehensible frontend for IP-XACT-based IP description and
integration. For this, we present an IP-XACT UML profile that
enables UML-based descriptions covering the same information as a
corresponding IP-XACT description. This enables the automated
generation of IP-XACT component and design descriptions from
respective UML models. In particular, it also allows the integration of
existing IPs with UML. To illustrate our approach, we present an
application example based on the IBM PowerPC Evaluation Kit.
Keywords: ESL design, RTL design, IP-XACT, IP Management,
UML Profile
-
Evaluating UML2 Modeling of IP-XACT Objects for Automatic MP-SoC Integration onto FPGA
[p. 244]
-
T. Arpinen, T. Koskinen, E. Salminen, T.D. Hamalainen and M. Hannikainen
IP-XACT is a standard for describing intellectual
property metadata for System-on-Chip (SoC) integration. Recently
researchers have proposed visualizing and abstracting
IP-XACT objects using structural UML2 model elements and
diagrams. Despite the number of proposals at the conceptual level,
experience with utilizing this representation in practical SoC
development environments is very limited. This paper presents
how UML2 models of IP-XACT features can be utilized to
efficiently design and implement a multiprocessor SoC prototype
on FPGA. The main contribution of this paper is the experimental
development of a multiprocessor platform on FPGA using UML2
design capture, IP-XACT compatible components, and design
automation tools. In addition, modeling concepts are improved
from earlier work for the utilized integration methodology.
Moderators: T. Basten, Twente U, NL; S. Yoo, POSTECH (Pohang U of Science and Technology), KR
-
aelite: A Flit-Synchronous Network on Chip with Composable and Predictable Services
[p. 250]
-
A. Hansson, M. Subburaman and K. Goossens
To accommodate the growing number of applications
integrated on a single chip, Networks on Chip (NoC) must
offer scalability not only on the architectural, but also on the
physical and functional level. In addition, real-time applications
require Guaranteed Services (GS), with latency and throughput
bounds. Traditionally, NoC architectures only deliver scalability
on two of the aforementioned three levels, or do not offer GS.
In this paper we present the composable and predictable aelite
NoC architecture, which offers only GS, based on flit-synchronous
Time Division Multiplexing (TDM). In contrast to other TDM-based
NoCs, scalability on the physical level is achieved by using
mesochronous or asynchronous links. Functional scalability is
accomplished by completely isolating applications, and by having
a router architecture that does not limit the number of service
levels or connections. We demonstrate how aelite delivers the
requested service to hundreds of simultaneous connections, and
does so with 5 times less area compared to a state-of-the-art NoC.
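To make the TDM idea concrete, the sketch below allocates contention-free slots along a path. It assumes the scheme commonly used in TDM-based NoCs, where a flit that occupies slot s on one link occupies slot s+1 (modulo the table size) on the next link; the abstract does not spell out this detail, and the function name, link names, and table size are hypothetical.

```python
def allocate(slot_tables, path, conn_id, table_size):
    """Reserve one TDM slot per link along `path` for connection `conn_id`.

    slot_tables maps a link name to a list of length `table_size` holding the
    owning connection of each slot (None if free). Returns the injection slot,
    or None if no contention-free slot sequence exists.
    """
    for start in range(table_size):
        # Assumed rule: the slot index advances by one at every hop.
        slots = [(start + hop) % table_size for hop in range(len(path))]
        if all(slot_tables[link][s] is None for link, s in zip(path, slots)):
            for link, s in zip(path, slots):
                slot_tables[link][s] = conn_id
            return start
    return None


TABLE_SIZE = 4
tables = {link: [None] * TABLE_SIZE for link in ("A->B", "B->C", "C->D")}

s1 = allocate(tables, ["A->B", "B->C"], "conn1", TABLE_SIZE)
s2 = allocate(tables, ["B->C", "C->D"], "conn2", TABLE_SIZE)
s3 = allocate(tables, ["A->B", "B->C"], "conn3", TABLE_SIZE)
print(s1, s2, s3)
print(tables["B->C"])
```

Because every slot on every link has at most one owner, flits of different connections never compete for a link. That property is what makes per-connection latency and throughput bounds possible in a scheme like this.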
-
Configurable Links for Runtime Adaptive On-Chip Communication
[p. 256]
-
M.A. Al Faruque, T. Ebi and J. Henkel
Reliability concerns associated with upcoming technology
nodes coupled with unpredictable system scenarios resulting
from increasingly complex systems require considering
runtime adaptivity in all possible parts of future on-chip systems.
We present a novel configurable link which can change its
supported bandwidth on-demand at runtime (2X-Links) for an
adaptive on-chip communication architecture. We have evaluated
our results using real-time multi-media and the E3S application
benchmark suites. Our 2X-Links provide a higher throughput
of up to 36%, with an average throughput increase of 21.3%,
compared to the Normal-Full-Duplex-Links [12], [14], [17], [20]
and keep performance-related guarantees with as low as 50% of
the Normal-Full-Duplex-Links capacity. Our simulation shows
that when some links fail, the NoC with 2X-Links can recover from
these faults with an average probability of 82.2% whereas these
faults would be fatal for the Normal-Full-Duplex-Links.
-
Synthesis of Low-Overhead Configurable Source Routing Tables for Network Interfaces
[p. 262]
-
I. Loi, F. Angiolini and L. Benini
In on-chip multiprocessor communication, link failures
and dynamically changing application scenarios represent
demanding constraints for the provision of suitable Quality of
Service. Networks-on-Chip (NoCs) featuring dynamic routing
are a known way to tackle these issues, but deadlock freedom
and message ordering concerns arise. NoCs with configurable
routing, whereby the communication routes are explicitly chosen
at runtime out of a set of statically predefined alternatives,
provide intelligent adaptation without impacting the consistency
of traffic flows.
However, configurable source routing on a NoC platform
requires a design that provides fast path lookup coupled with low
area and power consumption. This paper presents an exploration
and synthesis approach that, depending on the required amount
of routing flexibility, can, for example, reduce the area cost of
the NoC routing tables by 3 to 15 times by adopting partially
reprogrammable routing logic instead of fully reprogrammable
tables. Further optimizations based on path redundancy
reduce the silicon cost by up to 17 times.
-
SCORES: A Scalable and Parametric Streams-Based Communication Architecture for Modular
Reconfigurable Systems
[p. 268]
-
A. Jara-Berrocal and A. Gordon-Ross
Parallel architectures have become an increasingly
popular way to achieve high performance with low
power consumption. In order to leverage these benefits,
applications are decomposed into multiple computational
modules (tasks) that collectively operate and communicate in
parallel. In this paper, we present SCORES, a scalable and highly
parametric streams-based communication architecture for
inter-module communication in FPGA-based systems.
This communication architecture improves on
previous methods by providing increased application
specialization and heterogeneous module clock frequencies, as
well as providing a means for low latency communication and
data throughput guarantees.
Organizer/Moderator: H. Graeb, TU Munich, DE
Panelists: J. Cessna, G. Goelz, V. Meyer zu Bexten and E. Petrus
-
Analog Layout Synthesis - Recent Advances in Topological Approaches
[p. 274]
-
H. Graeb, F. Balasa, R. Castro-Lopez, Y.-W. Chang, F.V. Fernandez, P.-H. Lin and M. Strasser
This paper gives an overview of some recent advances in
topological approaches to analog layout synthesis and in layout-aware
analog sizing. The core issue in these approaches is the modeling of
layout constraints for an efficient exploration process. This includes fast
checking of constraint compliance, reducing the search space, and quickly
relating topological encodings to placements. Sequence-pairs, B*-trees,
circuit hierarchy and layout templates are described as advantageous
means to tackle these tasks.
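As a small illustration of relating a topological encoding to a placement, the sketch below decodes a sequence-pair into lower-left coordinates. It uses the standard sequence-pair semantics (a module that precedes another in both sequences lies to its left; a module that follows in the first sequence but precedes in the second lies below it) and a simple quadratic-time evaluation. The module names and sizes are made up, and analog-specific constraints such as symmetry are not handled.

```python
def place(gamma_pos, gamma_neg, sizes):
    """Decode a sequence-pair into lower-left (x, y) coordinates.

    gamma_pos, gamma_neg: the two module orderings of the sequence-pair.
    sizes: module name -> (width, height).
    """
    pos = {m: i for i, m in enumerate(gamma_pos)}
    x, y = {}, {}
    # Visiting modules in gamma_neg order guarantees that every module placed
    # so far precedes the current one in gamma_neg.
    for b in gamma_neg:
        # a before b in both sequences -> a is left of b
        x[b] = max((x[a] + sizes[a][0] for a in x if pos[a] < pos[b]), default=0)
        # a after b in gamma_pos, before b in gamma_neg -> a is below b
        y[b] = max((y[a] + sizes[a][1] for a in y if pos[a] > pos[b]), default=0)
    return x, y


sizes = {"a": (2, 1), "b": (3, 1), "c": (1, 2)}

# All three modules side by side.
row_x, row_y = place(["a", "b", "c"], ["a", "b", "c"], sizes)
print(row_x, row_y)

# Module a sits below module b; c is to the right of both.
stack_x, stack_y = place(["b", "a", "c"], ["a", "b", "c"], sizes)
print(stack_x, stack_y)
```

Swapping two modules in only one of the sequences changes their relation from horizontal to vertical, which is why such encodings are convenient for exploring placements by small moves.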
-
An Accurate Interconnect Thermal Model Using Equivalent Transmission Line Circuit
[p. 280]
-
B. Wang and P. Mazumder
This paper presents an accurate interconnect thermal
model for analyzing the temperature distribution of an
on-chip interconnect wire. The model addresses the ambient
temperatures and the heat transfer rates of the packaging
materials. Particularly, the model considers the effect of the interconnect
temperature gradients. The paper employs an equivalent
transmission line circuit to obtain the temperature distribution
solution from the model. Then an O(n) algorithm is introduced
to compute the interconnect temperatures. Experimental results
demonstrate the accuracy of the thermal model, by comparisons
with the computational fluid dynamics tool FLUENT.
-
Analogue Mixed Signal Simulation Using Spice and SystemC
[p. 284]
-
T. Kirchner, N. Bannow and C. Grimm
SystemC is a discrete event simulator that enables
the programmer to model complex designs with varying levels
of abstraction. In order to improve precision, it can be coupled
to more specialized simulators.
This article introduces the concept of loose simulator coupling
between an analogue simulator and SystemC.
It explains the properties and advantages which include a
higher simulation performance as well as a higher degree of
flexibility.
A design example in which SystemC will be connected to
SwitcherCad will demonstrate the benefits of loose coupling.
-
Reliability Aware through Silicon Via Planning for 3D Stacked ICs
[p. 288]
-
A. Shayan, X. Hu, H. Peng, C.-K. Cheng, W. Yu, M. Popovich, T. Toms and X. Chen
This work proposes reliability-aware through-silicon
via (TSV) planning for 3D stacked silicon integrated
circuits (ICs). The 3D power distribution network is modeled
and extracted in frequency domain which includes the impact
of skin effect. The worst case power noise of the 3D power
delivery networks (PDN) with local TSV failures resulting from
fabrication process or circuit operation is identified in both
frequency and time domain. From the experimental results, it is
observed that a single TSV failure could increase the maximum
voltage variation by up to 70%, which should be considered in
nanoscale ICs. The parameters of the 3D PDN are designed such
that the power distribution is reliable under local TSV failures.
The spatial distribution of the power noise, reliability and block
out area is analyzed to enhance the reliability of the 3D PDN
under local TSV failure.
-
A Study on Placement of Post Silicon Clock Tuning Buffers for Mitigating Impact of Process Variation
[p. 292]
-
K. Nagaraj and S. Kundu
Optical shrink for process migration, manufacturing
process variation, temperature and voltage changes lead
to clock skew as well as path delay variations in a
manufactured chip. Such variations end up degrading the
performance of manufactured chips. Since such
variations are hard to predict in the pre-silicon phase,
tunable clock buffers have been used in several designs.
These buffers are tuned to improve maximum operating
clock frequency of a design. Previously, we have
presented an algorithmic approach that uses delay
measurements on a few selected patterns to determine
which buffers should be targeted for tuning. In this paper,
a study on impact of tunable buffer placement on
performance is reported. The greatest benefit from tunable
buffer placement is observed when the clock tree is
designed by the proposed tuning system assuming random
delay perturbations during design. Accordingly, we
present a clock tree synthesis procedure that offers very
good protection against process variation, as borne out by
the results.
-
Analysis and Optimization of NBTI Induced Clock Skew in Gated Clock Trees
[p. 296]
-
A. Chakraborty, G. Ganesan, A. Rajaram and D.Z. Pan
NBTI (Negative Bias Temperature Instability) has emerged
as the dominant PMOS device failure mechanism for sub-100nm
VLSI designs. There is little research to quantify its
impact on skew of clock trees. This paper demonstrates a
mathematical framework to compute the impact of NBTI on
gated clock trees, considering their workload-dependent
temperature variation. Circuit design techniques are
proposed to deal with NBTI-induced clock skew by achieving
balance in the NBTI degradation of clock devices. Our technique
achieves up to 70% reduction in clock skew degradation with
minuscule (<0.1%) power and area penalty.
-
Bitstream Relocation with Local Clock Domains for Partially Reconfigurable FPGAs
[p. 300]
-
A. Flynn, A. Gordon-Ross and A.D. George
Partial Reconfiguration (PR) of FPGAs presents many
opportunities for application design flexibility, enabling tasks to
dynamically swap in and out of the FPGA without entire system
interruption. However, mapping a task to any available PR
region (PRR) requires a unique partial bitstream for each PRR.
This replication can introduce significant overheads in terms of
bitstream storage and communication requirements. Previous
research in partial bitstream relocation can alleviate these
overheads by transforming a single partial bitstream to map to
any available PRR. However, careful steps are necessary to
ensure proper functionality of relocated partial bitstreams and
may result in clock routing inefficiencies. These routing
inefficiencies can be alleviated by using regional clock resources
introduced in the Virtex-4 FPGAs to implement local clock
domains. PRRs can internally drive local clock domains, enabling
each PRR to vary its clock frequency with respect to a single
global clock signal, as opposed to sending multiple global clock
signals (one for each desired clock frequency) to each PRR. We
introduce this novel local clock domain (LCD) concept, which
provides enhanced PR design flexibility. However, integration of
LCDs and partial bitstream relocation introduces new challenges.
In this paper, we identify motivating application domains for this
integration, analyze integration benefits, and provide a detailed
integration methodology.
Keywords: partial reconfiguration; relocatable; local clock
-
Parallel Transistor Level Full-Chip Circuit Simulation
[p. 304]
-
H. Peng and C.-K. Cheng
In this paper, we present a fully parallel transistor
level full-chip circuit simulation tool with SPICE-accuracy
for general circuit designs. The proposed overlapping domain
decomposition approach partitions the circuit into a linear subdomain
and multiple non-linear subdomains based on circuit
non-linearity and connectivity. A parallel iterative matrix solver
is used to solve the linear domain, while the non-linear subdomains
are distributed topologically across different processors in parallel
and solved by a direct solver. To achieve maximum parallelism,
device model evaluation is also done in parallel. A parallel domain
decomposition technique is used to iteratively solve the different
partitions of the circuit and ensure convergence. Orders of
magnitude speedup over SPICE is observed for sets of large-scale
circuit designs on up to 64 processors.
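As a rough illustration of the overlapping domain decomposition idea, the sketch below applies a multiplicative Schwarz iteration to a small linear system. It is not the authors' simulator: the matrix, the two-way partition, and all function names are assumptions made for illustration, and the subdomains are solved one after another here rather than on separate processors.

```python
def solve_dense(a, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def schwarz_solve(a, b, subdomains, sweeps=50):
    """Multiplicative Schwarz: solve each overlapping subdomain directly,
    using the latest values of all other unknowns as boundary data."""
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        for idx in subdomains:
            inside = set(idx)
            sub_a = [[a[i][j] for j in idx] for i in idx]
            sub_b = [b[i] - sum(a[i][j] * x[j] for j in range(n) if j not in inside)
                     for i in idx]
            for i, v in zip(idx, solve_dense(sub_a, sub_b)):
                x[i] = v
    return x


# Hypothetical 8-node matrix (symmetric, tridiagonal, diagonally dominant).
N = 8
A = [[0.0] * N for _ in range(N)]
for i in range(N):
    A[i][i] = 2.5
    if i > 0:
        A[i][i - 1] = -1.0
    if i < N - 1:
        A[i][i + 1] = -1.0
B = [1.0] * N

# Two subdomains that overlap on nodes 3 and 4.
parts = [list(range(0, 5)), list(range(3, 8))]
x_dd = schwarz_solve(A, B, parts)
x_ref = solve_dense(A, B)
max_err = max(abs(p - q) for p, q in zip(x_dd, x_ref))
```

The overlap between the two index sets is what lets information propagate between partitions on each sweep; `max_err` compares the iterated result with a direct solve of the full system.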
-
Performance-Driven Dual-Rail Insertion for Chip-Level Pre-Fabricated Design
[p. 308]
-
F.-W. Chen and Y.-Y. Liu
In recent years, pre-fabricated design styles have grown rapidly
to amortize the mask cost. However, the interconnection delay
of the pre-fabricated design styles slows down the circuit
performance due to the high capacitive load. In this paper,
we propose a technique to insert dual-rail wires for pre-fabricated
design styles. Furthermore, we propose an effective
dual-rail insertion algorithm to reduce the routing area overheads
caused by the inserted dual-rail wires. Taking the wire
criticality, the delay significance, and the wire congestion into
consideration, our proposed algorithm is capable of trading
additional routing area overheads for the interconnection performance
improvement. The experimental results demonstrate
that our proposed algorithm reduces the interconnection delay
by 11.4% with 5.8% routing area overheads.
-
Simulation Framework for Early Phase Exploration of SDR Platforms: A Case Study of Platform
Dimensioning
[p. 312]
-
M. Trautmann, S. Mamagkakis, B. Bougard, J. Declerck, E. Umans, A. Dejonghe, L. Van der Perre
and F. Catthoor
Software Defined Radio (SDR) terminals are crucial
to enable seamless and transparent inter-working between fourth
generation wireless access systems or communication modes. In
the longer term, SDRs will be extended to become Cognitive
Radios enabling efficient spectrum usage. Future communication
modes will have heavy hardware resource requirements and
switching between them will introduce dynamism with respect
to the timing and size of resource requests. In this paper, we
propose a modeling framework that enables the simulation of
such complex, dynamic hardware/software SDR designs. Thus,
we can do an exploration, which can pinpoint the coarse grain
platform component requirements for future SDR applications
in a very early design phase. Our solution differs from existing
ones by combining multiple simulation granularities in a way that
is specialized for SDR simulation. Finally, we demonstrate the
effectiveness of our approach with a case study for dimensioning
the on-chip interconnect of a prospective SDR platform.
-
Fast and Accurate Protocol Specific Bus Modeling Using TLM 2.0
[p. 316]
-
B. van Moll, H. Corporaal, V. Reyes and M. Boonen
The need to have Transaction Level models early in
the design cycle is becoming more and more important to shorten
the development times of complex Systems-on-Chip (SoC). These
models need to be functional and timing accurate in order to
address different design use-cases during the SoC development.
However, the typical issue with Transaction Level Modeling
(TLM) techniques is the accuracy vs. simulation speed trade-off.
Models that can run at high simulation speeds are often modeled
at abstraction levels that make them unsuitable for use-cases
where timing accuracy is required. Similarly, most models that
are cycle accurate are inherently too slow (due to clock sensitive
processes) to be used in use-cases where high simulation speed is
key. This paper introduces a new methodology that enables the
creation of fast and cycle accurate protocol specific bus-based
communication models, based on the new TLM 2.0 standard
from the Open SystemC Initiative (OSCI).
-
Incorporating Graceful Degradation into Embedded System Design
[p. 320]
-
M. Glass, M. Lukasiewycz, C. Haubelt and J. Teich
In this work, the focus is put on the behavior of a system
in case a fault occurs that disables the system from executing
its applications. Instead of executing a random subset of the
applications depending on the fault, an approach is presented
that optimizes the system's structure and behavior with respect
to a possible graceful degradation. It includes a degradation-aware
reliability analysis that guides the optimization of the resource
allocation and function distribution, and provides data-structures
for an efficient online degradation algorithm. Thus,
the proposed methodology covers both the design phase with a
structural optimization and the online phase with a behavioral
optimization of the system. A case study shows the effectiveness
of the proposed approach.
-
Rewiring Using IRredundancy Removal and Addition
[p. 324]
-
C.-C. Lin and C.-Y. Wang
Redundancy Addition and Removal (RAR) is a
restructuring technique used in the synthesis and optimization of
logic designs. It can remove an existing target wire and add an
alternative wire in the circuit such that the functionality of the
circuit is intact. However, not every irredundant target wire can
be successfully removed due to some limitations. Thus, this paper
proposes a new restructuring technique, IRredundancy Removal
and Addition (IRRA), which successfully removes any desired
target wire by constructing a rectification network which exactly
corrects the error caused by removing the target wire.
Moderators: V. Mooney III, Georgia Institute of Technology, US; J. Henkel, Karlsruhe U, DE
-
Gate Replacement Techniques for Simultaneous Leakage and Aging Optimization
[p. 328]
-
Y. Wang, X. Chen, W. Wang, Y. Cao, Y. Xie and H. Yang
As technology scales, the aging effect caused by Negative
Bias Temperature Instability (NBTI) has become a major reliability
concern for circuit designers. On the other hand, reducing leakage
power remains one of the design goals. Because both NBTI-induced
circuit degradation and standby leakage power have a strong
dependency on the input vectors, the Input Vector Control (IVC) technique
may be adopted to mitigate leakage and NBTI. However, the IVC technique
is ineffective for larger circuits. Therefore, in this paper, we propose
two fast gate replacement algorithms together with optimal input vector
selection to simultaneously mitigate leakage power and NBTI induced
circuit degradation: Direct Gate Replacement (DGR) algorithm and
Divide and Conquer Based Gate Replacement (DCBGR) algorithm. Our
experimental results on 20 benchmark circuits at 65nm technology node
reveal that: 1) Both DGR and DCBGR algorithms outperform pure IVC
by about 20% on average for three different objective functions: leakage power
reduction only, NBTI mitigation only, and leakage/NBTI co-optimization.
2) The DCBGR algorithm leads to better optimization results and saves
on average 100X runtime compared with the DGR algorithm.
-
Enabling Concurrent Clock and Power Gating in an Industrial Design Flow
[p. 334]
-
L. Bolzani, A. Calimera, A. Macii, E. Macii and M. Poncino
Clock-gating and power-gating have proven to be
very effective solutions for reducing dynamic and static power,
respectively. The two techniques may be coupled in such a way
that the clock-gating information can be used to drive the control
signal of the power-gating circuitry, thus providing additional
leakage minimization conditions w.r.t. those manually inserted by
the designer. This conceptual integration, however, poses several
challenges when moved to industrial design flows. Although
both clock and power-gating are supported by most commercial
synthesis tools, their combined implementation requires some
flexibility in the back-end tools that is not currently available.
This paper presents a layout-oriented synthesis flow which
integrates the two techniques and that relies on leading-edge,
commercial EDA tools. Starting from a gated-clock netlist, we
partition the circuit into a number of clusters that are implicitly
determined by the groups of cells that are clock-gated by the
same register. Using a row-based granularity, we achieve runtime
leakage reduction by inserting dedicated sleep transistors
for each cluster. The entire flow has been benchmarked on
an industrial design mapped onto a commercial, 65nm CMOS
technology library.
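The clustering step described above can be pictured as grouping cells by the register that gates their clock. The sketch below is only illustrative: the dictionary-based netlist representation and the cell and register names are hypothetical, and an actual flow would extract this information from the gated-clock netlist with EDA tools.

```python
from collections import defaultdict


def cluster_by_gating_register(cells):
    """Group cell names by the register that gates their clock.

    `cells` maps a cell name to the name of its clock-gating register,
    or to None when the cell is not clock-gated.
    """
    clusters = defaultdict(list)
    for cell, gating_reg in cells.items():
        if gating_reg is not None:
            clusters[gating_reg].append(cell)
    return dict(clusters)


# Hypothetical netlist fragment.
netlist = {
    "ff_a0": "en_reg_A",
    "ff_a1": "en_reg_A",
    "ff_b0": "en_reg_B",
    "ff_free": None,  # ungated cell: stays outside every cluster
}
clusters = cluster_by_gating_register(netlist)
```

In this picture, each resulting cluster would be the unit that receives its own sleep transistor, driven by the same enable information that gates its clock.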
-
TRAM: A Tool for Temperature and Reliability Aware Memory Design
[p. 340]
-
A. Khajeh, A. Gupta, N. Dutt, F. Kurdahi, A. Eltawil, K. Khouri and M. Abadir
Memories are increasingly dominating Systems on
Chip (SoC) designs and thus contribute a large percentage of
the total system's power dissipation, area and reliability. In
this paper, we present a tool which captures the effects of
supply voltage Vdd and temperature on memory performance
and their interrelationships. We propose a Temperature- and
Reliability- Aware Memory Design (TRAM) approach which
allows designers to examine the effects of frequency, supply
voltage, power dissipation, and temperature on reliability in a
mutually interrelated manner. Our experimental results indicate
that thermal unaware estimation of probability of error can be
off by at least two orders of magnitude and up to five orders
of magnitude from the realistic, temperature-aware cases. We
also observed that thermal aware Vdd selection using TRAM can
reduce the total power dissipation by up to 2.5X while attaining
an identical predefined limit on errors.
Moderators: P. Manet, U Catholique de Louvain, BE; P. D'Abramo, Austriamicrosystems, AT
-
Aircraft Integration Real-Time Simulator Modeling with AADL for Architecture Tradeoffs
[p. 346]
-
J. Casteres and T. Ramaherirariny
In today's aircraft, system complexity increases are
making it particularly challenging for engineers to validate
systems architectures. To ease this burden, the integration test
rig, often known as the "iron bird" integration simulator, has
been developed, and allows testing of real systems in a simulated
environment.
The computing host platform and interface equipment used in
the integration simulator are evolving rapidly. The capability to
predict the performance of both the simulation application and
the infrastructure on which it runs is crucial in order to select
the proper architecture for the future test rigs.
This paper presents the results of an AADL development that
simulates the test rig simulator in order to predict its needs. We
illustrate the use of model based engineering techniques on a real
industrial application where we simulate the simulator in order
to architect its computing infrastructure.
Firstly, the simulation application model built with the AADL
language is presented. Secondly, the producer-consumer
paradigm is introduced and it is shown how it is used to model
the simulation infrastructure host platform. Thirdly, the time
reference used to abstract time in the simulation is presented.
And finally the capacity of the AADL simulation to match the
simulators currently used in our company is illustrated.
Keywords: modeling, real-time, simulation, AADL.
-
A Low-Cost SEE Mitigation Solution for Soft-Processors Embedded in Systems on
Programmable Chips
[p. 352]
-
M. Sonza Reorda, M. Violante, C. Meinhardt and R. Reis
The availability of multimillion Commercial-Off-The-Shelf
(COTS) Field Programmable Gate Arrays (FPGAs)
is now making possible the implementation on a single
device of complex systems embedding processor cores as
well as huge memories and ad-hoc hardware accelerators
exploiting the programmable logic (Systems on
Programmable Chip, or SoPCs). When deployed in
safety- or mission-critical applications, such as avionic- and
space-oriented ones, Single Event Effects (SEEs) affecting
COTS FPGA, which may have catastrophic effects if
neglected, have to be considered and SEE mitigation
techniques have to be employed. In this paper we explore
the adoption of known techniques (such as lockstep,
checkpointing and rollback recovery) for SEE mitigation
to processor cores embedded in SoPCs, and propose
their customization, specifically addressing the
characteristics of programmable devices. Since the
resulting design flow can easily be supported by
automation tools, its adoption is particularly suitable to
reduce the design and validation costs. Experimental
results show the effectiveness of the proposed approach
when compared to conventional TMR-based solutions.
-
Communication Minimization for In-Network Processing in Body Sensor Networks: A Buffer
Assignment Technique
[p. 358]
-
H. Ghasemzadeh, N. Jain, M. Sgroi and R. Jafari
Body sensor networks are emerging as a promising
platform for healthcare monitoring. These systems are composed
of battery-operated embedded devices which process physiological
data. The reduction in the power consumption is an important
factor to increase the lifetime for such systems and to enhance
their wearability through reducing the size of the battery. In
this paper, we develop an energy-efficient communication scheme
that uses buffers to reduce the number of transmissions among
the sensor nodes constrained to limited hardware resources. A
directed acyclic graph is used to model the information flow. We
define a communication optimization problem and solve it using
convex optimization techniques. We present results that support
the efficiency of the proposed technique.
-
A MEMS Reconfigurable Quad-Band Class-E Power Amplifier for GSM Standard
[p. 364]
-
L. Larcher, R. Brama, M. Ganzerli, J. Iannacci, M. Bedani and A. Gnudi
In this paper we present a reconfigurable Class-E
Power Amplifier (PA) whose operation frequency covers all uplink
bands of the GSM standard. We describe the circuit design
strategy to reconfigure PA operation frequency maximizing the
efficiency. Two dies, manufactured using CMOS and MEMS
technologies, are assembled through bondwires in a SiP fashion.
Prototypes deliver 20dBm output power with 38% and 26%
drain efficiencies at lower and upper bands, respectively. MEMS
technological issues degrading performance are also discussed.
-
Power Reduction of A 12-Bit 40-MS/s Pipeline ADC Exploiting Partial Amplifier Sharing
[p. 369]
-
J.A. Díaz-Madrid, H. Neubauer, H. Hauer, G. Doménech-Asensi and R. Ruiz-Merino
High performance analog-to-digital converters (ADC)
are essential elements for the development of high performance
image sensors. These circuits need a large number of ADCs to
reach the required resolution at a specified speed. Moreover,
power dissipation has become a key performance metric to be
considered in analog designs, especially in those developed for
portable devices. The design of such circuits is a challenging task
that requires a combination of the most advanced digital
circuits, analog expertise, and an iterative design process.
Amplifier sharing has been a commonly used technique to reduce
power dissipation in pipelined ADCs. In this paper we present a
partial amplifier sharing topology of a 12 bit pipeline ADC,
developed in 0.35μm CMOS process. Its performance is
compared with a conventional amplifier scaling topology and
with a full amplifier sharing one.
Keywords- ADC, pipeline, CMOS, low-power
Organizer: L. Le Toumelin, Texas Instruments, FR
Moderator: J. Cong, UCLA, US
Panelists: J. Cong, G. Clave, T. Makelainen, Z. Zhang, V. Kathail and J. Kunkel
-
-
HLS tools are becoming fashionable again. Is this second wave of HLS the one industry will surf on? There are a
lot of technical, strategic and human related questions behind this HLS comeback. In this session the panellists from
System, S/C, EDA companies and Universities will look into the following questions: have designs become so
complex that they cannot avoid higher level of abstraction? Have HLS tools become so much better? Can HLS
yield entitlement gate count? Is HLS a new market? What are the practical issues with HLS production flows?
Does HLS improve design predictability and efficiency? What are the next challenges?
Moderators: A. Rubio, UP Catalunya, ES; E.J. Marinissen, IMEC, BE
-
Analyzing the Impact of Process Variations on Parametric Measurements: Novel Models
and Applications
[p. 375]
-
S. Reda and S. Nassif
In this paper we propose a novel statistical framework
to model the impact of process variations on semiconductor
circuits through the use of process sensitive test structures. Based
on multivariate statistical assumptions, we propose the use of
the expectation-maximization algorithm to estimate any missing
test measurements and to calculate accurately the statistical
parameters of the underlying multivariate distribution. We also
propose novel techniques to validate our statistical assumptions
and to identify any outliers in the measurements. Using the
proposed model, we analyze the impact of the systematic and
random sources of process variations to reveal their spatial
structures. We utilize the proposed model to develop a novel
application that significantly reduces the volume, time, and
costs of the parametric test measurements procedure without
compromising its accuracy. We extensively verify our models and
results on measurements collected from more than 300 wafers
and over 25 thousand die fabricated at a state-of-the-art facility.
We prove the accuracy of our proposed statistical model and
demonstrate its applicability towards reducing the volume and
time of parametric test measurements by about 2.5 - 6.1X at
absolutely no impact to test quality.
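To make the role of expectation-maximization in filling missing test measurements more concrete, the sketch below runs EM on a bivariate Gaussian in which one of two correlated measurements is occasionally missing. This is a minimal illustration under assumed conditions (two variables, the first always observed, made-up data), not the multivariate procedure used in the paper; the function name `em_impute` is hypothetical.

```python
def em_impute(xs, ys, iterations=200):
    """EM for a bivariate Gaussian where some y values are missing (None).

    x is assumed fully observed. Returns the conditional-mean estimates
    for the missing y entries, in order of appearance.
    """
    n = len(xs)
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    # Initialise the y statistics from the complete cases only.
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    mx_o = sum(x for x, _ in obs) / len(obs)
    mu_y = sum(y for _, y in obs) / len(obs)
    s_yy = sum((y - mu_y) ** 2 for _, y in obs) / len(obs)
    s_xy = sum((x - mx_o) * (y - mu_y) for x, y in obs) / len(obs)

    for _ in range(iterations):
        # E-step: expected y and y^2 for every sample, given x.
        slope = s_xy / s_xx
        cond_var = s_yy - s_xy * slope
        e_y, e_yy = [], []
        for x, y in zip(xs, ys):
            if y is None:
                m = mu_y + slope * (x - mu_x)
                e_y.append(m)
                e_yy.append(m * m + cond_var)
            else:
                e_y.append(y)
                e_yy.append(y * y)
        # M-step: re-estimate the parameters that involve y.
        mu_y = sum(e_y) / n
        s_yy = sum(e_yy) / n - mu_y ** 2
        s_xy = sum(x * m for x, m in zip(xs, e_y)) / n - mu_x * mu_y

    slope = s_xy / s_xx
    return [mu_y + slope * (x - mu_x) for x, y in zip(xs, ys) if y is None]


# Made-up measurements: the second structure was not measured on the last die.
x_meas = [1.0, 2.0, 3.0, 4.0, 5.0]
y_meas = [2.0, 4.0, 6.0, 8.0, None]
imputed = em_impute(x_meas, y_meas)
```

Because the complete cases in this toy data lie exactly on the line y = 2x, the estimate for the missing entry converges toward 10. With real, noisy measurements the estimate would instead reflect the fitted correlation between the two test structures.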
-
On Linewidth-Based Yield Analysis for Nanometer Lithography
[p. 381]
-
A. Sreedhar and S. Kundu
Lithographic variability and its impact on printability
is a major concern in today's semiconductor manufacturing
process. To address sub-wavelength printability, a number of
resolution enhancement techniques (RET) have been used. While
RET techniques allow printing of sub-wavelength features, the
feature width itself becomes highly sensitive to process
parameters, which in turn detracts from yield due to small
perturbations in manufacturing parameters. Yield loss is a
function of random variables such as depth-of-focus and
exposure dose. In this paper, we present a first order canonical
dose/focus model that takes into account both the correlated and
independent randomness of the effects of lithographic variation.
A novel tile-based yield estimation technique for a given layout,
based on a statistical model for process variability is presented.
Another novel contribution of this paper is the computation of
global and local line-yield probabilities. The key issues addressed
in this paper are (i) layout error modeling, (ii) avoidance of mask
simulation for chip layouts, (iii) avoidance of full Monte-Carlo
simulation for variational lithography modeling, (iv) building a
methodology for yield estimation based on existing commercial
tools. Numerical results based on our approach are shown for
45nm ISCAS85 layouts.
Keywords-Photolithography, depth-of-focus, exposure dose,
focus-exposure matrix (FEM), chemical mechanical polishing,
stratified sampling, linewidth-based yield.
-
Impact of Voltage Scaling on Nanoscale SRAM Reliability
[p. 387]
-
V. Chandra and R. Aitken
Low voltage SRAMs are critical for power constrained
designs. Currently, the choice of supply voltage in
SRAMs is governed by bit cell read static noise margin, writability,
data retention etc. However, in the nanometer technology
nodes, the choice of supply voltage impacts the reliability of
SRAMs as well. Two important reliability challenges for current
and future generation SRAMs are gate oxide degradation and
soft error susceptibility. The current generation transistors have
ultra-thin gate oxides to improve the device performance and
they are prone to breakdown due to higher level of electric field
stress. In addition, the soft error susceptibility of SRAMs has
significantly increased in the nanometer regime. In this work,
we have quantified the impact of voltage scaling on the soft
error susceptibility of gate oxide degraded SRAMs. We show that
when gate oxide degradation is taken into account, there exists an
optimal voltage (Vopt) at which the bit cell Qcrit is maximized.
Further, we show that both Vopt and Qcritmax are a function of
the level of oxide degradation. Finally, we investigate the impact
of technology node scaling and analyze the trend of Vopt and
Qcritmax. As the technology node shrinks to sub-45nm, both Vopt
and Qcritmax decrease sharply, thus significantly decreasing the
reliability of SRAMs.
Moderators: S. Yoo, POSTECH (Pohang U of Science and Technology), KR; A. Jerraya, CEA, FR
-
A File-System-Aware FTL Design for Flash-Memory Storage Systems
[p. 393]
-
P.-L. Wu, Y.-H. Chang and T.-W. Kuo
As flash memory has become popular across various platforms,
there is strong demand for solutions to the performance degradation
problem caused by the special characteristics of flash memory. This
research proposes the design of a file-system-aware flash translation
layer, in which a filter mechanism is designed to separate the
access requests of file-system metadata and file contents for better
performance. A recovery scheme is then proposed to maintain the
integrity of a file system. The proposed flash translation layer is
implemented as a Linux device driver and evaluated with respect to
ext2 and ext3 file systems. The experimental results show significant
performance improvement over ext2 and ext3 file systems with
limited system overheads.
-
FSAF: File System Aware Flash Translation Layer for NAND Flash Memories
[p. 399]
-
S.K. Mylavarapu, S. Choudhuri, A. Shrivastava, J. Lee and A. Givargis
NAND Flash Memories require Garbage Collection (GC)
and Wear Leveling (WL) operations to be carried out by
Flash Translation Layers (FTLs) that oversee flash
management. Owing to expensive erasures and data
copying, these two operations essentially determine
application response times. Since file systems do not share
any file deletion information with FTL, dead data is treated
as valid by FTL, resulting in significant WL and GC
overheads. In this work, we propose a novel method to
dynamically interpret and treat dead data at the FTL level
so as to reduce above overheads and improve application
response times, without necessitating any changes to
existing file systems. We demonstrate that our resource-efficient
approach can improve application response times
and memory write access times by 22% and reduce
erasures by 21.6% on average.
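The effect of dead data on garbage collection can be sketched with a toy cost count. The example below is an assumption-laden illustration, not the FSAF mechanism itself: it does not model how the FTL infers deletions, and the block contents, page numbers, and function name are hypothetical.

```python
def gc_copy_cost(block_pages, dead_pages):
    """Count pages that garbage collection must copy out of a victim block.

    `block_pages` lists the logical page numbers stored in the block;
    `dead_pages` is the set of logical pages the FTL knows belong to
    deleted files. Only live pages need copying before the erase.
    """
    return sum(1 for page in block_pages if page not in dead_pages)


victim_block = [10, 11, 12, 13, 14, 15, 16, 17]  # hypothetical contents
deleted_by_fs = {12, 13, 14, 15}                 # pages of a deleted file

# Without deletion information, every page looks valid to the FTL.
copies_unaware = gc_copy_cost(victim_block, dead_pages=set())
# With deletion information, dead pages are skipped.
copies_aware = gc_copy_cost(victim_block, dead_pages=deleted_by_fs)
```

In this toy case the aware FTL copies half as many pages before erasing the block, which is the kind of saving in copy and erase work that the abstract attributes to treating dead data at the FTL level.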
-
A Set-Based Mapping Strategy for Flash-Memory Reliability Enhancement
[p. 405]
-
Y.-S. Chu, J.-W. Hsieh, Y.-H. Chang and T.-W. Kuo
With wide applicability of flash memory in various
application domains, reliability has become a very critical issue.
This research is motivated by the needs to resolve the lifetime
problem of flash memory and a strong demand in turning
thrown-away flash-memory chips into downgraded products.
We propose a set-based mapping strategy with an effective
implementation and low resource requirements, e.g., SRAM.
A configurable management design and wear-leveling issue are
considered. The behavior of the proposed method is also analyzed
with respect to popular implementations in the industry. We show
that the endurance of flash memory can be significantly improved
by a series of experiments over a realistic trace. Our experiments
show that the read performance is also substantially improved.
Moderators: M. Poncino, Politecnico di Torino, IT; J. Haid, Infineon Technologies, AT
-
Energy Efficient Multiprocessor Task Scheduling under Input-Dependent Variation
[p. 411]
-
J. Cong and K. Gururaj
In this paper, we propose a novel, energy aware
scheduling algorithm for applications running on DVS-enabled
multiprocessor systems, which exploits variation in execution times
of individual tasks. In particular, our algorithm takes into account
latency and resource constraints, precedence constraints among
tasks and input-dependent variation in execution times of tasks to
produce a scheduling solution and voltage assignment such that the
average energy consumption is minimized. Our algorithm is based
on a mathematical programming formulation of the scheduling and
voltage assignment problem and runs in polynomial time.
Experiments with randomly generated task graphs show that up to
30% savings in energy can be obtained by using our algorithm over
existing techniques. We perform experiments on two real-world
applications - MPEG-4 decoder and MJPEG encoder. Simulations
show that the scheduling solution generated by our algorithm can
provide up to 25% reduction in energy consumption over greedy
dynamic slack reclamation algorithms.
Index Terms - DVS, scheduling, average energy consumption,
precedence constraints, convex optimization
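The reason slack in execution time translates into energy savings on a DVS-enabled processor can be shown with a small worked example. The sketch below uses a common simplified model (dynamic energy proportional to V squared per cycle, frequency proportional to voltage); this model, the task parameters, and the function names are assumptions for illustration and are not taken from the paper's formulation.

```python
def energy(cycles, voltage, c_eff=1.0):
    """Dynamic energy under the simplified model E = C_eff * V^2 * cycles."""
    return c_eff * voltage ** 2 * cycles


def scaled_voltage(cycles, deadline, f_max, v_max):
    """Lowest voltage meeting the deadline, assuming frequency scales
    linearly with voltage (f = f_max * V / v_max)."""
    f_needed = cycles / deadline
    return v_max * min(1.0, f_needed / f_max)


# Hypothetical task: 4e6 cycles, 10 ms deadline, 800 MHz at 1.2 V.
cycles, deadline, f_max, v_max = 4e6, 10e-3, 800e6, 1.2
v = scaled_voltage(cycles, deadline, f_max, v_max)
saving = 1.0 - energy(cycles, v) / energy(cycles, v_max)
```

Here the task needs only half of the maximum frequency, so the voltage can be halved and the modeled energy drops to a quarter. When execution times vary with the input, the amount of usable slack varies too, which is why a schedule optimized for average energy can differ from one optimized for the worst case.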
-
Program Phase and Runtime Distribution-Aware Online DVFS for Combined Vdd/Vbb Scaling
[p. 417]
-
J. Kim, S. Yoo and C.-M. Kyung
Complex software programs are mostly characterized
by phase behavior and runtime distributions. Due to the
dynamism of the two characteristics, it is not efficient to make
workload predictions during design-time. In our work, we present
a novel online DVFS method that exploits both phase behavior
and runtime distribution during runtime in combined Vdd/Vbb
scaling. The presented method performs a bi-modal analysis
of runtime distribution, and then a runtime distribution-aware
workload prediction based on the analysis. In order to minimize
the runtime overhead of the sophisticated workload prediction
method, it performs table lookups to the pre-characterized data
during runtime without compromising the quality of energy
reduction. It also offers a new concept of program phase suitable
for DVFS. Experiments show the effectiveness of the presented
method in the case of H.264 decoder with two sets of long-term
scenarios consisting of total 4655 frames. It offers 6.6% ~ 33.5%
reduction in energy consumption compared with existing offline
and online solutions.
-
ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design
Space Exploration
[p. 423]
-
A.B. Kahng, B. Li, L.-S. Peh and K. Samadi
As industry moves towards many-core chips, networks-on-chip
(NoCs) are emerging as the scalable fabric for interconnecting
the cores. With power now the first-order design constraint, early-stage
estimation of NoC power has become crucially important.
ORION [29] was amongst the first NoC power models released,
and has since been fairly widely used for early-stage power estimation
of NoCs. However, when validated against recent NoC
prototypes - the Intel 80-core Teraflops chip and the Intel Scalable
Communications Core (SCC) chip - we saw significant deviation
that can lead to erroneous NoC design choices. This
prompted our development of ORION 2.0, an extensive enhancement
of the original ORION models which includes completely
new subcomponent power models, area models, as well as improved
and updated technology models. Validation against the
two Intel chips confirms a substantial improvement in accuracy
over the original ORION. A case study with these power models
plugged within the COSI-OCC NoC design space exploration
tool [23] confirms the need for, and value of, accurate early-stage
NoC power estimation. To ensure the longevity of ORION 2.0,
we will be releasing it wrapped within a semi-automated flow that
automatically updates its models as new technology files become
available.
Organizer: P. Parrish, Sun Microsystems, US
Moderator: S. Mehta, Sun Microsystems, US
Panelists: J. Abraham, R. Goldman and J. McLean
-
-
You've heard about open source software, but what is open source hardware? Come hear experts from across the
industry and in academia discuss the new face of open source, Hardware IP:
- What is it?
- How are companies and academics using it?
- Can Open Source Hardware significantly change the design world?
Organizer: L. Anghel, TIMA Laboratory, FR
Moderator: G. Smith, US
SoC development requires interaction between a wide range of engineering disciplines, each of which brings in
optimisation factors that impact other disciplines. Therefore, concurrent development and end-to-end planning
between these disciplines are necessary. This session will show the overlap between design, packaging, silicon
manufacturing, test and yield optimisation.
Organizer/Moderator: S. Fujita, Toshiba, JP
-
Nano-electronics Challenge - Chip Designers Meet Real Nano-Electronics in 2010s?
[p. 431]
-
S. Fujita
During 1990s, silicon-based CMOS made steady
advancement with miniaturization and with lower power
consumption by incorporating the scaling effect and expanded
its share by invading the region of bipolar transistors and
compound semiconductors market. On the other hand, new
semiconductor application technologies grew rapidly one after
the other in conjunction with the development of silicon
CMOS technologies. Such developments included the
microprocessor for PC, server and router chipsets for internet
application, RF for cellular phones, analogue circuitry, base
band processors, and wireless LAN technologies. Also in
memory areas, flash memory technology was introduced
into the market, as were FeRAM, MRAM, and PRAM
technologies based on new principles.
-
MTJ-Based Nonvolatile Logic-in-Memory Circuit, Future Prospects and Issues
[p. 433]
-
S. Matsunaga, J. Hayakawa, S. Ikeda, K. Miura, T. Endoh, H. Ohno and T. Hanyu
Non-volatile logic-in-memory architecture, where
nonvolatile memory elements are distributed over a logic-circuit
plane, is expected to realize both ultra-low-power and reduced
interconnection delay. This paper presents novel non-volatile
logic circuits based on logic-in-memory architecture using
magnetic tunnel junctions (MTJs) in combination with MOS
transistors. Since the MTJ with a spin-injection write capability
is the only device that has all of the following superior features,
namely a large resistance ratio, virtually unlimited endurance, fast
read/write accessibility, scalability, complementary MOS
(CMOS)-process compatibility, and nonvolatility, it is very well suited
to implementing the MOS/MTJ-hybrid logic circuit with logic-in-memory
architecture. A concrete nonvolatile logic-in-memory
circuit is designed and fabricated using a 0.18 μm CMOS/MTJ
process, and its future prospects and issues are discussed.
Keywords-nonvolatile; logic-in-memory; MTJ; standby-power-free;
quick sleep/wake-up
-
Imperfection-Immune VLSI Logic Circuits Using Carbon Nanotube Field Effect Transistors
[p. 436]
-
S. Mitra, J. Zhang, N. Patil and H. Wei
Carbon Nanotube Field-Effect Transistors (CNFETs)
show great promise as extensions to silicon-CMOS because: 1)
Ideal CNFETs can provide significant energy and
performance benefits over silicon-CMOS, and 2) CNFET
processing is compatible with existing silicon-CMOS
processing. However, future gigascale systems cannot rely
solely on existing chemical synthesis for guaranteed ideal
devices. VLSI-scale logic circuits using CNFETs must
overcome major challenges posed by: 1) Misaligned and mis-positioned
Carbon Nanotubes (CNTs); 2) Metallic CNTs; and,
3) CNT density variations. This paper performs detailed
analysis of the impact of these challenges on CNFET circuit
performance. A combination of design and processing
techniques, presented in this paper, can enable VLSI-scale
CNFET logic circuits that are immune to high rates of
inherent imperfections. These techniques are inexpensive
compared to traditional defect- and fault-tolerance, do not
impose major changes in VLSI design flows, and are
compatible with VLSI processing because they do not require
special customization on a chip-by-chip basis.
-
Reconfigurable Circuit Design with Nanomaterials
[p. 442]
-
C. Dong, S. Chilstedt and D. Chen
It is generally acknowledged that nanoelectronics will
eventually replace traditional silicon CMOS in high-performance
integrated circuits. To that end, considerable investments are
being made in the research and development of new
nanoelectronic devices and fabrication techniques. When these
technologies mature, they can be used to create the next
generation of electronic systems. Given the intrinsic properties of
nanomaterials, such systems are likely to deviate considerably
from their predecessors. In this paper, we compare two potential
architectures for the design of nanoelectronic FPGAs. By
evaluating the performance of nanoelectronic devices at the
systems level, we aim to provide insights into how they can be
used effectively.
Keywords-FPGAs; nano-architecture; nanoelectronics; carbon
nanotube devices
Moderators: J. Quevremont, Thales, FR; L. Torres, LIRMM, Montpellier U/CNRS, FR
-
An Architecture for Secure Software Defined Radio
[p. 448]
-
C. Li, A. Raghunathan and N.K. Jha
Software defined radio (SDR) is a rapidly evolving
technology which implements some functional modules of a radio
system in software executing on a programmable processor.
SDR provides a flexible mechanism to reconfigure the radio,
enabling networked devices to easily adapt to user preferences
and the operating environment. However, the very mechanisms
that provide the ability to reconfigure the radio through software
also give rise to serious security concerns such as unauthorized
modification of the software, leading to radio malfunction and
interference with other users' communications. Both the SDR
device and the network need to be protected from such malicious
radio reconfiguration.
In this paper, we propose a new architecture to protect SDR
devices from malicious reconfiguration. The proposed architecture
is based on robust separation of the radio operation
environment and user application environment through the use
of virtualization. A secure radio middleware layer is used to
intercept all attempts to reconfigure the radio, and a security
policy monitor checks the target configuration against security
policies that represent the interests of various parties. Therefore,
secure reconfiguration can be ensured in the radio operation
environment even if the operating system in the user application
environment is compromised. We have prototyped the proposed
secure SDR architecture using VMware and the GNU Radio
toolkit, and demonstrate that the overheads incurred by the
architecture are small and tolerable. Therefore, we believe that
the proposed solution could be applied to address SDR security
concerns in a wide range of both general-purpose and embedded
computing systems.
-
Optimizing the HW/SW Boundary of an ECC SoC Design Using Control Hierarchy and
Distributed Storage
[p. 454]
-
X. Guo and P. Schaumont
Hardware/Software codesign of Elliptic Curve Cryptography
has been extensively studied in recent years. However,
most of these designs have focused on the computational aspect
of the ECC hardware, and not on the system integration into
a SoC architecture. We study the impact of the communication
link between CPU and coprocessor hardware for a typical ECC
design, and demonstrate that the SoC may become performance-limited
due to coprocessor data- and instruction-transfers. A dual
strategy is proposed to remove the bottleneck: introduction of
local control as well as local storage in the coprocessor. We quantify
the impact of this strategy on a prototype implementation
for Field Programmable Gate Arrays (FPGAs) and measure an
average speed-up in the resulting design of 9.4 times over the
baseline ECC system, while the resulting system area increases
by a factor of 1.6. The optimal area-time product improvement
of our ECC coprocessor is 4.3 times compared to that of the
baseline ECC coprocessor. Using design space exploration of a
large number of system configurations using the latest FPGA
technology and tools, we show that the optimal choice of ECC
coprocessor parameters is strongly dependent on the efficiency
of system-level communication.
-
Hardware Aging-Based Software Metering
[p. 460]
-
F. Dabiri and M. Potkonjak
Reliable and verifiable hardware, software and content
usage metering (HSCM) are of primary importance for
wide segments of e-commerce including intellectual property and
digital rights management. We have developed the first HSCM
technique that employs intrinsic aging properties of components
in modern and pending integrated circuits (ICs) to create the first
self-enforceable HSCM approach. There is a variety of hardware
aging techniques that range from electro-migration in wires to
slow-down of crystal-based clocks. We focus on transistor aging
due to negative bias temperature instability (NBTI) effects where
the delay of gates increases proportionally to usage times.
We address the problem of how we can measure the amount
of time a particular licensed software (LS) is used by designing
an aging circuitry and exposing it to the unique inputs associated
with each LS. If a particular LS is used longer than specified,
it automatically disables itself. Our novel HSCM technique uses
a multi-stage optimization problem of computing the delays of
gates, their aging degradation factors, and finally LS usage using
convex programming. The experimental results show not just
viability of the technique but also surprisingly high accuracy in
the presence of measurement noise and imperfect aging models.
HSCM can be used for many other business and engineering
applications such as power minimization, software evaluation,
and processor design.
Moderators: D. Sciuto, Politecnico di Milano, IT; M. Lajolo, NEC Laboratories, US
-
On-Chip Communication Architecture Exploration for Processor-Pool-Based MPSoC
[p. 466]
-
Y.-P. Joo, S. Kim and S. Ha
MPSoC is evolving towards processor-pool (PP)-based
architectures, which employ hierarchical on-chip network for
inter- and intra-PP communication. Since the design space of PP-based
MPSoC is extremely wide, application-specific
optimization of on-chip communication is a nontrivial task. This
paper presents a systematic methodology for on-chip network
design of PP-based MPSoC. The proposed approach allows
independent configurations of PPs, which leads to more efficient
solutions than previous work. Since time-consuming simulation is
inevitable to evaluate complicated on-chip network during
exploration, we do early pruning of design space by a bandwidth
analysis technique that considers task execution dependencies.
Our approach yields the Pareto-optimal solutions between clock
frequency and area requirements. The experiments show that the
proposed technique finds more efficient architectures compared
with the previous approaches.
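
The Pareto-optimality criterion mentioned above can be illustrated with a minimal sketch. It assumes that each candidate architecture is summarized by two numbers to be minimized (a required clock frequency and an area); the `Design` tuple, the units, and the function name are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# (required clock frequency, area); units are illustrative assumptions.
Design = Tuple[float, float]


def pareto_front(designs: List[Design]) -> List[Design]:
    """Return the designs that no other design dominates.

    A design o dominates d when o is no worse in both objectives
    and differs from d (so it is strictly better in at least one).
    """
    front = []
    for d in designs:
        dominated = any(
            o[0] <= d[0] and o[1] <= d[1] and o != d
            for o in designs
        )
        if not dominated:
            front.append(d)
    # Remove duplicates and sort by clock frequency for readability.
    return sorted(set(front))


candidates = [(200, 5.0), (150, 6.0), (250, 4.0), (220, 5.5), (150, 7.0)]
front = pareto_front(candidates)
```

Here `(220, 5.5)` and `(150, 7.0)` drop out because another candidate is at least as good in both objectives.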
-
Combined System Synthesis and Communication Architecture Exploration for MPSoCs
[p. 472]
-
M. Lukasiewycz, M. Streubuehr, M. Glass, C. Haubelt and J. Teich
In this paper, a novel design space exploration approach
is proposed that enables a concurrent optimization of the
topology, the process binding, and the communication routing
of a system. Given an application model written in
SystemC TLM 2.0, the proposed approach performs a fully
automatic optimization by a simultaneous resource allocation,
task binding, data mapping, and transaction routing
for MPSoC platforms. To cope with the huge complexity of
the design space, a transformation of the transaction level
model to a graph-based model and symbolic representation
that allows multi-objective optimization is presented. Results
from optimizing a Motion-JPEG decoder illustrate the
effectiveness of the proposed approach.
-
UMTS MPSoC Design Evaluation Using a System Level Design Framework
[p. 478]
-
D. Densmore, A. Simalatsar, A. Davare, R. Passerone and A. Sangiovanni-Vincentelli
Rapid design space exploration with accurate models is necessary
to improve designer productivity at the electronic system level. We
describe how to use a new event-based design framework, Metro II,
to carry out simulation and design space exploration of multi-core
architectures. We illustrate the design methodology on a UMTS
data link layer design case study with both a timed and untimed
functional model as well as a complete set of MPSoC architectural
services. We compare different architectures (including RTOSes)
explored with Metro II and quantify the associated simulation overhead.
Moderators: P. Harrod, ARM, UK; G. Dinatale, LIRMM, FR
-
Fault-Tolerant Average Execution Time Optimization for General-Purpose Multi-Processor
System-on-Chips
[p. 484]
-
M. Vayrynen, V. Singh and E. Larsson
Due to semiconductor technology development, fault-tolerance
is important not only for safety-critical systems
but also for general-purpose (non-safety-critical) systems.
However, instead of guaranteeing that deadlines are always
met, for general-purpose systems it is important to
minimize the average execution time (AET) while ensuring
fault-tolerance. For a given job and a soft (transient) error
probability, we define mathematical formulas for AET that
include bus communication overhead for both voting (active
replication) and rollback-recovery with checkpointing
(RRC). And, for a given multi-processor system-on-chip
(MPSoC), we define integer linear programming (ILP)
models that minimize AET including bus communication
overhead when: (1) selecting the number of checkpoints
when using RRC, (2) finding the number of processors
and job-to-processor assignment when using voting, and
(3) defining fault-tolerance scheme (voting or RRC) per
job and defining its usage for each job. Experiments demonstrate
significant savings in AET.
-
Improving Yield and Reliability of Chip Multiprocessors
[p. 490]
-
A. Pan, O. Khan and S. Kundu
An increasing number of hardware failures can be
attributed to device reliability problems that cause partial system
failure or shutdown. In this paper we propose a scheme for
improving reliability of a homogeneous chip multiprocessor
(CMP) that also serves to improve manufacturing yield. Our
solution centers on exploiting the natural redundancy that
already exists in multi-core systems by using services from other
cores for functional units that are defective in a faulty core. A
micro-architectural modification allows a core on a CMP to use
another core as a coprocessor to service any instruction that the
former cannot execute correctly. This service is accessed to
improve yield and reliability, but at the cost of some loss of
performance. In order to quantify this loss we have used a cycle-accurate
simulator to simulate the performance of a dual-core
system with one or two cores sustaining partial failure. Our
results indicate that when a large and sparingly-used unit such as
a floating point arithmetic unit fails in a core, even for a floating
point intensive benchmark, we can continue to run each faulty
core with help from companion cores with as little as 10% impact
to performance and less than 1% area overhead.
Keywords- yield; reliability; microarchitecture; multiprocessors
-
A Unified Online Fault Detection Scheme Via Checking of Stability Violation
[p. 496]
-
G. Yan, Y. Han and X. Li
In ultra-deep submicron technology, two of the paramount reliability
concerns are soft errors and device aging. Although intensive studies
have been done to face the two challenges, most have so far treated them
separately, thereby failing to reach better performance-cost tradeoffs. To support
a more efficient design tradeoff, we present a new fault model, Stability
Violation, derived from analysis of signal behavior. Furthermore,
we propose a unified fault detection scheme - Stability Violation based
Fault Detection (SVFD), by which the soft errors (both Single Event
Upset and Single Event Transient), aging delay, and delay faults can
be uniformly handled. SVFD can greatly facilitate soft error-resistant
and aging-aware designs. SVFD is validated by conducting a set of
intensive Hspice simulations targeting 65nm CMOS technology. Experimental
results show that SVFD has more robust capability for fault
detection than previous schemes at comparable overhead in terms of
area, power, and performance.
-
Statistical Fault Injection: Quantified Error and Confidence
[p. 502]
-
R. Leveugle, A. Calvez, P. Maistri and P. Vanhauwaert
Fault injection has become a very classical method to
determine the dependability of an integrated system with respect
to soft errors. Due to the huge number of possible error
configurations in complex circuits, a random selection of a subset
of potential errors is usual in practical experiments. The main
limitation of such a selection is the confidence in the outcomes
that is never quantified in the articles. This paper proposes an
approach to quantify both the error on the presented results and
the confidence on the presented interval. The computation of the
required number of faults to inject in order to achieve a given
confidence and error interval is also discussed. Experimental
results are shown and fully support the presented approach.
Keywords-dependability analysis, statistical fault injection
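
As an illustration of how a required number of injections can follow from a chosen error margin and confidence level, the sketch below uses the standard finite-population sample-size formula from sampling theory. The abstract does not state its formula, so treating this one as representative is an assumption; the function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def faults_to_inject(population: int, margin: float,
                     confidence: float, p: float = 0.5) -> int:
    """Sample size for estimating a proportion in a finite population.

    population: total number of possible faults (assumed known)
    margin:     tolerated absolute error on the estimated proportion
    confidence: two-sided confidence level, e.g. 0.95
    p:          assumed proportion; 0.5 is the most conservative choice
    """
    # Two-sided normal quantile (about 1.96 for 95% confidence).
    t = NormalDist().inv_cdf((1.0 + confidence) / 2.0)
    n = population / (
        1.0 + margin ** 2 * (population - 1) / (t ** 2 * p * (1.0 - p))
    )
    return math.ceil(n)


n_faults = faults_to_inject(1_000_000, 0.01, 0.95)
```

Under these assumptions the required sample is a small fraction of the fault space and grows only slowly with the population size.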
Moderators: P. Pop, TU Denmark, DK; P. Eles, Linkoping U, SE
-
KAST: K-Associative Sector Translation for NAND Flash Memory in Real-Time Systems
[p. 507]
-
H. Cho, D. Shin and Y.I. Eom
Flash memory is a good candidate for the storage device
in real-time systems due to its non-fluctuating performance,
low power consumption and high shock resistance. However, the
garbage collection for invalid pages in flash memory can invoke
a long blocking time. Moreover, the worst-case blocking time
is significantly long compared to the best-case blocking time
under the current flash management techniques. In this paper,
we propose a novel flash translation layer (FTL), called KAST,
where the user can configure the maximum log block associativity
to control the worst-case blocking time. Performance evaluation
using simulations shows that the overall performance of KAST is
better than that of current FTL schemes, and that KAST guarantees
that the longest blocking time is shorter than the specified value.
-
White Box Performance Analysis Considering Static Non-Preemptive Software Scheduling
[p. 513]
-
A. Viehl, M. Pressler, O. Bringmann and W. Rosenstiel
In this paper, a novel approach for integrating static
non-preemptive software scheduling in formal bottom-up performance
evaluation of embedded system models is described. The
presented analysis methodology uses a functional SystemC implementation
of communicating processes as input. Necessary model
extensions towards capturing of static non-preemptive scheduling
are introduced and the integration of the software scheduling in
the formal analysis process is explained. The applicability of the
approach in an automated design flow is presented using a SystemC
model of a JPEG encoder.
-
Application Specific Performance Indicators for Quantitative Evaluation of the Timing Behavior
for Embedded Real-Time Systems
[p. 519]
-
F. Koenig, D. Boers, F. Slomka, U. Margull, M. Niemetz and G. Wirrer
In the design and development of embedded real-time
systems the aspect of timing behavior plays a central role.
Especially, the evaluation of different scheduling approaches,
algorithms and configurations is one of the elementary preconditions
for creating not only reliable but also efficient systems - a
key for success in industrial mass production. This is becoming
even more important as multi-core systems are more and more
penetrating the world of embedded systems together with the
large (and growing) variety of scheduling policies available for
such systems. In this work simple mathematical concepts are used
to define performance indicators that quantify the benefit
of different solutions of the scheduling challenge for a given
application. As a sample application some aspects of analyzing
the dynamic behavior of a combustion engine management
system for the automotive domain are shown. However, the
described approach is flexible in order to support the specific
optimization needs arising from the timing requirements defined
by the application domain and can be used with simulation data
as well as target system measurements.
-
Response-Time Analysis of Arbitrarily Activated Tasks in Multiprocessor Systems with
Shared Resources
[p. 524]
-
M. Negrean, S. Schliecker and R. Ernst
As multiprocessor systems are increasingly used in
real-time environments, scheduling and synchronization analysis
of these platforms receive growing attention. However, most
known schedulability tests lack a general applicability. Common
constraints are a periodic or sporadic task activation pattern,
with deadlines no larger than the period, or no support for shared
resource arbitration, which is frequently required for embedded
systems. In this paper, we address these constraints and present a
general analysis which allows the calculation of response times for
fixed priority task sets with arbitrary activations and deadlines
in a partitioned multiprocessor system with shared resources.
Furthermore, we derive an improved bound on the blocking
time in this setup for the case where the shared resources
are protected according to the Multiprocessor Priority Ceiling
Protocol (MPCP).
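
For orientation, the sketch below shows the classical fixed-priority response-time iteration with a blocking term, for periodic tasks on one processor with deadlines no larger than periods. This is the restricted setting that the abstract generalizes, not the analysis proposed in the paper; the task tuple layout and the function name are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (C, T, D, B): worst-case execution time, period, deadline, blocking time.
# Tasks are listed from highest to lowest priority.
Task = Tuple[int, int, int, int]


def response_time(tasks: List[Task], i: int) -> Optional[int]:
    """Iterate R = C_i + B_i + sum_j ceil(R / T_j) * C_j to a fixed point.

    Returns the worst-case response time of task i, or None if the
    iteration exceeds the deadline (task deemed unschedulable).
    """
    c, _, d, b = tasks[i]
    r = c + b
    while True:
        # Interference from all higher-priority tasks; integer ceiling.
        interference = sum(-(-r // tj) * cj for cj, tj, _, _ in tasks[:i])
        r_next = c + b + interference
        if r_next > d:
            return None
        if r_next == r:
            return r
        r = r_next


task_set = [(1, 4, 4, 0), (2, 6, 6, 1), (3, 12, 12, 0)]
results = [response_time(task_set, i) for i in range(len(task_set))]
```

The blocking term `B` is where a bound such as the one derived for MPCP would enter; here it is simply given as an input.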
Moderators: T. Austin, U of Michigan, US; C. Kozyrakis, Stanford U, US
-
Light NUCA: A Proposal for Bridging the Inter-Cache Latency Gap
[p. 530]
-
D. Suarez, T. Monreal, F. Vallejo, R. Beivide and V. Vinals
To deal with the "memory wall" problem, microprocessors
include large secondary on-chip caches. But as these
caches grow, they create a new latency gap between them
and fast L1 caches (inter-cache latency gap). Recently, Non-Uniform
Cache Architectures (NUCAs) have been proposed
to sustain the size growth trend of secondary caches that is
threatened by wire-delay problems. NUCAs are size-oriented,
and they were not conceived to close the inter-cache latency gap.
To tackle this problem, we propose Light NUCAs (L-NUCAs)
leveraging on-chip wire density to interconnect small tiles through
specialized networks, which convey packets with distributed and
dynamic routing. Our design reduces the tile delay (cache access
plus one-hop routing) to a single processor cycle and places cache
lines at a finer granularity than conventional caches, reducing
cache latency. Our evaluations show that in general, an L-NUCA
simultaneously improves performance, energy, and area when
integrated into either conventional or D-NUCA hierarchies.
-
ReSim, A Trace-Driven, Reconfigurable ILP Processor Simulator
[p. 536]
-
S. Fytraki and D. Pnevmatikatos
Modern processors are becoming more complex
and as features and application size increase, their evaluation
is becoming more time-consuming. To date, design space
exploration relies on extensive use of software simulation that,
when highly accurate, is slow.
In this paper we propose ReSim, a parameterizable ILP
processor simulation acceleration engine based on
reconfigurable hardware. We describe ReSim's trace-driven
microarchitecture that allows us to simulate the operation of a
complex ILP processor in a cycle serial fashion, aiming to
simplify implementation complexity and to boost operating
frequency. Being trace driven, ReSim can simulate timing in an
almost ISA independent fashion, and supports all SimpleScalar
ISAs, i.e. PISA, Alpha, etc.
We implemented ReSim for the latest Xilinx devices. In our
experiments with a 4-way superscalar processor ReSim
achieves a simulation throughput of up to 28MIPS, and offers
more than a factor of 5x improvement over the best reported
ILP processor hardware simulators.
-
Heterogeneous Coarse-Grained Processing Elements: A Template Architecture for Embedded
Processing Acceleration
[p. 542]
-
G. Ansaloni, P. Bonzini and L. Pozzi
Reconfigurable Architectures are good candidates for
application accelerators that cannot be set in stone at production
time. FPGAs, however, often suffer from the area
and performance penalty intrinsic in gate-level reconfigurability.
To reduce this overhead, coarse-grained reconfigurable
arrays (CGRAs) are reconfigurable at the ALU level,
but a successful design needs more than computational
power - the main bottleneck usually being memory transfers.
Just like the integration of hardwired multiplier and
memory blocks enabled FPGAs to efficiently implement digital
signal processing applications, in this paper we study a
customizable architecture template based on heterogeneous
processing elements (multipliers, ALU clusters and memories)
that provides enough flexibility to realize fast pipelined
implementations of various loop kernels on a CGRA.
-
Algorithms for the Automatic Extension of an Instruction-Set
[p. 548]
-
C. Galuzzi, D. Theodoropoulos, R. Meeuws and K. Bertels
In this paper, two general algorithms for the automatic generation
of instruction-set extensions are presented. The basic instruction
set of a reconfigurable architecture is specialized with new application-specific
instructions. The paper proposes two methods for the generation
of convex multiple input multiple output instructions, under hardware
resource constraints, based on a two-step clustering process. Initially,
the application is partitioned in single-output instructions of variable
size and then, selected clusters are combined in convex multiple output
clusters following different policies. Our results on well-known kernels
show that the extended instruction set allows applications to execute
more efficiently, in fewer cycles. Our results show that a
significant overall application speed-up is achieved even for large kernels
(for ADPCM decoder the speed-up is up to x2.2 and for TWOFISH
encoder the speedup is up to x5.5).
-
Dimensioning Heterogeneous MPSoCs via Parallelism Analysis
[p. 554]
-
B. Ristau, T. Limberg, O. Arnold and G. Fettweis
In embedded computing we face a continuously
growing algorithm complexity combined with a constantly rising
number of applications running on a single system. Multi-core
systems are becoming popular to cope with these requirements.
Growing computational complexity is handled by increasing the
number of cores and core types within one system - leading to
heterogeneous many-core MPSoCs in the near future. One key
challenge in designing such systems is to determine the number of
cores required to meet performance, power and area constraints.
In this paper we present a methodology that helps in dimensioning
these systems via a novel parallelism analysis methodology within
seconds. The presented methodology has an average performance
estimation error of less than 4% compared to transaction level
simulation.
-
MPSoCs Run-Time Monitoring through Networks-on-Chip
[p. 558]
-
L. Fiorin, G. Palermo and C. Silvano
Networks-on-Chip (NoCs) have emerged as a design
strategy to overcome the limitations, in terms of scalability,
efficiency, and power consumption, of current buses. In this paper,
we discuss the idea of using NoCs to monitor system behaviour
at run-time by tracing activities at initiators and targets. The main
goal of the monitoring system is to retrieve information useful
for run-time optimization and resource allocation in adaptive
systems. Information detected by probes embedded within NIs
is sent to a central unit, in charge of collecting and elaborating
the data. We detail the design of the basic blocks and analyse
the overhead associated with the ASIC implementation of the
monitoring system, as well as discussing implications in terms of
the additional traffic generated in the NoC.
-
Assessing Fat-Tree Topologies for Regular Network-on-Chip Design under Nanoscale
Technology Constraints
[p. 562]
-
D. Ludovici, F. Gilabert, S. Medardoni, C. Gomez, M.E. Gomez, P. Lopez, G. Gaydadjiev
and D. Bertozzi
Most past evaluations of fat-trees for on-chip interconnection
networks rely on oversimplifying or even unrealistic architecture
and traffic pattern assumptions, and very few layout analyses
are available to relieve practical feasibility concerns in nanoscale
technologies. This work aims at providing an in-depth assessment of
physical synthesis efficiency of fat-trees and at extrapolating silicon-aware
performance figures to back-annotate in the system-level performance
analysis. A 2D mesh is used as a reference architecture
for comparison, and a 65 nm technology is targeted by our
study. Finally, in an attempt to mitigate the implementation cost
of k-ary n-tree topologies, we also review an alternative unidirectional
multi-stage interconnection network which is able to simplify
the fat-tree architecture and to minimally impact performance.
-
A Hybrid Packet-Circuit Switched On-Chip Network Based on SDM
[p. 566]
-
M. Modarressi, H. Sarbazi-Azad and M. Arjomand
In this paper, we propose a novel on-chip
communication scheme by dividing the resources of a traditional
packet-switched network-on-chip between a packet-switched and
a circuit-switched sub-network. The former directs packets
according to the traditional packet-switching mechanism, while
the latter forwards packets over circuits which are directly
established between two non-adjacent nodes by bypassing the
intermediate routers. A packet may switch between the sub-networks
several times to reach its destination. The circuits are
set up using a low-latency and low-cost setup-network. The
network resources are split between the two sub-networks using
Spatial-Division Multiplexing (SDM). The work aims to improve
the power and performance metrics of Network-on-Chip (NoC)
architectures and benefits from the power and scalability
advantage of packet-switched NoCs and superior communication
performance of circuit-switching. The evaluation results show a
significant reduction in power and latency over a traditional
packet-switched NoC.
Keywords-network-on-chip;circuit-switching;packet-switching.
-
SecBus: Operating System Controlled Hierarchical Page-Based Memory Bus Protection
[p. 570]
-
L. Su, S. Courcambec, P. Guillemin, C. Schwarz and R. Pacalet
This paper presents a new two-level page-based
memory bus protection scheme. A trusted Operating System
drives a hardware cryptographic unit and manages security
contexts for each protected memory page. The hardware unit
is located between the internal system bus and the memory
controller. It protects the integrity and confidentiality of selected
memory pages. For better acceptability the processor (CPU)
architecture and the software application level are unmodified.
The impact of the security on cost and performance is optimized
by several algorithmic and hardware techniques and by a
differentiated handling of memory pages, depending on their
characteristics.
-
A Link Arbitration Scheme for Quality of Service in a Latency-Optimized Network-on-Chip
[p. 574]
-
J. Diemer and R. Ernst
Networks-on-chip (NoC) for general-purpose multiprocessors
require quality of service mechanisms to allow real-time
streaming applications to be executed along with latency-sensitive
general purpose processing tasks. In this paper, we
propose a NoC link arbitration technique that supports bandwidth
guarantees along with best effort latency optimizations.
In contrast to many existing quality of service mechanisms,
our technique prioritizes best effort over guaranteed bandwidth
traffic for optimal latency. Distributed traffic shaping is used to
offer bandwidth guarantees over previously reserved connections,
which are established dynamically using control messages. Initial
simulation results show that our arbitration scheme can provide
tight bandwidth guarantees for streaming traffic under network
overload conditions. At the same time, the average latency of best
effort traffic is improved compared to a simple prioritization of
streaming traffic.
-
Flow Regulation for On-Chip Communication
[p. 578]
-
Z. Lu, M. Millberg, A. Jantsch, A. Bruce, P. van Der Wolf and T. Henriksson
We propose (σ, ρ)-based flow regulation as a design
instrument for System-on-Chip (SoC) architects to control
quality-of-service and achieve cost-effective communication,
where σ bounds the traffic burstiness and ρ the traffic rate. This
regulation changes the burstiness and timing of traffic flows, and
can be used to decrease delay and reduce buffer requirements in
the SoC infrastructure. In this paper, we define and analyze the
regulation spectrum, which bounds the upper and lower limits
of regulation. Experiments on a Network-on-Chip (NoC) with
guaranteed service demonstrate the benefits of regulation. We
conclude that flow regulation may exert significant positive impact
on communication performance and buffer requirements.
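
To make the (σ, ρ) notation concrete, the sketch below models a regulator as a discrete-time token bucket: at most σ units may leave in a burst, and the long-run rate is limited to ρ units per cycle. The per-cycle traffic model, the function names, and the choice to start with a full bucket are illustrative assumptions, not details from the paper.

```python
from typing import List, Tuple


def regulate(arrivals: List[int], sigma: int, rho: float) -> Tuple[List[int], int]:
    """Shape a per-cycle arrival trace with a token bucket (depth sigma, rate rho).

    Returns the per-cycle departures and the backlog left in the regulator.
    """
    tokens = float(sigma)  # assumption: the bucket starts full
    backlog = 0
    departures = []
    for a in arrivals:
        backlog += a
        sent = min(backlog, int(tokens))  # only whole units can be sent
        departures.append(sent)
        backlog -= sent
        tokens = min(float(sigma), tokens - sent + rho)
    return departures, backlog


def conforms(trace: List[int], sigma: int, rho: float) -> bool:
    """Check that every window of length t carries at most sigma + rho * t units."""
    for i in range(len(trace)):
        total = 0
        for j in range(i, len(trace)):
            total += trace[j]
            if total > sigma + rho * (j - i + 1) + 1e-9:
                return False
    return True


burst = [5, 0, 0, 0, 0, 0, 0, 0]
shaped, left = regulate(burst, sigma=2, rho=0.5)
```

The shaped trace spreads the initial burst over later cycles, which is the kind of change in burstiness and timing that the abstract refers to.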
-
Customizing IP Cores for System-on-Chip Designs Using Extensive External Don't Cares
[p. 582]
-
K.-H. Chang, V. Bertacco and I.L. Markov
Traditional digital circuit synthesis flows start from
an HDL behavioral definition and assume that circuit functions
are almost completely defined, making don't-care conditions rare.
However, recent design methodologies do not always satisfy these
assumptions. For instance, third-party IP blocks used in a system-on-chip
are often over-designed for the requirements at hand. By
focusing only on the input combinations occurring in a specific
application, one could resynthesize the system to reduce its area
and power consumption. Therefore we extend modern digital
synthesis with a novel technique, called SWEDE, that uses external
don't-cares present implicitly in existing simulation-based
verification environments for circuit customization. Experiments
indicate that SWEDE scales to large ICs with half a million input
vectors and handles practical cases well.
-
Extending IP-XACT to Support an MDE Based Approach For SoC Design
[p. 586]
-
A. El Mrabti, F. Petrot and A. Bouchhima
We are interested in the problem of improving IP reuse
in SoC design. This paper presents an MDE based
approach based on a proposed IP-XACT standard extension.
This approach combines the benefits of using MDE techniques in
SoC design such as abstraction levels definition and model
transformation for code generation, and the benefits of the IP-XACT
standard such as a unique exchange format of packaged
IPs (Intellectual Property) with reuse capabilities.
-
Overcoming Limitations of the SystemC Data Introspection
[p. 590]
-
C. Genz and R. Drechsler
Today verification, testing and debugging of
SystemC models can be applied at an early stage in the
design process. To help these techniques gain the required
information about the respective model, the SystemC Verification
Library (SCV) implements a concept called data introspection.
Unfortunately data introspection holds problems that arise with
increasing usage of language features. Native C++ data types for
instance will not appear in meta-data extracted by introspection
facilities.
In this paper we propose a non-intrusive analysis concept to
overcome the drawbacks of traditional data introspection. The
presented approach is a hybrid technique joining a parser to
collect static information and a code generator to evaluate run
time information.
Index Terms - SystemC, data introspection, analysis, intermediate
representation
-
Selective Light Vth Hopping (SLITH): Bridging the Gap between Run-Time Dynamic and Leakage
Power Reduction
[p. 594]
-
H. Xu, R. Vemuri and W.-B. Jone
Ever since the invention of various leakage power
reduction techniques, leakage and dynamic power reduction
techniques have been categorized into two separate sets. Most of them
cannot be applied together during runtime. The gap between
them is due to the large energy breakeven time (EBT) and wake-up
time (WUT) of conventional leakage reduction techniques.
This paper proposes a new leakage reduction technique (SLITH)
based on Vth hopping. SLITH has very low EBT and WUT, yet
keeps the effectiveness of leakage reduction. Thus, it is able to
reduce the gap, and enables joint dynamic and leakage power
reduction. SLITH can be applied together with clock gating, precomputation
and operand isolation etc., and significantly reduces
both dynamic and active leakage power consumption.
Index Terms - runtime leakage power reduction, energy
breakeven time, wake-up time, Vth hopping
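
The role of the energy breakeven time can be shown with a small calculation. The sketch assumes the common definition of EBT as the idle duration at which the leakage energy saved equals the energy spent entering and leaving the low-leakage state; the abstract does not define it, and the function names and numeric values are hypothetical.

```python
def energy_breakeven_time(e_transition_j: float,
                          p_leak_active_w: float,
                          p_leak_low_w: float) -> float:
    """Idle time (s) at which leakage savings equal the transition energy."""
    saving_w = p_leak_active_w - p_leak_low_w
    if saving_w <= 0:
        raise ValueError("low-leakage state must reduce leakage power")
    return e_transition_j / saving_w


def net_energy_saved(idle_s: float, e_transition_j: float,
                     p_leak_active_w: float, p_leak_low_w: float) -> float:
    """Energy saved (J) by switching state for an idle period; negative means a loss."""
    return (p_leak_active_w - p_leak_low_w) * idle_s - e_transition_j


# Illustrative numbers only: 2 nJ transition cost, 1 mW vs 0.2 mW leakage.
ebt = energy_breakeven_time(2e-9, 1e-3, 2e-4)
```

Idle periods shorter than `ebt` lose energy, which is why a technique with a small EBT can be applied to the short idle windows that dynamic power techniques exploit.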
-
A Power-Efficient Migration Mechanism for D-NUCA Caches
[p. 598]
-
A. Bardine, M. Comparetti, P. Foglia, G. Gabrielli and C.A. Prete
D-NUCA L2 caches are able to tolerate the increasing
wire delay effects due to technology scaling thanks to
their banked organization, broadcast line search and data
promotion/demotion mechanism. Data promotion
mechanism aims at moving frequently accessed data near
the core, but causes additional accesses on cache banks,
hence increasing dynamic energy consumption. We show
how, in some cases, this migration mechanism is not
successful in reducing data access latency and can be
selectively and dynamically inhibited, thus reducing
dynamic energy consumption without affecting
performance.
Organizer: Y. Zorian, Virage Logic, US
Moderator: P. Aycinena, US
Panelists: A. Aznar, J.-A. Carballo, R. Madhavan, M. Merced, A. Shubat and R. Yavatkar
-
-
As the complexity of SoC design and manufacturing continues to grow, would it be better to have specialised
companies optimising their corresponding segments of the SoC development ecosystem, or should we rather have vertically
integrated companies that include all necessary specialties under a single roof?
-
Trends and Challenges in Wireless Application Processors
[p. 603]
-
P. Garnier
The rapid deployment of 3G wireless networks is accelerating the demand for application processors to deliver
multimedia-rich wireless services to end-customers. Texas Instruments has pioneered the way with OMAP(tm)
technology. Each generation of OMAP application processors has delivered breakthrough performance with ultra-low
power consumption. This challenging combination has been achieved by applying state-of-the-art power
management technologies to application processors manufactured with leading edge silicon technologies. The trend
towards more performance will continue to drive innovation.
Moderators: Y. Xie, Pennsylvania State U, US; P. Marchal, IMEC, BE
-
System-Level Process Variability Analysis and Mitigation for 3D MPSoCs
[p. 604]
-
S. Garg and D. Marculescu
While prior research has extensively evaluated the performance
advantage of moving from a 2D to a 3D design style, the impact of
process parameter variations on 3D designs has been largely ignored. In
this paper, we attempt to bridge this gap by proposing a variability-aware
design framework for fully-synchronous (FS) and multiple clock-domain
(MCD) 3D systems. First, we develop analytical system-level models of the
impact of process variations on the performance of FS 3D designs. The
accuracy of the model is demonstrated by comparing against transistor-level
Monte Carlo simulations in SPICE - we observe a maximum error
of only 0.7% (average 0.31% error) in the mean of the maximum
critical path delay distribution. Second, to mitigate the impact of process
variations on 3D designs, we propose a variability-aware 3D integration
strategy for MCD 3D systems that maximizes the probability of the
design meeting specified system performance constraints. The proposed
optimization strategy is shown to significantly outperform FS and MCD
3D implementations that are conventionally assembled - for example, the
MCD designs assembled with the proposed integration strategy provide,
on average, 44% and 16.33% higher absolute yield than the FS and
conventional MCD designs respectively, at the 50% yield point of the
conventional MCD designs.
-
Co-Design of Signal, Power, and Thermal Distribution Networks for 3D ICs
[p. 610]
-
Y.-J. Lee, Y.-J. Kim, G. Huang, M. Bakir, Y. Joshi, A. Fedorov and S.K. Lim
Heat removal and power delivery are two major reliability
concerns in the 3D stacked IC technology. Liquid cooling based on
micro-fluidic channels is proposed as a viable solution to dramatically
reduce the operating temperature of 3D ICs. In addition, designers use
a highly complex hierarchical power distribution network in conjunction
with decoupling capacitors to deliver currents to all parts of the 3D IC
while suppressing the power supply noise to an acceptable level. These
so called silicon ancillary technologies, however, pose major challenges
to routing completion and congestion. These thermal and power/ground
interconnects together with those used for signal delivery compete with
one another for routing resources including various types of Through-Silicon-Vias (TSVs).
This paper presents the work on routing with
these interconnects in 3D: signal, power, and thermal networks. We
demonstrate how to consider various physical, electrical, and thermo-mechanical
requirements of these interconnects to successfully complete
routing while addressing various reliability concerns.
-
Design of Compact Imperfection-Immune CNFET Layouts for Standard-Cell-Based Logic Synthesis
[p. 616]
-
S. Bobba, J. Zhang, A. Pullini, D. Atienza and G. De Micheli
The quest for technologies with superior device
characteristics has brought Carbon Nanotube Field Effect
Transistors (CNFETs) into the limelight. Among the several design
aspects necessary for today's grail in CNFET technology,
achieving functional immunity to Carbon Nanotube (CNT)
manufacturing issues (such as mispositioned CNTs and metallic
CNTs) is of paramount importance. In this work we present a
new design technique to build compact layouts while ensuring
100% functional immunity to mispositioned CNTs. Then, as a
second contribution of this work, we have developed a CNFET
Design Kit (DK) to realize a complete design flow from logic-to-GDSII
traversing the conventional CMOS design flow. This flow
enables a framework that allows accurate comparison between
CMOS and CNFET-based circuits. This paper also presents
simulation results to illustrate such analysis, namely, a CNFET-based
inverter can achieve gains, with respect to the Energy-Delay
Product (EDP) metric, of more than 4x in delay, 2x in
energy/cycle and significant area savings (more than 30%) when
compared to a corresponding CMOS inverter benchmarked with
an industrial 65nm technology.
Keywords - Carbon Nanotube Transistors, Logic Synthesis,
CNT, Imperfection Immune, Misaligned Immune, CNFET.
-
Novel Library of Logic Gates with Ambipolar CNTFETs: Opportunities for Multi-Level Logic Synthesis
[p. 622]
-
M.H. Ben Jamaa, K. Mohanram and G. De Micheli
This paper exploits the unique in-field controllability of the device
polarity of ambipolar carbon nanotube field effect transistors (CNTFETs)
to design a technology library with higher expressive power
than conventional CMOS libraries. Based on generalized NOR-NAND-AOI-OAI
primitives, the proposed library of static ambipolar
CNTFET gates efficiently implements XOR functions, provides
full-swing outputs, and is extensible to alternate forms with area-performance
tradeoffs. Since the design of the gates can be regularized,
the ability to functionalize them in-field opens opportunities
for novel regular fabrics based on ambipolar CNTFETs. Technology
mapping of several multi-level logic benchmarks - including
multipliers, adders, and linear circuits - indicates that on average,
it is possible to reduce both the number of gates and area by ~38%
while also improving performance by 6.9X.
Moderators: M. O'Neill, Queen's U Belfast, IE; L. Fesquet, TIMA Laboratory, FR
-
Enhancing Correlation Electro-Magnetic Attack Using Planar Near-Field Cartography
[p. 628]
-
D. Real, F. Valette and M. Drissi
In the field of Side Channel Analysis (SCA), the electromagnetic
radiation of a cryptographic device is the richest
source of information. Indeed, it allows a more accurate
attack: by smartly positioning the EM probe near a given
logic block, the attacker filters out the signal that is not useful for a given
attack. But this advantage can easily become a drawback
if the attacker is unable to position her probe on the
device. Our contribution is an accurate
method for detecting a hot spot on the device, i.e. the position
where a correlation electromagnetic attack (CEMA)
should be the most successful. This strategy is based on an
indicator evaluated during a cartography. Its performance
has been tested on a hardware AES implemented on an
Altera Stratix II.
-
Evaluation on FPGA of Triple Rail Logic Robustness against DPA and DEMA
[p. 634]
-
V. Lomne, P. Maurine, L. Torres, M. Robert, R. Soares and N. Calazans
Side channel attacks are known to be efficient
techniques to retrieve secret data. In this context, this paper
concerns the evaluation of the robustness of triple rail logic
against power and electromagnetic analyses on FPGA devices.
More precisely, it aims at demonstrating that the basic concepts
behind triple rail logic are valid and may provide interesting
design guidelines to get DPA resistant circuits which are also
more robust against DEMA.
Index Terms - DPA, CPA, DEMA, Logic Style, DES, FPGA,
Side-Channel Attacks.
-
Successful Attack of an FPGA-Based WDDL DES Cryptoprocessor without Place and Route
Constraints
[p. 640]
-
L. Sauvage, S. Guilley, J.-L. Danger, Y. Mathieu and M. Nassar
In this paper, we propose a preprocessing method
to improve Side Channel Attacks (SCAs) on Dual-rail
with Precharge Logic (DPL) countermeasure family. The
strength of our method is that it uses intrinsic characteristics
of the countermeasure: classical methods fail when the
countermeasure is perfect, whereas our method still works
and enables us to perform advanced attacks.
We have experimentally validated the proposed method
by attacking a DES cryptoprocessor embedded in a Field
Programmable Gate Array (FPGA), and protected by the
Wave Dynamic Differential Logic (WDDL) countermeasure.
This successful attack, unambiguous as the full key is retrieved,
is the first to be reported.
Keywords: Side-Channel Analysis (SCA), Differential
Power Analysis (DPA), ElectroMagnetic Analysis (EMA),
Dual-rail with Precharge Logic (DPL), Wave Dynamic Differential
Logic (WDDL), Field Programmable Gate Array
(FPGA).
-
Hardware Evaluation of the Stream Cipher-Based Hash Functions Radiogatun and irRUPT
[p. 646]
-
L. Henzen, F. Carbognani, N. Felber and W. Fichtner
In the coming years, new hash function candidates will replace
the old MD5 and SHA-1 standards and the current
SHA-2 family. The hash algorithms RadioGatún and irRUPT
are potential successors based on a stream structure,
which allows the achievement of high throughputs (particularly
with long input messages) with minimal area occupation.
In this paper, several hardware architectures of the two
above mentioned hash algorithms have been investigated.
The implementation on ASIC of RadioGatún with a word
length of 64 bits shows a complexity of 46 k gate equivalents
(GE) and reaches 5.7 Gbps throughput with a 3×64-bit
input message. The same design approaches 120 Gbps
on ASIC with long input messages (63.4 Gbps on a
Virtex-4 FPGA with 2.9 kSlices). On the other hand, the irRUPT
core turns out to be the most compact circuit (only 5.8 kGE
on ASIC, and 370 Slices on FPGA) achieving 2.4 Gbps (with
long input messages) on ASIC, and 1.1 Gbps on FPGA.
Moderators: D. Pnevmatikatos, TU Crete, GR; L. Pozzi, Lugano U, IT
-
Architectural Support for Low Overhead Detection of Memory Violations
[p. 652]
-
S. Ghose, L. Gilgeous, P. Dudnik, A. Aggarwal and C. Waxman
Violations in memory references cause tremendous loss of
productivity, catastrophic mission failures, loss of privacy and
security, and much more. Software mechanisms to detect
memory violations have high false positive and negative rates
or huge performance overhead. This paper proposes architectural
support to detect memory reference violations in inherently
unsafe languages such as C and C++. In this approach,
the ISA is extended to include "safety" instructions
that provide compile-time information on pointers and objects.
The microarchitecture is extended to efficiently execute the
safety instructions. We explore optimizations, such as delayed
violation detection and stack-based handling of local pointers,
to reduce the performance overhead. Our experiments show
that the synergy between hardware and software results in this
approach having less than 5% average performance overhead,
while an exclusively software mechanism incurs 480%
impact for the same benchmarks.
-
CASPAR: Hardware Patching for Multi-Core Processors
[p. 658]
-
I. Wagner and V. Bertacco
Ensuring correctness of execution of complex multi-core processor
systems deployed in the field remains to this day an
extremely challenging task. The major part of this effort is
concentrated on design verification, where different pre- and
post-silicon techniques are used to guarantee that devices
behave exactly as stated in the specification. Unfortunately,
the performance of even state-of-the-art validation tools lags
behind the growing complexity of multi-core designs. Therefore,
subtle bugs still slip into released components, causing
incorrect computational results, or even compromising the
security of the end-user systems.
In this work we present Caspar - an approach for in-the-field
patching of the memory subsystem hardware in multi-core
chips. Caspar relies on a checkpointing system, which
periodically logs the state of the chip, and a novel error detection
and recovery scheme, which uses a simplified mode of
operation to bypass cache coherence and consistency errors.
The implementation of Caspar employs hardware detectors,
on-die programmable circuits, to identify system configurations
that may lead to bugs, and to trigger recovery and
bypass. Our experimental results show that Caspar can be
used effectively to detect and bypass a variety of memory
subsystem bugs, with as little as 2% performance impact
and 6% area overhead during bug-free operation.
-
A New Speculative Addition Architecture Suitable for Two's Complement Operations
[p. 664]
-
A. Cilardo
Existing architectures for speculative addition are
all based on the assumption that operands have uniformly
distributed bits, which is rarely verified in real applications.
As a consequence, they may be disadvantageous for real-world
workloads, although in principle faster than standard adders. To
address this limitation, we introduce a new architecture based on
an innovative technique for speculative global carry evaluation.
The proposed architecture solves the main drawback of existing
schemes and, evaluated on real-world benchmarks, it exhibits
an interesting performance improvement with respect to both
standard adders and alternative architectures for speculative
addition.
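
To make the idea of speculative addition concrete, the following is a minimal sketch of a generic windowed scheme, in which the carry into each bit is guessed from only the `window` bit positions below it. It is not the architecture proposed in the paper; the function names and the `window` parameter are assumptions introduced purely for illustration.

```python
def exact_add(a: int, b: int, width: int) -> int:
    """Reference addition, truncated to `width` bits."""
    return (a + b) & ((1 << width) - 1)


def speculative_add(a: int, b: int, width: int, window: int) -> int:
    """Add two unsigned `width`-bit values, deriving the carry into each bit
    from only the `window` positions directly below it (the carry into the
    window itself is assumed to be 0). Illustrative model only."""
    result = 0
    for i in range(width):
        lo = max(0, i - window)          # lowest bit position inspected
        span = i - lo                    # number of bits inspected
        mask = (1 << span) - 1
        # Carry into bit i, speculated from bits lo..i-1 only.
        carry = (((a >> lo) & mask) + ((b >> lo) & mask)) >> span
        bit = ((a >> i) ^ (b >> i) ^ carry) & 1
        result |= bit << i
    return result
```

In this model the result is wrong whenever a carry has to propagate across more than `window` positions. Long carry chains are rare for uniformly random operands, which is the assumption the abstract questions; small negative two's complement values, for instance, consist mostly of 1 bits and can trigger such chains.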
-
Limiting the Number of Dirty Cache Lines
[p. 670]
-
P. De Langen and B. Juurlink
Caches often employ write-back instead of write-through,
since write-back avoids unnecessary transfers for multiple
writes to the same block. For several reasons, however,
it is undesirable that a significant number of cache lines will
be marked "dirty". Energy-efficient cache organizations, for
example, often apply techniques that resize, reconfigure, or turn
off (parts of) the cache. In such cache organizations, dirty lines
have to be written back before the cache is reconfigured. The
delay imposed by these write-backs or the required additional
logic and buffers can significantly reduce the attained energy
savings. A cache organization called the clean/dirty cache (CD-cache)
is proposed that combines the properties of write-back
and write-through. It avoids unnecessary transfers for recurring
writes, while restricting the number of dirty lines to a hard limit.
Detailed experimental results show that the CD-cache reduces
the number of dirty lines significantly, while achieving similar
or better performance. We also use the CD-cache to implement
cache decay. Experimental results show that the CD-cache attains
similar or higher performance than a normal decay cache, while
using a significantly less complex design.
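
The combination of write-back behaviour with a hard limit on dirty lines can be pictured with a toy model. The sketch below is not the CD-cache organization itself: it has no capacity limit or replacement policy, and the class name, the `max_dirty` parameter and the "clean the oldest dirty line" rule are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class DirtyLimitedCache:
    """Toy write-back store that never holds more than `max_dirty` dirty lines."""

    def __init__(self, max_dirty: int):
        self.max_dirty = max_dirty
        self.lines = {}              # address -> cached value
        self.dirty = OrderedDict()   # dirty addresses, least recently written first
        self.memory = {}             # backing store
        self.writebacks = 0

    def write(self, addr: int, value: int) -> None:
        self.lines[addr] = value
        # A repeated write to an already dirty line causes no memory transfer.
        self.dirty.pop(addr, None)
        self.dirty[addr] = True
        if len(self.dirty) > self.max_dirty:
            # Over the limit: clean the least recently written dirty line.
            victim, _ = self.dirty.popitem(last=False)
            self._write_back(victim)

    def _write_back(self, addr: int) -> None:
        self.memory[addr] = self.lines[addr]
        self.writebacks += 1

    def flush(self) -> int:
        """Clean every dirty line, e.g. before a resize; returns lines written."""
        count = len(self.dirty)
        while self.dirty:
            addr, _ = self.dirty.popitem(last=False)
            self._write_back(addr)
        return count
```

Because at most `max_dirty` lines are ever dirty, the cost of `flush()` before a reconfiguration is bounded, while recurring writes to the same address still stay in the cache.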
Organizer/Moderator: E.J. Marinissen, IMEC, BE
-
Contactless Testing: Possibility or Pipe-Dream?
[p. 676]
-
E.J. Marinissen, D.Y. Lee, J.P. Hayes, C. Sellathamby, B. Moore, S. Slupsky and L. Pujol
The traditionally wired interfaces of many electronic systems are in many applications being replaced by wireless interfaces.
Testing of electronic systems (both integrated circuits and printed circuit boards) still requires physical electrical contact through
probe needles and/or sockets. This paper addresses the state-of-the-art, options, and hurdles-still-to-take of contactless testing,
which would resolve many test challenges due to shrinking size and pitch of pads and pins and inaccessibility of advanced assembly
techniques such as System-in-Package (SiP) and 3D stacked ICs.
Moderators: A. Girault, INRIA Rhone Alpes, FR; L. Almeida, Aveiro U, PT
-
Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors
[p. 682]
-
V. Izosimov, I. Polian, P. Pop, P. Eles and Z. Peng
In this paper we propose an approach to the design optimization of
fault-tolerant hard real-time embedded systems, which combines
hardware and software fault tolerance techniques. We trade off
between selective hardening in hardware and process re-execution
in software to provide the required levels of fault tolerance against
transient faults with the lowest-possible system costs. We propose
a system failure probability (SFP) analysis that connects the
hardening level with the maximum number of re-executions in
software. We present design optimization heuristics, to select the
fault-tolerant architecture and decide process mapping such that
the system cost is minimized, deadlines are satisfied, and the
reliability requirements are fulfilled.
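
The link between hardening level and number of re-executions can be illustrated with a deliberately simplified model, which is not the SFP analysis of the paper. It assumes a single process whose executions fail independently with probability `p_fail` (lower for a more hardened processor), so that the process fails only if the original execution and all `k` re-executions fail. All names below are hypothetical.

```python
def process_failure_probability(p_fail: float, k: int) -> float:
    """Probability that all k + 1 executions (1 original, k re-executions) fail,
    assuming independent transient faults."""
    return p_fail ** (k + 1)


def min_reexecutions(p_fail: float, target: float, k_max: int = 64) -> int:
    """Smallest k whose failure probability does not exceed `target`."""
    for k in range(k_max + 1):
        if process_failure_probability(p_fail, k) <= target:
            return k
    raise ValueError("target not reachable within k_max re-executions")
```

Under these assumptions, a less hardened processor (larger `p_fail`) needs more re-executions to reach the same reliability target, which is the kind of hardware/software trade-off the abstract describes.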
-
On Bounding Response Times under Software Transactional Memory in Distributed
Multiprocessor Real-Time Systems
[p. 688]
-
S.F. Fahmy, B. Ravindran and E.D. Jensen
We consider multiprocessor distributed real-time
systems where concurrency control is managed using
software transactional memory (or STM). For such a
system, we propose an algorithm to compute an upper
bound on the response time. The proposed algorithm
can be used to study the behavior of systems where node
crash failures are possible. We compare the result of the
proposed algorithm to a simulation of the system being
studied in order to determine its efficacy. The results
of our study indicate that it is possible to provide timeliness
guarantees for multiprocessor distributed systems
programmed using STM.
-
An Approximation Scheme for Energy-Efficient Scheduling of Real-Time Tasks in Heterogeneous
Multiprocessor Systems
[p. 694]
-
C.-Y. Yang, J.-J. Chen, T.-W. Kuo and L. Thiele
As application complexity increases, modern embedded
systems have adopted heterogeneous processing elements to enhance
the computing capability or to reduce the power consumption. The
heterogeneity has introduced challenges for energy efficiency in hardware
and software implementations. This paper studies how to partition
real-time tasks on a platform with heterogeneous processing elements
(processors) so that the energy consumption can be minimized. The
power consumption models considered in this paper are very general
by assuming that the energy consumption with higher workload is larger
than that with lower workload, which is true for many systems. We
propose an approximation scheme to derive near-optimal solutions for
different hardware configurations in energy/power consumption. When
the number of processors is a constant, the scheme is a fully polynomial-time
approximation scheme (FPTAS) to derive a solution with energy
consumption very close to the optimal energy consumption in polynomial-time/space
complexity. Experimental results reveal that the proposed
scheme is very effective in energy efficiency in comparison to the
state-of-the-art algorithm.
Keywords: Multiprocessor scheduling, Heterogeneous multiprocessor,
Energy-efficient scheduling.
Moderators: T. Kazmierski, Southampton U, UK; L. Hedrich, J W Goethe U Frankfurt/M, DE
-
A Graph Grammar Based Approach to Automated Multi-Objective Analog Circuit Design
[p. 700]
-
A. Das and R. Vemuri
This paper introduces a graph grammar based
approach to automated topology synthesis of analog circuits. A
grammar is developed to generate circuits through production
rules, that are encoded in the form of a derivation tree. The
synthesis has been sped up by using dynamically obtained design-suitable
building blocks. Our technique has certain advantages
when compared to other tree-based approaches like GP based
structure generation. Experiments conducted on an opamp and
a VCO design show that, unlike previous works, we are capable
of generating both manual-like designs (bookish circuits) as well
as novel designs (unfamiliar circuits) for multi-objective analog
circuit design benchmarks.
-
Massively Multi-Topology Sizing of Analog Integrated Circuits
[p. 706]
-
P. Palmers, T. McConaghy, M. Steyaert and G. Gielen
This paper demonstrates a system that performs multi-objective
sizing across 100,000 analog circuit topologies
simultaneously, with SPICE accuracy. It builds on a previous
system, MOJITO, which searches through 3500 topologies
defined by a hierarchically-organized set of 30 analog
blocks. This paper improves MOJITO's results quality
via three key extensions. First, it enlarges the block library
to enable symmetrical transconductance amplifiers
and more. Second, it improves initial topology diversity
via optimization-based constraint satisfaction. Third,
it maintains topology diversity during search via a novel
multi-objective selection mechanism, dubbed TAPAS. MOJITO+TAPAS
is demonstrated on a problem with 6 objectives,
returning a tradeoff holding 17438 nondominated designs.
The tradeoff comprises 152 unique topologies that
include the newly-introduced topologies. 59 designs
across 12 topologies outperform an expert-designed
reference circuit.
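
For readers unfamiliar with the term, a design is nondominated when no other design is at least as good in every objective and strictly better in at least one. The following sketch filters such designs from a list of objective vectors. It assumes every objective is minimized and is not the MOJITO or TAPAS selection mechanism; the function names are illustrative.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the points that no other point dominates (quadratic-time filter)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

With two hypothetical objectives such as power and area, `nondominated([(1, 5), (2, 2), (3, 3), (5, 1)])` drops `(3, 3)` because `(2, 2)` is better in both.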
-
Improved Performance and Variation Modelling for Hierarchical-Based Optimisation of Analogue
Integrated Circuits
[p. 712]
-
S. Ali, L. Ke, R. Wilcock and P. Wilson
A new approach in hierarchical optimisation is presented
which is capable of optimising both the performance and
yield of an analogue design. Performance and yield trade-offs
are analysed using a combination of multi-objective
evolutionary algorithms and Monte Carlo simulations. A
behavioural model that combines the performance and
variation for a given circuit topology is developed which
can be used to optimise the system level structure. The
approach enables top-down system optimisation, not only
for performance but also for yield. The model has been
developed in Verilog-A and tested extensively with
practical designs using the Spectre simulator. A
performance and variation model of a 5 stage voltage
controlled ring oscillator has been developed and a PLL
design is used to demonstrate hierarchical optimisation at
the system level. The results have been verified with
transistor level simulations and suggest that an accurate
performance and yield prediction can be achieved with
the proposed algorithm.
-
Computation of IP3 Using Single-Tone Moments Analysis
[p. 718]
-
D. Tannir and R. Khazaka
Intermodulation distortion is one of the key design requirements
of Radio Frequency circuits. The standard approach
for analyzing distortion using circuit simulators is
to mimic measurement environments and compute the response
due to a two-tone input. This considerably increases
the CPU cost of the simulation because of the large number
of variables resulting from the harmonics of these two
tones and their intermodulation products. In this paper, we
propose an analytical method for directly obtaining the intermodulation
distortion from the Harmonic Balance equations
with only a single-tone input, without the need to
perform a Harmonic Balance simulation. The proposed
method is shown to be significantly faster than traditional
simulation based approaches.
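
For context, the conventional two-tone approach mentioned above usually extrapolates the third-order intercept point from the gap between the fundamental and the third-order intermodulation product. The sketch below applies that standard relation; it is not the single-tone moments method of the paper. It assumes all powers are in dBm and that the measurement is taken in the small-signal region, where the third-order product rises 3 dB per dB of input power.

```python
def ip3_from_two_tone(p_in_dbm: float, p_fund_dbm: float, p_im3_dbm: float):
    """Extrapolate input- and output-referred IP3 from one two-tone measurement.

    p_in_dbm   : input power per tone
    p_fund_dbm : output power of one fundamental tone
    p_im3_dbm  : output power of one third-order intermodulation product
    """
    delta = p_fund_dbm - p_im3_dbm      # spacing between fundamental and IM3, in dB
    iip3 = p_in_dbm + delta / 2.0       # input-referred intercept point
    oip3 = p_fund_dbm + delta / 2.0     # output-referred intercept point
    return iip3, oip3
```

For example, tones at -30 dBm that produce -20 dBm fundamentals and -80 dBm intermodulation products give a 60 dB spacing, so the intercept lies 30 dB above the measured operating point.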
Moderator: R. Popp, edacentrum, DE
-
Formal Approaches to Analog Circuit Verification
[p. 724]
-
E. Barke, D. Grabowski, H. Graeb, L. Hedrich, S. Heinen, R. Popp, S. Steinhorst and Y. Wang
For a speed-up of analog design cycles to keep
up with the continuously decreasing time to market, iterative
design refinement and redesigns are more than ever regarded
as showstoppers. To deal with this issue, referred to as design
and verification gap, the development of a continuous and
consistent verification is mandatory. In digital design, formal
verification methods are considered as a key technology for
efficient design flows. However, industrial availability of formal
methods for analog circuit verification is still negligible despite
a growing need. In recent years, research institutions have made
considerable advances in the area of formal verification of analog
circuits. This paper presents a selection of four recent approaches
in analog verification that cover a broad scope of verification
philosophies.
Organizer: L. Toda, Mentor Graphics, US
Moderator: W. Rhines, Mentor Graphics, US
-
While electronic system level (ESL) is being adopted in most electronic companies, there is still a need to explore
and adopt new methodologies for early design development. While there have been some successes in the areas of
system analysis and virtual prototyping by SoC architects and software developers, investment in modelling
can be costly or hard to secure. Also, there is yet to be a standard for applying IP power modelling that fits with a TLM
terminology and into the overall system modelling process. Although power is hardly analyzed today even at the
RTL level, "access" to power at the ESL domain may become much more critical, given the impact designers can
have on power behaviour at this level. Ideally, software developers can also gain visibility into power dynamics and
adjust their development flow to accommodate power guidelines, as well. With the current isolated HW and SW
flows, this may seem unrealistic. What are some of the pitfalls of being "too early" in applying new technologies?
This panel will explore critical issues and possible solutions for designing power applications, enabling engineering
teams to rethink their approach to design planning using available ESL tools and methods.
Organizer/Moderator: Y. Xie, Pennsylvania State U, US
-
An Overview of Non-Volatile Memory Technology and the Implication for Tools and Architectures
[p. 731]
-
H. Li and Y. Chen
Novel nonvolatile memory technologies are gaining
significant attention from the semiconductor industry in the
competition to develop a universal memory. We used Spin-Transfer
Torque Random Access Memory (STT-RAM) and
Resistive Random Access Memory (R-RAM) as examples to
discuss the implication of emerging nonvolatile memory for tools
and architectures. Three aspects, including device and memory
cell modeling, device/circuit co-design consideration and novel
memory architecture, are discussed in detail. The goal of these
discussions is to design a high-density, low-power, high-performance
nonvolatile memory with simple architecture and
minimized circuit design complexity.
Keywords - Universal memory; STT-RAM; R-RAM; MTJ
device modeling; memory yield improvement.
-
Power and Performance of Read-Write Aware Hybrid Caches with Non-volatile Memories
[p. 737]
-
X. Wu, J. Li, L. Zhang, E. Speight and Y. Xie
Caches made of non-volatile memory technologies,
such as Magnetic RAM (MRAM) and Phase-change RAM
(PRAM), offer dramatically different power-performance characteristics
when compared with SRAM-based caches, particularly
in the areas of static/dynamic power consumption, read and write
access latency and cell density. In this paper, we propose to
take advantage of the best characteristics that each technology
has to offer through the use of read-write aware Hybrid Cache
Architecture (RWHCA) designs, where a single level of cache can
be partitioned into read and write regions, each of a different
memory technology with disparate read and write characteristics.
We explore the potential of hardware support for intra-cache
data movement within RWHCA caches. Utilizing a full-system
simulator that has been validated against real hardware, we
demonstrate that a RWHCA design with a conservative setup
can provide a geometric mean 55% power reduction and yet 5%
IPC improvement over a baseline SRAM cache design across
a collection of 30 workloads. Furthermore, a 2-layer 3D cache
stack (3DRWHCA) of high density memory technology with the
same chip footprint still gives 10% power reduction and boosts
performance with a 16% IPC improvement over the baseline.
-
Using Non-Volatile Memory to Save Energy in Servers
[p. 743]
-
D. Roberts, T. Kgil and T. Mudge
Recent breakthroughs in circuit and process technology
have enabled new usage models for non-volatile memory
technologies such as Flash and phase change RAM (PCRAM) in
the general purpose computing environment. These technologies
display high density and low power consumption as well as
persistency that are appealing properties in a memory device.
This paper summarizes our earlier work on improving NAND
Flash based disk caches and extends it to consider PCRAM.
We first present the primary challenges in reliably managing
non-volatile memories such as NAND Flash, reviewing our
past work on architectural support for Flash manageability.
We then provide a preliminary analysis of how our current
Flash manageability architecture may be simplified when we
replace Flash with PCRAM. Our evaluations on PCRAM show
a potential for more than a 65% throughput improvement for a
disk-intensive database workload. Although more detailed studies
are needed, we conclude that PCRAM is a strong contender to
replace Flash if it becomes cost-effective.
Moderators: V. Zaccaria, Politecnico di Milano, IT; F. Petrot, TIMA Laboratory, FR
-
aEqualized: A Novel Routing Algorithm for the Spidergon Network on Chip
[p. 749]
-
N. Concer, S. Iamundo and L. Bononi
We present the aEqualized routing algorithm, a
novel algorithm for the Spidergon Network on Chip. AEqualized
combines the well-known aFirst and aLast algorithms proposed
in the literature, obtaining an optimized use of the channels of the
network. This optimization makes it possible to reduce the number of
channels actually implemented on the chip while maintaining
performance similar to that achieved by the two basic algorithms. In
the second part of this paper, we propose a variation on the
Spidergon's router architecture that enhances the performance
of the network especially when the aEqualized routing algorithm
is adopted.
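
As background for the aFirst baseline named above, the sketch below models across-first routing as it is commonly described for Spidergon: an even number of nodes on a bidirectional ring, each with one extra "across" link to the diametrically opposite node, which is taken first when the destination is more than a quarter of the ring away. This is an assumed textbook-style model, not the aEqualized algorithm or the router architecture of the paper; the function name and node numbering are illustrative.

```python
def across_first_route(src: int, dst: int, n: int) -> list:
    """Return the nodes visited from src to dst on an n-node Spidergon.

    Node i is linked to (i + 1) % n, (i - 1) % n and (i + n // 2) % n.
    """
    if n % 2 != 0:
        raise ValueError("Spidergon needs an even number of nodes")
    path = [src]
    cur = src
    d = (dst - cur) % n                 # clockwise distance to the destination
    if n / 4 < d < 3 * n / 4:
        # Destination is far away on the ring: take the across link first.
        cur = (cur + n // 2) % n
        path.append(cur)
        d = (dst - cur) % n
    step = 1 if d <= n // 2 else -1     # shorter ring direction from here
    while cur != dst:
        cur = (cur + step) % n
        path.append(cur)
    return path
```

On a 16-node network, a packet from node 0 to node 5 would travel 0, 8, 7, 6, 5 (four hops) instead of five hops along the ring.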
-
Group-Caching for NoC Based Multicore Cache Coherent Systems
[p. 755]
-
W. Zuo, S. Feng, Z. Qi, J. Weixing, L. Jiaxin, D. Ning, X. Licheng, T. Yuan and Q. Baojun
Most CMPs use on-chip networks to connect cores and
tend to integrate more simple cores on a single die. Low-radix
networks, such as 2D-MESH, are widely used in tiled CMPs since
they can be mapped to on-chip networks efficiently. However,
low-radix networks introduce high network latency caused by
long diameter. In this paper, we propose the use of group-caching
design in NoC based multicore cache coherent systems. In our
design, on-chip L2 banks are organized to form multiple groups.
Each cache group behaves like a shared L2 cache for the cores
inside cache group while the cache coherence between cache
groups is maintained by coherence messages. Besides, group-caching
also adopts the new cache replacement policy to improve
the inefficient use of the aggregate L2 cache capacity. Compared
to banked and shared L2 design, as most L2 accesses are served
by local cache group, the hop count is significantly reduced.
Experiment results based on full-system simulation show that for
2D-MESH, group-caching can increase the performance by
2%~8% compared to banked and shared L2 design, with
network energy consumption reduced by 11%~13%. Experiment
results also show that the communication overhead inside cache
group plays an important role in the performance of group-caching.
Keywords-CMP; NOC; network latency; L2 banks; cache
coherence; group-caching; performance; power
-
A Monitor Interconnect and Support Subsystem for Multicore Processors
[p. 761]
-
S. Madduri, R. Vadlamani, W. Burleson and R. Tessier
In many current SoCs, the architectural interface to on-chip
monitors is ad hoc and inefficient. In this paper, a new
architectural approach which advocates the use of a separate low-overhead
subsystem for monitors is described. A key aspect of this
approach is an on-chip interconnect specifically designed for
monitor data with different priority levels. The efficiency of our
monitor interconnect is assessed for a multicore system using both
an interconnect and a system-level simulator. Collected monitor
information is used by a dedicated processor to control the
frequency and voltage of individual multicore processors.
Experimental results show that the new low-overhead subsystem
facilitates employment of thermal and delay-aware dynamic voltage
and frequency scaling.
Moderators: L. Lavagno, Politecnico di Torino, IT; W. Kruijtzer, NXP Semiconductors, NL
-
A Real-Time Application Design Methodology for MPSoCs
[p. 767]
-
G. Beltrame, L. Fossati and D. Sciuto
This paper presents a novel technique for the modeling,
simulation, and analysis of real-time applications on Multi-Processor
Systems-on-Chip (MPSoCs). This technique is
based on an application-transparent emulation of OS primitives,
including support for RTOS elements. The proposed
methodology enables a quick evaluation of the real-time
performance of an application under different design
choices, including the study of the system's behavior as task
deadlines become stricter or looser. The approach has been
verified on a large set of multi-threaded benchmarks. Results
show that our methodology (a) enables accurate real-time
and responsiveness analysis of parallel applications
running on MPSoCs, (b) allows the designer to devise an
optimal interrupt distribution mechanism for the given application,
and (c) helps dimension the system to meet
performance and real-time needs.
-
Adaptive Prefetching for Shared Cache Based Chip Multiprocessors
[p. 773]
-
M. Kandemir, Y. Zhang and O. Ozturk
Chip multiprocessors (CMPs) present a unique scenario
for software data prefetching with subtle tradeoffs between
memory bandwidth and performance. In a shared L2 based
CMP, multiple cores compete for the shared on-chip cache
space and limited off-chip pin bandwidth. Purely software based
prefetching techniques tend to increase this contention, leading
to degradation in performance. In some cases, prefetches can
become harmful by kicking out useful data from the shared
cache whose next usage is earlier than the prefetched data,
and the fraction of such harmful prefetches usually increases
when we increase the number of cores used for executing a
multi-threaded application code. In this paper, we propose two
complementary techniques to address the problem of harmful
prefetches in the context of shared L2 based CMPs. These
techniques, namely, suppressing select data prefetches (if they
are found to be harmful) and pinning select data in the L2 cache
(if they are found to be frequent victims of harmful prefetches), are
evaluated in this paper using two embedded application codes.
Our experiments demonstrate that these two techniques are very
effective in mitigating the impact of harmful prefetches, and as a
result, we extract significant benefits from software prefetching
even with large core counts.
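The definition of a harmful prefetch given above (the evicted data is needed sooner than the prefetched data) can be illustrated with a minimal Python sketch. It assumes the future access trace is known, as in an offline analysis; the function names and the trace format are hypothetical and are not taken from the paper.

```python
def next_use(trace, start, block):
    """Return the index of the next access to `block` at or after `start`, or None."""
    for i in range(start, len(trace)):
        if trace[i] == block:
            return i
    return None


def is_harmful_prefetch(trace, now, prefetched, victim):
    """A prefetch is harmful if the evicted block is needed before the prefetched one."""
    victim_use = next_use(trace, now, victim)
    prefetched_use = next_use(trace, now, prefetched)
    if victim_use is None:
        return False  # victim is never reused, so evicting it costs nothing
    if prefetched_use is None:
        return True   # prefetched block is never used, but the victim is
    return victim_use < prefetched_use


# Hypothetical shared-L2 access trace (block names only).
trace = ["A", "B", "C", "B", "D", "A"]
# At time 1, prefetching "D" evicts "B": "B" is needed at index 1, "D" at index 4.
harmful = is_harmful_prefetch(trace, 1, prefetched="D", victim="B")
# At time 3, prefetching "D" evicts "C": "C" is never accessed again.
harmless = is_harmful_prefetch(trace, 3, prefetched="D", victim="C")
```

A run-time scheme cannot see the future trace, so it would have to estimate this condition from observed behavior; the sketch only captures the definition.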
-
CUFFS: An Instruction Count Based Architectural Framework for Security of MPSoCs
[p. 779]
-
K. Patel, S. Parameswaran and R. Ragel
Multiprocessor System on Chip (MPSoC) architecture is
rapidly gaining momentum for modern embedded devices. The vulnerabilities
in software on MPSoCs are often exploited to cause software
attacks, which are the most common type of attacks on embedded systems.
Therefore, we propose an MPSoC architectural framework, CUFFS, for
an Application Specific Instruction set Processor (ASIP) design that has a
dedicated security processor called iGuard for detecting software attacks.
The CUFFS framework instruments the source code in the application
processors at the basic block (BB) level with special instructions that allow
communication with iGuard at runtime. The framework also analyzes
the code in each application processor at compile time to determine
the program control flow graph and the number of instructions in each
basic block, which are then stored in the hardware tables of iGuard. The
iGuard uses its hardware tables to verify the applications' execution at
runtime. For the first time, we propose a framework that probes the
application processors to obtain their Instruction Count and employs an
actively engaging security processor that can detect attacks even when an
application processor does not communicate with iGuard.
CUFFS relies on the exact number of instructions in the basic block to
determine an attack which is superior to other time-frame based measures
proposed in the literature. We present a systematic analysis on how CUFFS
can thwart common software attacks. Our implementation of CUFFS on
the Xtensa LX2 processor from Tensilica Inc. had a worst case run-time
penalty of 44% and an area overhead of about 28%.
Categories and Subject Descriptors
B.8.2 [Performance and Reliability]: Performance Analysis and Design
Aids
General Terms
Design, Performance, Security
Keywords
Architecture, Instruction Count, MPSoC, Attacks, Tensilica
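As a rough illustration of the check described in the abstract above, the following Python sketch models a lookup table that stores, for each basic block, its instruction count and its legal successors, and flags the first report that disagrees with the table. The table contents, block names, and the `check_trace` function are hypothetical; the actual iGuard uses hardware tables and its exact protocol is not given here.

```python
# Hypothetical compile-time table: instruction count and legal successors
# for each basic block (BB). This only models the idea in software.
BB_TABLE = {
    "bb0": {"count": 4, "next": {"bb1", "bb2"}},
    "bb1": {"count": 7, "next": {"bb3"}},
    "bb2": {"count": 3, "next": {"bb3"}},
    "bb3": {"count": 5, "next": set()},
}


def check_trace(trace, table=BB_TABLE):
    """Return the index of the first violating report, or None if the trace is clean.

    `trace` is a list of (bb_id, executed_instruction_count) pairs.
    """
    prev = None
    for i, (bb, executed) in enumerate(trace):
        if bb not in table:
            return i  # unknown basic block
        if prev is not None and bb not in table[prev]["next"]:
            return i  # control-flow edge not present in the graph
        if executed != table[bb]["count"]:
            return i  # instruction count mismatch
        prev = bb
    return None


clean = check_trace([("bb0", 4), ("bb1", 7), ("bb3", 5)])
injected = check_trace([("bb0", 4), ("bb2", 9), ("bb3", 5)])  # extra instructions in bb2
hijacked = check_trace([("bb0", 4), ("bb3", 5)])              # bb0 -> bb3 is not an edge
```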
Moderators: S. Kundu, Massachusetts U, US; M. Violante, Politecnico di Torino, IT
-
Design as You See FIT: System-Level Soft Error Analysis of Sequential Circuits
[p. 785]
-
D. Holcomb, W. Li and S.A. Seshia
Soft errors in combinational and sequential elements of digital
circuits are an increasing concern as a result of technology
scaling. Several techniques for gate and latch hardening
have been proposed to synthesize circuits that are tolerant
to soft errors. However, each such technique has associated
overheads of power, area, and performance. In this paper,
we present a new methodology to compute the failures in
time (FIT) rate of a sequential circuit where the failures are at
the system-level. System-level failures are detected by monitors
derived from functional specifications. Our approach includes
efficient methods to compute the FIT rate of combinational
circuits (CFIT), incorporating effects of logical, timing,
and electrical masking. The contribution of circuit components
to the FIT rate of the overall circuit can be computed from the
CFIT and probabilities of system-level failure due to soft errors
in those elements. Designers can use this information to
perform Pareto-optimal hardening of selected sequential and
combinational components against soft errors. We present experimental
results demonstrating that our analysis is efficient,
accurate, and provides data that can be used to synthesize a
low-overhead, low-FIT sequential circuit.
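The statement that component contributions follow from the CFIT and the probability of system-level failure can be made concrete with a small Python sketch. It assumes, as a simplification not confirmed by the abstract, that each contribution is the component's FIT rate multiplied by the probability that a soft error in it causes a system-level failure. All component names and numbers are invented for illustration.

```python
def system_fit(components):
    """Sum per-component contributions: FIT_i * P(system failure | error in i)."""
    return sum(fit * p_fail for fit, p_fail in components.values())


def contributions(components):
    """Per-component share of the system-level FIT rate, largest first."""
    items = [(name, fit * p) for name, (fit, p) in components.items()]
    return sorted(items, key=lambda kv: kv[1], reverse=True)


# Hypothetical values: (component FIT rate, probability of system-level failure).
components = {
    "latch_a": (100.0, 0.50),
    "latch_b": (200.0, 0.05),
    "gate_c": (40.0, 0.25),
}
total = system_fit(components)       # 50 + 10 + 10 = 70
ranking = contributions(components)  # latch_a dominates despite its lower raw FIT
```

Under this assumption, hardening "latch_a" first gives the largest reduction per hardened element, which is the kind of ranking a selective hardening flow would need.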
-
Detecting Errors Using Multi-Cycle Invariance Information
[p. 791]
-
N. Alves, K. Nepal, J. Dworak and R.I. Bahar
Ensuring reliable computation at the nanoscale
requires mechanisms to detect and correct errors during normal
circuit operation. In this paper we propose a method for designing
efficient online error detection schemes for circuits based on
the identification of invariant relationships in hardware. More
specifically, we present a technique that automatically identifies
multi-cycle gate-level invariant relationships - where no knowledge
of high-level behavioral constraints is required to identify
the relationships - and generates the checker logic that verifies
these implications. Our results show that cross-cycle implications
are particularly useful in discovering difficult-to-detect errors
near latch boundaries, and can have a significant impact on
boosting error detection rates.
-
A Novel Approach to Entirely Integrate Virtual Test into Test Development Flow
[p. 797]
-
P. Lu, D. Glaser, G. Uygur and K. Helmreich
In this paper, we present an open architecture
Virtual Test Environment (VTE) which can be easily integrated
into various modularized Automatic Test Systems
(ATS) compliant to Open Standard Architecture (OSA). The
focus of this paper is to analyze and address the major issues
that still prevent the application of Virtual Test (VT) in
day-to-day practice. As a pilot demonstration, a VHDL-AMS
based VTE is established and an ADC test is performed.
The environment is intended to seamlessly interoperate
with the test system during test program development
procedure.
Keywords - Virtual Test, Test generation, Simulation,
Hardware description language, VHDL, ATML, IEEE1641
Moderators: P. Felber, Neuchatel U, CH; C. Schlaeger, AMD, DE
-
Robust Non-Preemptive Hard Real-Time Scheduling for Clustered Multicore Platforms
[p. 803]
-
M. Lombardi, M. Milano and L. Benini
Scheduling task graphs under hard (end-to-end)
timing constraints is an extensively studied NP-hard problem of
critical importance for predictable software mapping on Multiprocessor
System-on-chip (MPSoC) platforms. In this work we
focus on an off-line (design-time) version of this problem, where
the target task graph is known before execution time. We address
the issue of scheduling robustness, i.e. providing hard guarantees
that the schedule will meet the end-to-end deadline in presence of
bounded variations of task execution times expressed as min-max
intervals known at design time. We present a robust scheduling
algorithm that proactively inserts sequencing constraints when
they are needed to ensure that execution will have no inserted idle
times and will meet the deadline for any possible combination of
task execution times within the specified intervals. The algorithm
is complete, i.e. it will return a feasible graph augmentation if
one exists. Moreover, we provide an optimization version of the
algorithm that can compute the shortest deadline that can be
met in a robust way.
-
Efficient OpenMP Support and Extensions for MPSoCs with Explicitly Managed Memory Hierarchy
[p. 809]
-
A. Marongiu and L. Benini
OpenMP is a de facto standard interface of the shared
address space parallel programming model. Recently, there have been
many attempts to use it as a programming environment for embedded
MultiProcessor Systems-On-Chip (MPSoCs). This is due both to the ease
of specifying parallel execution within a sequential code with OpenMP
directives, and to the lack of a standard parallel programming method
on MPSoCs. However, MPSoC platforms for embedded applications
often feature non-uniform, explicitly managed memory hierarchies with
no hardware cache coherency as well as heterogeneous cores with
heterogeneous run-time systems.
In this paper we present an optimized implementation of the compiler
and runtime support infrastructure for OpenMP programming for a
non-cache-coherent distributed memory MPSoC with explicitly managed
scratchpad memories (SPM). The proposed framework features specific
extensions to the OpenMP programming model that leverage explicit
management of the memory hierarchy. Experimental results on different
real-life applications confirm the effectiveness of the optimization in terms
of performance improvements.
-
Using Randomization to Cope with Circuit Uncertainty
[p. 815]
-
H. Safizadeh, M. Tahghighi, E.K. Ardestani, G. Tavasoli and K. Bazargan
Future computing systems will feature many cores
that run fast, but might show more faults compared to existing
CMOS technologies. New software methodologies must be
adopted to utilize communication bandwidth and the computational
power of few slow, reliable cores that could be employed
in such systems to verify the results of the fast, faulty cores.
Employing the traditional Triple Module Redundancy (TMR)
at core instruction level would not be as effective due to its
blind replication of computations. We propose two software
development methods that utilize what we call Smart TMR
(STMR) and fingerprinting to statistically monitor the results
of computations and selectively replicate computations that
exhibit faults. Experimental results show significant speedup
and reliability improvement over traditional TMR approaches.
-
Process Variation Aware Thread Mapping for Chip Multiprocessors
[p. 821]
-
S. Hong, S.H.K. Narayanan, M. Kandemir and O. Ozturk
With the increasing scaling of manufacturing technology,
process variation is a phenomenon that has become more
prevalent. As a result, in the context of Chip Multiprocessors
(CMPs) for example, it is possible that identically-designed processor
cores on the chip have non-identical peak frequencies and power
consumptions. To cope with such a design, each processor can
be assumed to run at the frequency of the slowest processor,
resulting in wasted computational capability. This paper considers
an alternate approach and proposes an algorithm that intelligently
maps (and remaps) computations onto available processors so that
each processor runs at its peak frequency. In other words, by
dynamically changing the thread-to-processor mapping at runtime,
our approach allows each processor to maximize its performance,
rather than simply using chip-wide lowest frequency amongst all
cores and highest cache latency. Experimental evidence shows that,
as compared to a process variation agnostic thread mapping strategy,
our proposed scheme achieves as much as 29% improvement in
overall execution latency, average improvement being 13% over
the benchmarks tested. We also demonstrate in this paper that
our savings are consistent across different processor counts, latency
maps, and latency distributions.
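The contrast between clocking every core at the slowest core's frequency and letting each core run at its own peak can be sketched in Python. This is a static, one-thread-per-core greedy simplification (heaviest thread on the fastest core), not the dynamic remapping algorithm of the paper; the workloads and frequencies are invented.

```python
def uniform_latency(work, freqs):
    """Baseline: every core is clocked at the slowest core's frequency."""
    f_min = min(freqs)
    return max(w / f_min for w in work)


def variation_aware_latency(work, freqs):
    """Greedy: heaviest thread on the fastest core, each core at its own peak."""
    pairs = zip(sorted(work, reverse=True), sorted(freqs, reverse=True))
    return max(w / f for w, f in pairs)


work = [8.0, 6.0, 4.0, 2.0]     # hypothetical work per thread (giga-cycles)
freqs = [2.0, 1.6, 1.25, 1.0]   # hypothetical per-core peak frequency (GHz)
base = uniform_latency(work, freqs)            # 8.0 / 1.0 = 8.0 s
mapped = variation_aware_latency(work, freqs)  # limited by 8.0 / 2.0 = 4.0 s
```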
Moderators: H. Graeb, TU Munich, DE; D. Stroobandt, Ghent U, BE
-
Gate Sizing for Large Cell-Based Designs
[p. 827]
-
S. Held
Today, many chips are designed with predefined
discrete cell libraries. In this paper we present a new fast gate
sizing algorithm that works natively with discrete cell choices
and realistic timing models. The approach iteratively assigns
signal slew targets to all source pins of the chip and chooses
discrete layouts of minimum size preserving the slew targets.
Using slew targets instead of delay budgets, accurate estimates
for the input slews are available during the sizing step. Slew
targets are updated by an estimate of the local slew gradient.
To demonstrate the effectiveness, we propose a new heuristic to
estimate lower bounds for the worst path delay. On average, we
violate these bounds by 6%. A subsequent local search decreases
this gap quickly to 2%. This two-stage approach is capable of
sizing designs with more than 5.8 million cells within 2.5 hours
and thus helps to decrease turn-around times of multi-million
cell designs.
-
Multi-Domain Clock Skew Scheduling-Aware Register Placement to Optimize Clock
Distribution Network
[p. 833]
-
N. MohammadZadeh, M. Mirsaeedi, A. Jahanian and M.S. Zamani
Multi-domain clock skew scheduling is a cost
effective technique for performance improvement. However, the
required wire length and area overhead due to phase shifters for
realizing such a clock schedule may be considerable if registers
are placed without considering assigned skews. Focusing on this
issue, in this paper, we propose a skew scheduling-aware register
placement algorithm that enables clock tree optimization by
considering domains assigned to registers in placement. Our
experimental results show that the proposed approach
remarkably decreases clock wire length and clock network
power consumption at the cost of a slight increase in total wire
length.
-
Decoupling Capacitor Planning with Analytical Delay Model on RLC Power Grid
[p. 839]
-
Y. Tao and S.K. Lim
Decoupling capacitors (decaps) are typically used to reduce
the noise in the power supply network. Because the delay of gates and
interconnects is affected by the supply voltage level, decaps can be used
to improve the circuit performance as well. In this paper, we present
the analytical delay model under IR drop, Ldi/dt noise, and decaps to
study how decaps affect both the gate and interconnect delay. Given a
floorplanning solution, we study how to allocate the whitespace for decap
insertion so that the delay is minimized under the given noise and area
constraint. We employ the Sequential Linear Programming method to
solve the non-linear whitespace allocation problem. Our experimental
results show that intelligent decap allocation makes further delay
reduction possible without adding any additional decap.
-
Package Routability-and IR-Drop-Aware Finger/Pad Assignment in Chip-Package Co-Design
[p. 845]
-
C.-H. Lu, H.-M. Chen, C.-N. J. Liu and W.-Y. Shih
Due to increasing complexity of design interactions
between the chip, package and PCB, it is essential to consider
them at the same time. Specifically the finger/pad locations affect
the performance of the chip and the package significantly. In this
paper, we have developed techniques in chip-package codesign to
decide the locations of fingers/pads for package routability and
signal integrity concerns in chip core design. Our finger/pad
assignment is a two-step method: first we optimize the wire
congestion problem in package routing, and then we try to minimize
the IR-drop violation with finger/pad solution refinement.
The experimental results are encouraging. Compared with the
randomly optimized methods, our approaches reduce on average
42% and 68% of the maximum density in package and 10.61%
of IR-drop for test circuits.
Organizer: W. Mueller, Paderborn U, DE
Moderator: M. di Natale, Scuola S Anna, IT
-
Learning Early-Stage Platform Dimensioning from Late-Stage Timing Verification
[p. 851]
-
K. Richter, M. Jersak and R. Ernst
Today's innovations in the automotive sector are, to a
great extent, based on electronics. The increasing integration
complexity and stringent cost reduction goals turn E/E platform
design into a challenging task. Timing/performance is becoming a
key aspect of architecture design, because the platform must be
dimensioned to provide just the right amount of computing
power and network bandwidth, including reserves for future
extensions, in order to be cost efficient. In other words, it must be
as powerful as needed but as cheap as possible. Finding this sweet
spot is a key challenge. Therefore, OEMs and Tier-1 suppliers are in search
of new methods, processes, and timing analysis techniques that
assist in early platform design stages. In this paper, we
demonstrate how some selected techniques that are established
for verification (in late design stages) can also be used to guide
the design (in early stages). We present examples in the areas of
ECUs (OSEK), buses (CAN, FlexRay) and gated networks. Flow
and applicability aspects are highlighted. As a key result, we
show how we can learn from late-stage verification for
early-stage design. Finally, we also outline future challenges in
the area of multi-core ECUs.
-
The Influence of Real-time Constraints on the Design of FlexRay-based Systems
[p. 858]
-
S. Reichelt, O. Scheickl and G. Tabanoglu
This article describes important challenges regarding the
design, specification and implementation of FlexRay-based
automotive networks. The authors outline a design approach
that especially accounts for timing constraints of the
network, namely end-to-end and cycle timing constraints.
The schedule generation for electronic control units (ECU)
as well as bus entities is addressed and constraint compatibility
with basic FlexRay configuration properties is investigated.
The discussed design approach considers three practical
design challenges of the automotive industry: first, the
function-based cycle timing constraints and their dependency
on basic bus design are presented. Second, the challenge
of distributed development of modern on-board networks
by many different teams and an approach for improving
collaboration are discussed. Finally, the third part describes
the configuration of time-triggered ECU schedules
with respect to different constraint types.
-
Time and Memory Tradeoffs in the Implementation of AUTOSAR Components
[p. 864]
-
A. Ferrari, M. Di Natale, G. Gentile and P. Gai
The adoption of AUTOSAR in the development of
automotive electronics can increase the portability and reuse of
functional components. Inside each component, the behavior is
represented by a set of runnables, defining reactions executed in
response to an event or periodic computations. The implementation
of AUTOSAR runnables in a concurrent program executing
as a set of tasks reveals several issues and trade-offs because
of the need to protect communication and state variables and
to ensure time determinism. We discuss some of these tradeoffs
and options and outline a problem formulation that can be used
to compute the solution with minimum memory requirements
executing within the deadlines.
-
Systolic Like Soft-Detection Architecture for 4x4 64-QAM MIMO System
[p. 870]
-
P. Bhagawat, R. Dash and G. Choi
MIMO systems (with multiple transmit and receive antennas)
are becoming increasingly popular, and many next-generation
systems such as WiMAX, 3-GPP LTE and IEEE802.11n wireless
LANs rely on the increased throughput of MIMO systems with
up to four antennas at receiver and transmitter. High throughput
implementation of the detection unit for MIMO systems is a significant
challenge especially for higher order modulation schemes. To
achieve superior Bit Error Rate (BER) or Frame Error Rate (FER)
performance, the detector has to provide soft values to advanced
Forward Error Correction (FEC) schemes like Turbo Codes. This
paper presents a systolic soft detector architecture for high
dimensional (e.g. 4x4, 64-QAM) MIMO systems. A single detector core
achieves a throughput of 215 Mbps and a power consumption of 23.6 mW,
while using only 33.1K gate equivalents (for the l2 norm). Impressive
SNR gains of almost 2 dB are observed with respect to the hard
detection counterpart over a block fading channel (at an FER of 1%).
Additionally, the architecture can be stacked to give linear increase
in throughput with linear increase in hardware resources.
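The figures quoted above can be turned into an energy-per-bit estimate and a scaling projection with a few lines of Python. The per-core numbers come from the abstract; the assumption that stacking scales throughput and gate count in exact proportion is an idealization of the "linear increase" claim, and the function names are illustrative only.

```python
THROUGHPUT_BPS = 215e6  # per detector core, from the abstract
POWER_W = 23.6e-3       # per detector core, from the abstract
GATES = 33.1e3          # gate equivalents per core, from the abstract


def energy_per_bit(power_w=POWER_W, throughput_bps=THROUGHPUT_BPS):
    """Energy spent per detected bit, in joules."""
    return power_w / throughput_bps


def stacked(n_cores):
    """Throughput and gate count for n stacked cores, assuming ideal linear scaling."""
    return n_cores * THROUGHPUT_BPS, n_cores * GATES


e_bit = energy_per_bit()    # about 1.1e-10 J, i.e. roughly 110 pJ per bit
tput4, gates4 = stacked(4)  # 860 Mbps with 132.4K gate equivalents
```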
-
Co-Simulation Based Platform for Wireless Protocols Design Explorations
[p. 874]
-
A. Fourmique, B. Girodias, G. Nicolescu and E.M. Aboulhamid
Longer range, faster speed and stronger link are today's
wireless mandatory characteristics. Tremendous efforts are
being deployed to create new and improved wireless protocols. However,
these new protocols are being tested in harsh and uncontrolled
environments. Simulation tools help to capture the expected behavior,
but the proposed designs might not work in real life situations
due to lack of accurate simulation models. Testbed platforms are
able to test designs in real life settings, but the flexibility of the
design is reduced and design exploration becomes a complex task.
This paper presents a hybrid platform composed of a simulation tool
and a testbed environment, which makes it possible to easily design and
accurately test new wireless protocols.
-
How To Speed-Up Your NLFSR-Based Stream Cipher
[p. 878]
-
E. Dubrova
Non-Linear Feedback Shift Registers (NLFSRs) have been
proposed as an alternative to Linear Feedback Shift Registers (LFSRs) for
generating pseudo-random sequences for stream ciphers. Conventional
NLFSRs use the Fibonacci configuration in which the feedback is applied
to the last bit only. In this paper, we show how to transform a Fibonacci
NLFSR into an equivalent NLFSR in the Galois configuration, in which
the feedback can be applied to every bit. Such a transformation can
potentially reduce the depth of the circuits implementing feedback
functions, thus decreasing the propagation time and increasing the
throughput.
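A small Python sketch can show what such a Fibonacci-to-Galois transformation looks like on a toy register. The 4-bit NLFSR and its feedback function below are made up for illustration and are not taken from the paper. The product term x2·x3 is moved from the update of the last bit to the update of bit 2, with its variable indices decreased by one, and the initial state is adjusted so that both registers produce the same output sequence.

```python
from itertools import product


def fib_step(state):
    """Fibonacci form: only the last bit has a nontrivial feedback function."""
    x0, x1, x2, x3 = state
    return (x1, x2, x3, x0 ^ x1 ^ (x2 & x3))


def gal_step(state):
    """Galois form: the product term now feeds bit 2 as x1 & x2."""
    y0, y1, y2, y3 = state
    return (y1, y2, y3 ^ (y1 & y2), y0 ^ y1)


def fib_to_gal(state):
    """Map a Fibonacci initial state to the matching Galois initial state."""
    x0, x1, x2, x3 = state
    return (x0, x1, x2, x3 ^ (x1 & x2))


def output_bits(step, state, n):
    """Collect n output bits, taking bit 0 as the register output."""
    bits = []
    for _ in range(n):
        bits.append(state[0])
        state = step(state)
    return bits


# Compare the two registers from every possible initial state.
all_match = all(
    output_bits(fib_step, s, 32) == output_bits(gal_step, fib_to_gal(s), 32)
    for s in product((0, 1), repeat=4)
)
```

In the Fibonacci form a single function combines three terms, while in the Galois form each updated bit combines only two, which is the kind of depth reduction the abstract refers to.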
-
A High Performance Reconfigurable Motion Estimation Hardware Architecture
[p. 882]
-
O. Tasdizen, H. Kukner, A. Akin and I. Hamzaoglu
Motion Estimation (ME) is the most computationally intensive
part of video compression and video enhancement systems. For
the recently available high definition frame sizes and high frame
rates, the computational complexity of full search (FS) ME
algorithm is prohibitively high, while the PSNR obtained by fast
search ME algorithms is low. Therefore, in this paper, we propose
a new ME algorithm and a high performance reconfigurable
systolic ME hardware architecture for efficiently implementing
this algorithm. The proposed ME algorithm performs up to three
different granularity search iterations in different size search
ranges based on the application requirements. Simulation results
showed that the proposed ME algorithm performs very close to FS
algorithm, even though it searches much fewer search locations
than FS algorithm. It outperforms successful fast search ME
algorithms by searching more search locations than these
algorithms. The proposed reconfigurable ME hardware is
implemented in VHDL and mapped to a low cost Xilinx
XC3S1500-5 FPGA. It works at 130MHz and is capable of
processing high definition and high frame rate video formats in
real time. Therefore, it can be used in flat panel displays for frame
rate conversion and de-interlacing, and in video encoders.
-
Partition-Based Exploration for Reconfigurable JPEG Designs
[p. 886]
-
P.G Potter, W. Luk and P. Cheung
This paper proposes a novel approach for design
space exploration by characterizing hardware sharing
based on the notion of a partition in set theory. Related
designs with different degrees of hardware sharing
can be captured concisely by a Hasse diagram, highlighting
designs with shared building blocks. Hardware
sharing can be implemented in various ways, such as
component multiplexing, instruction-set processors, or
run-time reconfiguration. We illustrate how the proposed
approach can be applied to exploring the design
space for FPGA implementations of JPEG image compression.
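The idea of describing hardware sharing by set partitions can be sketched in Python: each partition of a set of functional blocks stands for one design, where blocks in the same part share a hardware unit, and the Hasse diagram links each design to those obtained by merging two of its units. The block names are hypothetical placeholders and the code is only an illustration of the set-theoretic notion, not the exploration method of the paper.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        yield [[first]] + part          # `first` gets its own hardware unit
        for i in range(len(part)):      # or shares an existing unit
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def canonical(partition):
    """Order-independent representation of a partition."""
    return frozenset(frozenset(block) for block in partition)


def covers(coarse, fine):
    """True if `coarse` results from merging exactly two blocks of `fine`."""
    if len(coarse) != len(fine) - 1:
        return False
    return all(any(b <= c for c in coarse) for b in fine)


blocks = ["dct", "quantize", "encode"]  # hypothetical functional blocks
designs = [canonical(p) for p in set_partitions(blocks)]
hasse_edges = [(f, c) for f in designs for c in designs if covers(c, f)]
```

For three blocks this yields five designs, from no sharing at all to a single shared unit, connected by six Hasse diagram edges.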
-
Automated Synthesis of Streaming C Applications to Process Networks In Hardware
[p. 890]
-
S. van Haastregt and B. Kienhuis
The demand for embedded computing power is continuously
increasing and FPGAs are becoming very interesting
computing platforms, as they provide huge amounts of customizable
parallelism. However, programming them is challenging, let
alone from a high level language. In [1], the ESPAM methodology
was already presented to quickly obtain realizations on FPGAs
from sequential C code. The realization consists of a network
of processors and IP cores. In this approach, a problem was
that the IP cores had to be provided manually. In this paper, we
present an extension on the ESPAM methodology by incorporating
the industrial high level synthesis tool PICO from Synfora Inc.
In this way, we realize the automated generation of efficient
hardware implementations on FPGAs from a single sequential C
input specification of a streaming application. We demonstrate
our approach for the Sobel and QR applications.
-
Distributed Sensor For Steering Wheel Grip Force Measurement In Driver Fatigue Detection
[p. 894]
-
F. Baronti, F. Lenzi, R. Roncella and R. Saletti
This paper presents a low-cost and simple distributed
force sensor that is particularly suitable for measuring
grip force and hand position on a steering wheel. The sensor can
be used in automotive active safety systems that aim at detecting
driver fatigue, a major concern in preventing road accidents.
The key point of our approach is to design a chain of sensor units,
each of them provided with some intelligence and general purpose
capabilities, so that it can serve as a platform for integrating
different kinds of sensors into the steering wheel. A proof-of-concept
demonstration of the distributed sensor consisting of 16
units based on capacitive sensing elements has been realised and
preliminary results are presented.
-
Making DNA Self-Assembly Error-Proof: Attaining Small Growth Error Rates through Embedded
Information Redundancy
[p. 898]
-
S. Garcia and A. Orailoglu
DNA self-assembly is emerging as the most promising
technique for nanoscale self-assembly as it uses the simple,
yet precise rules of DNA binding to create macroscale assemblies
from nanoscale components. However, DNA self-assembly is also
highly error-prone and requires the use of error-resilience techniques
in order to unlock its potential. In this paper we propose
a technique for error-resilience that is based on information
redundancy but, in contrast to previous information redundancy
schemes, can achieve much higher resilience to growth errors. By
expanding the neighborhood from which redundant information
is taken, we can extend the distance that errors are propagated
and therefore increase the likelihood of the error being reversed.
Given a growth error rate of ε, we show that with a neighborhood
of only 2 we can reduce the error rate to ε^3.64 for arbitrary
functions (as compared to ε^2.33 previously achieved). Compared
with spatial redundancy approaches, our technique allows for
higher density nanostructures and has a greatly reduced assembly
time.
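To give a feel for the two error-rate figures, read as ε raised to the powers 3.64 and 2.33, the following Python sketch evaluates them for an example growth error rate. The value of ε is a hypothetical choice for illustration and does not come from the paper.

```python
def reduced_error_rate(eps, exponent):
    """Error rate after applying a redundancy scheme that achieves eps ** exponent."""
    return eps ** exponent


eps = 0.01                                # hypothetical raw growth error rate
previous = reduced_error_rate(eps, 2.33)  # about 2.2e-5
proposed = reduced_error_rate(eps, 3.64)  # about 5.2e-8
improvement = previous / proposed         # eps ** -1.31, a factor of about 417
```

Because the gain is a power of ε, the advantage of the larger exponent grows as the raw error rate gets smaller.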
-
Machine Learning-Based Volume Diagnosis
[p. 902]
-
S. Wang and W. Wei
In this paper, a novel diagnosis method is proposed. The
proposed technique uses machine learning techniques instead
of traditional cause-effect and/or effect-cause analysis. The
proposed technique has several advantages over traditional diagnosis
methods, especially for volume diagnosis. In the proposed
method, since the time consuming diagnosis process is
reduced to merely evaluating several decision functions, run
time complexity is much lower than that of traditional diagnosis methods.
The proposed technique can provide not only high resolution
diagnosis but also statistical data by classifying defective
chips according to locations of their defects. Even with
highly compressed output responses, the proposed diagnosis
technique can correctly locate defects for most defective
chips. The proposed technique correctly located defects
for more than 90% (86%) of defective chips at 50x (100x) output
compaction. Run time for diagnosing a single simulated
defective chip was only tens of milliseconds.
-
Adaptive Idleness Distribution for Non-Uniform Aging Tolerance in Multiprocessor Systems-on-Chip
[p. 906]
-
F. Paterna, L. Benini, A. Acquaviva, F. Papariello, G. Desoli and M. Olivieri
In deep submicron designs of MultiProcessor Systems-on-Chip (MPSoC) architectures,
uncompensated within-die process variations and aging effects will lead to an
increasing uncertainty and unbalancing of expected core lifetimes. In this paper
we present an adaptive workload allocation strategy for run-time compensation
of variation- and aging-induced unbalanced core lifetimes by means of core
activity duty cycling. The proposed technique regulates the percentage of
idle time on short-expected-life cores to meet the platform lifetime target with
minimum performance degradation.
Experiments have been conducted on a multiprocessor simulator of a next-generation
industrial MPSoC platform for multimedia applications made of a general
purpose processor and programmable accelerators.
Organizer: G. Schreiner, The MathWorks GmbH, DE
Moderator: E. Schubert, ESIC GmbH, DE
Panelists: A. Jantsch, P. Urard, F. Schirrmeister, P. Mosterman, L. Le-Toumelin and C. Engblom
-
-
With the increasing complexity of designs, the requirement for flexibility is also growing. This adds the aspect of
programmability to SoC designs. A typical SoC decomposes a system into components which are individually
specified. These components are in a pre-existing form that satisfies the specification or are custom-made. With the
needed flexibility, the decision for components to be hardwired, programmable, or software-based needs to be pushed
to the end of the design phase. The most desirable situation is that the composition of these components results in
the expected system behaviour. The rule, however, is that significant system integration effort is required to make
the composition of components operate as intended. To a large extent, this is because of cross-cutting concerns that
result from parafunctional characteristics often associated with the integration platform. Ideally, components should
be composable (i.e., their properties should not change when connected to other components) and the system should
be compositional (i.e., emergent system properties should be derivable from the component properties). Reality is
far removed from this situation.
Possible questions to the panel are ... What is the right language for the development? Why is the integration so
difficult? What do we need to do to remedy this? What do we need to do to support the architect's vision? How
should we describe systems so they can be more easily built in today's systems-of-systems world? If we need to
generate tests from specifications, what form should they be in? Why hasn't the concept of an executable
specification caught on?
Moderators: A. Macii, Politecnico di Torino, IT; T. Ishihara, Kyushu U, JP
-
Process Variation Aware SRAM/Cache for Aggressive Voltage-Frequency Scaling
[p. 911]
-
A. Sasan (M.A. Makhzan), H. Homayoun, A. Eltawil and F. Kurdahi
This paper proposes a novel Process Variation Aware SRAM
architecture designed to inherently support voltage scaling. The
peripheral circuitry of the SRAM is modified to selectively allow
overdriving a wordline which contains weak cell(s). This architecture
allows reducing the power on the entire array; however it selectively
trades power for correctness when rows containing weak cells are
accessed. The cell sizing is designed to assure successful read
operations. This avoids flipping the content of the cells when the
wordline is overdriven. Our simulations report 23% to 30%
improvement in cell access time and 31% to 51% improvement in cell
write time in overdriven wordlines. Total area overhead is negligible
(4%). Low voltage operation achieves more than 40% reduction in
dynamic power consumption and approximately 50% reduction in
leakage power consumption.
-
Single Ended 6T SRAM with Isolated Read-Port for Low-Power Embedded Systems
[p. 917]
-
J. Singh, D.K. Pradhan, S. Hollis, S.P. Mohanty and J. Mathew
This paper presents a six-transistor (6T) single-ended
static random access memory (SE-SRAM) bitcell with
an isolated read-port, suitable for low-VDD and low-power
embedded applications. The proposed bitcell has a better static
noise margin (SNM) and write-ability compared to a standard
6T bitcell and equivalent to an 8T bitcell [1]. An 8Kbit SRAM
module with the proposed and standard 6T bitcells is simulated,
including full blown parasitics using BPTM, 65nm CMOS
technology node to evaluate and compare different performance
parameters. The active power dissipation in the proposed 6T
design is 28% and 25% less, compared to standard 6T and 8T
SRAM modules respectively.
-
System-Level Power/Performance Evaluation of 3D Stacked Drams for Mobile Applications
[p. 923]
-
M. Facchini, T. Carlson, A. Vignon, M. Palcovic, F. Catthoor, W. Dehaene, L. Benini
and P. Marchal
Convergence of communication, consumer applications
and computing within mobile systems pushes memory
requirements in terms of size, bandwidth and power
consumption. The existing solution for the memory bottleneck
is to increase the amount of on-chip memory. However,
this solution is becoming prohibitively expensive, allowing
3D stacked DRAM to become an interesting alternative for
mobile applications. In this paper, we examine the
power/performance benefits for three different 3D stacked
DRAM scenarios. Our high-level memory and Through Silicon
Via (TSV) models have been calibrated on state-of-the-art
industrial processes. We model the integration of a logic
die with TSVs on top of both an existing DRAM and a
DRAM with redesigned transceivers for 3D. Finally, we take
advantage of the interconnect density enabled by 3D technology
to analyze an ultra-wide memory interface. Experimental
results confirm that TSV-based 3D integration is a
promising technology option for future mobile applications,
and that its full potential can be unleashed by jointly optimizing
memory architecture and interface logic.
-
A Novel DRAM Architecture as a Low Leakage Alternative for SRAM Caches in a 3D
Interconnect Context
[p. 929]
-
A. Vignon, S. Cosemans, W. Dehaene, P. Marchal and M. Facchini
This paper presents a DRAM architecture that
improves the DRAM performance/power trade-off to increase
their usability on low power chip design using 3D interconnect
technology. The use of a finer matrix subdivision and buffering
the bitline signal at the local block level reduces both the
energy per access and the access time. The obtained performance
matches that of a typical low power SRAM, while achieving a
significant area and static power reduction compared to these
memories.
The 128 kb memory architecture proposed here achieves an
access time of 1.3 ns for a dynamic energy of less than 0.2 pJ
per bit. A localized refresh mechanism allows gaining a factor
of 10 in static power consumption associated with the cell, and
a factor of 2 in area, when compared with an equivalent SRAM.
Moderators: L. Anghel, TIMA Laboratory, FR; M. Coppola, STMicroelectronics, FR
-
A Case for Multi-Channel Memories in Video Recording
[p. 934]
-
E. Aho, J. Nikara, P.A. Tuominen and K. Kuusilinna
In video recording, ever increasing demands on
image resolution, frame rate, and quality necessitate a lot of
memory bandwidth and energy. This paper presents and
evaluates such a potential memory load in future handheld
multimedia devices. Based on the achieved simulation results,
the multi-channel memories provide the capability for high
bandwidth without excessive overhead in terms of energy
consumption. A full HDTV (1080p) quality video recording
with H.264/AVC encoding at 30 frames per second (fps) is
found here to require 4.3 GB/s memory bandwidth. According
to the simulations, this memory requirement can be fulfilled
with four 32-bit memory channels operating at 400 MHz and
consuming 345 mW of power. As another example, 400 MHz 8-channel
memory configuration is able to provide the required
bandwidth for video recording with up to 3840x2160@30 fps.
Die stacking is the technology thought to be able to provide the
required bandwidth, sufficiently low power consumption, and
the multi-channel memory organization.
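The bandwidth figures above can be sanity-checked with a back-of-the-envelope calculation. The sketch below computes only the theoretical peak bandwidth of a multi-channel memory. The abstract does not state the number of transfers per clock, so both one and two transfers per clock are shown as assumptions, and 1 GB is taken as 10^9 bytes. The function name is illustrative.

```python
def peak_bandwidth_bytes_per_s(channels, width_bits, clock_hz, transfers_per_clock=1):
    """Theoretical peak bandwidth of a multi-channel memory, in bytes per second."""
    bytes_per_transfer = width_bits // 8
    return channels * bytes_per_transfer * clock_hz * transfers_per_clock

# Requirement quoted in the abstract for 1080p H.264/AVC recording at 30 fps,
# assuming 1 GB = 10**9 bytes.
REQUIRED_1080P = 4.3e9

# Four 32-bit channels at 400 MHz, as in the abstract.
# transfers_per_clock is an assumption: 1 = single data rate, 2 = double data rate.
four_ch_sdr = peak_bandwidth_bytes_per_s(4, 32, 400_000_000, transfers_per_clock=1)
four_ch_ddr = peak_bandwidth_bytes_per_s(4, 32, 400_000_000, transfers_per_clock=2)

print(f"single data rate peak: {four_ch_sdr / 1e9:.1f} GB/s")
print(f"double data rate peak: {four_ch_ddr / 1e9:.1f} GB/s")
print("meets 1080p requirement:", four_ch_sdr >= REQUIRED_1080P)
```

Under either assumption the peak exceeds the quoted 4.3 GB/s. Sustained bandwidth on real memories is lower than the peak, which is what the simulations in the paper address.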
-
High Level H.264/AVC Video Encoder Parallelization for Multiprocessor Implementation
[p. 940]
-
H.K. Zrida, A. Jemai, A.C. Ammari and M. Abid
H.264/AVC (Advanced Video Codec) is a new video
coding standard developed by a joint effort of the ITU-T VCEG
and ISO/IEC MPEG. This standard provides higher coding
efficiency relative to former standards at the expense of higher
computational requirements. Implementing the H.264 video
encoder for an embedded System-on-Chip (SoC) is a big
challenge. For an efficient implementation, we motivate the use of
multiprocessor platforms for the execution of a parallel model of
the encoder. In this paper, we propose a high-level independent
target-architecture parallelization methodology for the
development of an optimized parallel model of a H.264/AVC
encoder (i.e. a processes network model balanced in
communication and computation workload).
-
Temperature-Aware Scheduler Based on Thermal Behavior Grouping in Multicore Systems
[p. 946]
-
I. Yeo and E.J. Kim
Dynamic Thermal Management techniques have
been widely accepted as a thermal solution for their low cost
and simplicity. The techniques have been used to manage the
heat dissipation and operating temperature to avoid thermal
emergencies, but are not aware of application behavior in
Chip Multiprocessors (CMPs). In this paper, we propose a
temperature-aware scheduler based on applications' thermal
behavior groups classified by a K-means clustering method in
multicore systems. The application's thermal behavior group has
similar thermal pattern as well as thermal parameters. With these
thermal behavior groups, we provide thermal balances among
cores with negligible performance overhead. We implement and
evaluate our schemes in the 4-core (Intel Quad Core Q6600) and
8-core (two Quad Core Intel XEON E5310 processors) systems
running several benchmarks. The experimental results show that
the temperature-aware scheduler based on thermal behavior
grouping reduces the peak temperature by up to 8°C and 5°C in
our 4-core system and 8-core system with only 12% and 7.52%
performance overhead, respectively, compared to the standard Linux
scheduler.
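As an illustration of the grouping step only, the following minimal sketch clusters applications by thermal features with plain k-means (Lloyd's algorithm). The feature choice (average temperature and heating rate), the application names and all values are hypothetical, and the deterministic seeding is a simplification. The abstract does not describe the features or the scheduling policy the authors use.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means on tuples of floats; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

# Hypothetical per-application features: (average temperature in degrees C, heating rate).
apps = {
    "app_a": (48.0, 0.2),
    "app_b": (50.0, 0.3),
    "app_c": (71.0, 1.1),
    "app_d": (69.0, 0.9),
}
assignment, centroids = kmeans(list(apps.values()), k=2)
groups = dict(zip(apps, assignment))
print(groups)
```

A scheduler could then use the group label of each application when deciding which core to place it on. That policy is outside this sketch.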
-
Hardware/Software Co-design Architecture for Thermal Management of Chip Multiprocessors
[p. 952]
-
O. Khan and S. Kundu
The sustained push for performance, transistor
count, and instruction level parallelism has reached a point
where chip level power density issues are at the forefront of
design constraints. Many high performance computing
platforms are integrating several homogeneous or
heterogeneous processing cores on the same die to fit small
form factors. Due to the design limitations of using expensive
cooling solutions, such complex chip multiprocessors require
an architectural solution to mitigate thermal problems.
Many of the current systems deploy Dynamic Voltage and
Frequency Scaling (DVFS) to address thermal emergencies,
either within the Operating System or hardware. These
techniques have certain limitations in terms of response lag,
scalability, cost and being reactive. In this paper, we present
an alternative thermal management system to address these
limitations, based on hardware/software co-design
architecture. The results show that in the 65nm technology, a
predictive, targeted, and localized response to thermal events
improves a quad-core performance by an average of 50%
over conventional chip-level DVFS.
Moderators: F. Ferrandi, Politecnico di Milano, IT; C. Passerone, Politecnico di Torino, IT
-
Cross-Architectural Design Space Exploration Tool for Reconfigurable Processors
[p. 958]
-
L. Bauer, M. Shafique and J. Henkel
Processors that deploy fine-grained reconfigurable
fabrics to implement application-specific accelerators on-demand
have obtained significant attention within the last decade.
They trade off the flexibility of general-purpose processors
with the performance of application-specific circuits without
tailoring the processor towards a specific application domain
like Application Specific Instruction Set Processors (ASIPs).
Vast amounts of reconfigurable processors have been proposed,
differing in multifarious architectural decisions. However,
it has always been an open question, which of the proposed
concepts is more efficient in certain application and/or
parameter scenarios. Various reconfigurable processors were
investigated in certain scenarios, but never before has a systematic
design space exploration across diverse reconfigurable processor
concepts been conducted with the aim of aiding a designer
of a reconfigurable processor.
We have developed a first-of-its-kind comprehensive design
space exploration tool that allows to systematically explore
diverse reconfigurable processors and architectural parameters.
Our tool allows presenting the first cross-architectural
design space exploration of multiple fine-grained reconfigurable
processors on a fair comparable basis. After categorizing
fine-grained reconfigurable processors and their relevant parameters,
we present our tool and an in-depth analysis of reconfigurable
processors within different relevant scenarios.
-
Automatically Mapping Applications to a Self-Reconfiguring Platform
[p. 964]
-
K. Bruneel, F. Abouelella and D. Stroobandt
The inherent reconfigurability of FPGAs enables us
to optimize an FPGA implementation in different time intervals
by generating new optimized FPGA configurations and reconfiguring
the FPGA at the interval boundaries. With conventional
methods, generating a configuration at run-time requires an
unacceptable amount of resources. In this paper, we describe
a tool flow that can automatically map a large set of applications
to a self-reconfiguring platform, without an excessive need for
resources at run-time. The self-reconfiguring platform is implemented
on a Xilinx Virtex-II Pro FPGA and uses the FPGA's
PowerPC as configuration manager. This configuration manager
generates optimized configurations on-the-fly and writes them
to the configuration memory using the ICAP. We successfully
used our approach to implement an adaptive 32-tap FIR filter
on a Xilinx XUP board. This resulted in a 40% reduction in
FPGA resources compared to a conventional implementation and
a manageable reconfiguration overhead.
-
OSSS+R: A Framework for Application Level Modelling and Synthesis of Reconfigurable Systems
[p. 970]
-
A. Schallenberg, W. Nebel, A. Herrholz, P.A. Hartmann and F. Oppenheimer
Dynamic Partial Reconfiguration (DPR) is a promising
technology ready for use, enabling the design of more flexible
and efficient systems. However, existing design flows
for DPR are either low-level and complex or lack support
for automatic synthesis. In this paper, we present a SystemC
based modelling and synthesis flow using the OSSS+R
framework for reconfigurable systems. Our approach addresses
reconfiguration already on application level enabling
early exploration and analysis of the effects of DPR.
Moreover it also allows quick implementation of such systems
using our automatic synthesis flow. We demonstrate
our approach using an educational example.
-
Design Optimizations to Improve Placeability of Partial Reconfiguration Modules [p. 976]
-
M. Koester, W. Luk, J. Hagemeyer and M. Porrmann
In partially reconfigurable architectures, system
components can be dynamically loaded and unloaded allowing
resources to be shared over time. This paper focuses on the
relation between the design options of partial reconfiguration
modules and their placement at run-time. For a set of dynamic
system components, we propose a design method that optimizes
the feasible positions of the resulting partial reconfiguration
modules to minimize position overlaps. We introduce the concept
of subregions, which guarantees the parallel execution of a
certain number of partial reconfiguration modules for tiled
reconfigurable systems. Experimental results, which are based
on a Xilinx Virtex-4 implementation, show that at run-time the
average number of available positions can be increased up to 6.4
times and the number of placement violations can be reduced
up to 60.6%.
Moderators: S. Kajihara, Kyushu Institute of Technology, JP; A. Virazel, LIRMM, FR
-
Automated Data Analysis Solutions to Silicon Debug
[p. 982]
-
Y.-S. Yang, N. Nicolici and A. Veneris
Since pre-silicon functional verification is insufficient to detect all design
errors, re-spins are often needed due to malfunctions that escape
into the silicon. This paper presents an automated software solution to
analyze the data collected during silicon debug. The proposed methodology
analyzes the test sequences to detect suspects in both the spatial
and the temporal domain. A set of software debug techniques are proposed
to analyze the acquired data from the hardware testing and provide
suggestions for the setup of the test environment in the next debug
session. A comprehensive set of experiments demonstrate its effectiveness
in terms of run-time and resolution.
-
Efficient and Accurate Method for Intra-gate Defect Diagnoses in Nanometer Technology and
Volume Data
[p. 988]
-
A. Ladhar, M. Masmoudi and L. Bouzaida
Improving diagnosis resolution becomes very important
in nanometer technology. Nowadays, defects are affecting
gate and transistor level. In this paper, we present a new
method for volume diagnosis of intra-gate defects affecting
standard cell Integrated Circuits (ICs). Our method can
identify the cause of failure of different intra-gate defects
such as bridge, open and resistive-open defects. Our
method gives accurate results since it is based on the use
of physical information extracted from library cells
layout. Our method can also locate intra-gate defects in
presence of multiple faults. Experimental results show the
efficiency of our approach to isolate injected defects on
industrial designs.
-
Selection of a Fault Model for Fault Diagnosis Based on Unique Responses
[p. 994]
-
I. Pomeranz and S.M. Reddy
We describe a preprocessing step to fault
diagnosis of an observed response obtained from a
faulty chip. In this step, a fault model for diagnosing
the observed response is selected. This step allows
fault diagnosis to be performed based on a single fault
model after identifying the most appropriate one. We
describe a specific implementation of this preprocessing
step based on what is referred to as the unique output
response of a fault model. As an example, we apply
it to the diagnosis of multiple stuck-at faults, selecting
between single and double stuck-at faults as the fault
model for diagnosis. Experimental results demonstrate
improvements compared to diagnosis based on
single stuck-at faults, and compared to diagnosis based
on both single and double stuck-at faults.
-
Improving Compressed Test Pattern Generation for Multiple Scan Chain Failure Diagnosis
[p. 1000]
-
X. Tang, R. Guo, W.-T. Cheng and S.M. Reddy
To reduce test data volumes, encoded tests and
compacted test responses are widely used in industry. Use of
test response compaction negatively impacts fault diagnosis
since the errors in responses due to defects which are
captured in scan cells are not directly observed. We propose
a simple and effective way to enhance the diagnostic
resolution achievable by production tests with minimal
increase in pattern counts. In this work we present
experimental results for the case of multiple scan chain
faults to demonstrate the effectiveness of the proposed
method.
Moderators: S. Hutcheson, Rolls-Royce, UK; W. Ecker, Infineon Technologies, DE
-
A Case Study in Distributed Deployment of Embedded Software for Camera Networks
[p. 1006]
-
F. Leonardi, A. Pinto and L.P. Carloni
We present an embedded software application for
the real-time estimation of building occupancy using a network
of video cameras. We analyze a series of alternative decompositions
of the main application tasks and profile each of
them by running the corresponding embedded software on three
different processors. Based on the profiling measures, we build
various alternative embedded platforms by combining different
embedded processors, memory modules and network interfaces.
In particular, we consider the choice of two possible network
technologies: ARCnet and Ethernet. After deriving an analytical
model of the network costs, we use it to complete an exploration
of the design space as we scale the number of video cameras
in a hypothetical building. We compare our results with those
obtained for two real buildings of different characteristics. We
conclude discussing the results of our case study in the broader
context of other camera-network applications.
-
pTest: An Adaptive Testing Tool for Concurrent Software on Embedded Multicore Processors
[p. 1012]
-
S.-W. Chang, K.-Y. Hsieh and J.K. Lee
More and more processor manufacturers have
launched embedded multicore processors for consumer electronics
products because such processors provide high performance
and low power consumption to meet the requirements of mobile
computing and multimedia applications. To effectively utilize
the computing power of multicore processors, software designers
are interested in using concurrent processing for such architectures. The
master-slave model is one of the popular programming models
for concurrent processing. Even if it is a simple model, the
potential concurrency faults and unreliable slave systems still
lead to anomalies of entire system. In this paper, we present an
adaptive testing tool called pTest to stress test a slave system and
to detect the synchronization anomalies of concurrent software in
the master-slave systems on embedded multicore processors. We
use a probabilistic finite-state automaton (PFA) to model the test
patterns for stress testing and show how a PFA can be applied
to pTest in practice.
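To make the idea of a PFA-driven test pattern concrete, here is a minimal sketch that walks a probabilistic finite-state automaton and emits a sequence of commands for a slave system. The states, commands and probabilities are invented for illustration and are not taken from pTest.

```python
import random

# Hypothetical PFA: each state is a test command; each edge carries a
# transition probability. The probabilities out of a state sum to 1.
PFA = {
    "start":   [("create", 1.0)],
    "create":  [("send", 0.7), ("suspend", 0.3)],
    "send":    [("send", 0.5), ("suspend", 0.2), ("delete", 0.3)],
    "suspend": [("resume", 1.0)],
    "resume":  [("send", 0.6), ("delete", 0.4)],
    "delete":  [],  # terminal state
}

def generate_pattern(pfa, rng, start="start", max_len=50):
    """Random walk over the PFA; returns the list of visited commands."""
    state, pattern = start, []
    while pfa[state] and len(pattern) < max_len:
        next_states, weights = zip(*pfa[state])
        state = rng.choices(next_states, weights=weights, k=1)[0]
        pattern.append(state)
    return pattern

# A seeded generator makes a failing pattern reproducible.
pattern = generate_pattern(PFA, random.Random(0))
print(pattern)
```

Adjusting the edge probabilities steers the walk towards command sequences that are more likely to expose synchronization anomalies, which is one way such a tool could be made adaptive.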
-
A Generic Platform for Estimation of Multi-Threaded Program Performance on Heterogeneous
Multiprocessor
[p. 1018]
-
A. Sahu, M. Balakrishnan and P.R. Panda
This paper deals with a methodology for software
estimation to enable design space exploration of heterogeneous
multiprocessor systems. Starting from fork-join representation
of application specification along with high level description of
multiprocessor target architecture and mapping of application
components onto architecture resource elements, it estimates the
performance of application on target multiprocessor architecture.
The methodology proposed includes the effect of basic compiler
optimizations, integrates light weight memory simulation and instruction
mapping for complex instruction to improve the accuracy
of software estimation. To estimate performance degradation due
to contention for shared resources like memory and bus, synthetic
access traces coupled with an interval analysis technique are employed.
The methodology has been validated on a real heterogeneous
platform. Results show that using estimation it is possible to predict
performance with average errors of around 11%.
-
Networked Embedded System Applications Design Driven by an Abstract Middleware Environment
[p. 1024]
-
F. Fummi, G. Perbellini and N. Roncolato
The extreme heterogeneity of networked embedded platforms
makes both design and reuse of applications really
hard. These facts decrease portability. A middleware is
the software layer that allows to abstract the actual characteristics
of each embedded platform. Using a middleware
decreases the difficulty in designing applications, but
programming for different middlewares is still a barrier
to portability. This paper presents a design methodology
based on an abstract middleware environment that allows
to abstract even the services provided. This is gained by
allowing the designer to smoothly move across different design
paradigms. As a proof, the paper shows how to mix and
exchange applications between tuple-space and message-oriented
based middleware environments.
Organizers/Moderators: G. Gielen, KU Leuven, BE; W. Eberle, IMEC, BE
-
Health-Care Electronics: The Market, the Challenges, the Progress
[p. 1030]
-
W. Eberle, A.S. Mecheri, T.K. T. Nguyen, G. Gielen, R. Campagnolo, A. Burdett, C. Toumazou
and B. Volckaerts
Exploding health care demands and costs of aging and
stressed populations necessitate the use of more in-home
monitoring and personalized health care. Electronics hold great
promise to improve the quality and reduce the cost of health care.
The speakers in this hot topic session will discuss the field of
health care electronics from all aspects. First, the market of
health care electronics is described, and realities, trends and
hypes will be pointed out. The second presentation describes the
engineering challenges in ultra low-power disposable electronics
for wireless body sensor applications. Both the sensor aspects, the
related signal processing, and business models will be discussed.
The third presentation talks about embedded bio-stimulation
applications in cochlea implants, thereby highlighting the design
challenges in terms of power consumption and extreme reliability
of these devices. The final presentation discusses the application
of brain stimulation and recording with respect to artifact
reduction and field steering, and describes aspects of the
modeling and design strategy. In this way, this hot-topic session
offers the attendees a complete picture of the field of health-care
electronics, ranging from the business to the technological and
design aspects.
Keywords-health-care, medical electronics, implants, embedded
SoC, wireless body sensor networks, neural stimulation, electrical
field modeling, FEM.
Moderators: C. Heer, Infineon Technologies, DE; L. Fanucci, Pisa U, IT
-
Design and Implementation of Scalable, Transparent Threads for Multi-Core Media Processor
[p. 1035]
-
T. Kodaka, S. Sasaki, T. Tokuyoshi, R. Ohyama, N. Nonogaki, K. Kitayama, T. Mori, Y. Ueda,
H. Arakida, Y. Okuda, T. Kizu, Y. Tsuboi and N. Matsumoto
In this paper, we propose a scalable and transparent
parallelization scheme using threads for multi-core processor. The
performance achieved by our scheme is scalable to the number of
cores, and the application program is not affected by the actual
number of cores.
For the performance efficiency, we designed the threads so that
they do not suspend and that they do not start their execution
until the data necessary for them are available. We implemented
our design using three modules: the dependency controller, which
controls dependencies among threads, the thread pool, which
manages the ready threads, and the thread dispatcher, which
fetches threads from the pool and executes them on the core.
Our design and implementation provide efficient thread
scheduling with low overhead. Moreover, by hiding the actual
number of cores, it realizes transparency. We confirmed the
transparency and scalability of our scheme by applying it to
the H.264 decoder program. With this scheme, modification of
application program is not necessary even if the number of cores
changes due to disparate requirements. This feature makes the
developing time shorter and contributes to the reduction of the
developing cost.
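The three modules named above can be pictured with a small simulation. This is a sketch under simplifying assumptions, not the authors' implementation: threads run to completion in lock-step rounds, and the thread names and dependency graph are hypothetical. The point it illustrates is that the application description (the dependency map) stays the same when the number of cores changes.

```python
from collections import deque

def run_threads(deps, num_cores):
    """Simulate run-to-completion threads; deps maps thread -> set of threads it waits for.

    Returns the schedule as a list of rounds, each round listing the threads
    that executed in parallel on the available cores.
    """
    # Dependency controller state: unmet dependencies per waiting thread.
    pending = {t: set(d) for t, d in deps.items()}
    # Thread pool: threads whose input data are already available.
    pool = deque(sorted(t for t, d in pending.items() if not d))
    for t in pool:
        del pending[t]

    schedule = []
    while pool:
        # Thread dispatcher: each core fetches one ready thread from the pool.
        batch = [pool.popleft() for _ in range(min(num_cores, len(pool)))]
        schedule.append(batch)
        # On completion, the dependency controller releases newly ready threads.
        for done in batch:
            for waiting in pending.values():
                waiting.discard(done)
        for t in sorted(t for t, d in pending.items() if not d):
            del pending[t]
            pool.append(t)
    if pending:
        raise ValueError("cyclic or unsatisfiable dependencies")
    return schedule

# Hypothetical decoder-like dependency graph.
deps = {"parse": set(), "mb0": {"parse"}, "mb1": {"parse"}, "deblock": {"mb0", "mb1"}}
one_core = run_threads(deps, num_cores=1)
two_cores = run_threads(deps, num_cores=2)
print(one_core)
print(two_cores)
```

Because a thread enters the pool only once all of its inputs exist, it never has to suspend mid-execution, which mirrors the design goal stated in the abstract.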
-
High Data Rate Fully Flexible SDR Modem
[p. 1040]
-
F. Kasperski, O. Pierrelee, F. Dotto and M. Sarlotte
With the multiplication of mobile and wireless
communication networks and standards, the physical layer of
communication systems (i.e. the modem part of the system) has to
be completely flexible. This assumption leads to the well known
Software Defined Radio concept which enables the
implementation and the deployment of different waveforms on
the same platform. This concept has been widely investigated
since the early 2000's mainly for processors and Sw approach but
less for reconfigurable Hw or DSP implementation. This paper
deals with a specific architecture and an innovative design
methodology which were designed within the framework of a
fully flexible high data rate Software Defined Radio wireless
modem. This approach is focused on the waveform part of the
system and its goal is to reach a fully flexible physical layer. In
case of modem evolutions or upgrades, it avoids
significant rework and extra cost in terms of waveform
development. Moreover, the association of the right architecture
and the right methodology makes it possible to master and manage the
complexity of the modem (which presents several hundred
available configurations with different kinds of parameters) and
to provide the needed flexibility. The development
methodology is based on a C/C++ approach which allows
all the parameters to be managed at a system level. The architecture
coupled with this development methodology offers a high level of
modularity, so the waveform can easily be modified simply by
replacing blocks with other blocks. The efficiency and the flexibility
of the modem are then obtained by designing not a single
waveform but a family of waveforms.
Keywords - Physical layer, high data rate modem, mobile and
wireless communication, Software Defined radio (SDR), flexibility,
modularity, architecture, high level synthesis, FPGA.
-
Cross-Coupling in 65nm Fully Integrated EDGE System on Chip - Design and Cross-Coupling
Prevention of Complex 65nm SoC
[p. 1045]
-
P.-H. Bonnaud and G. Sommer
Entry mobile phone market is a mass volume segment
where the modem and application technologies are commoditized
and fully proven. Nevertheless the cost and power reduction
target continues to heavily drive leading edge innovations.
Semiconductor companies strive to integrate more and more
Printed Circuit Board (PCB) components into one single chip,
without discontinuing the technology node shrink roadmap, from
130nm down to 65nm. This duality between integration level and
aggressive silicon feature size reduction generates an innovative
environment where design engineers must create new
methodologies to cope with complex cross coupling mechanisms
and additional power dissipation. This paper describes one aspect
of the design methodology to reduce die and package cross-talk,
and focuses on the package co-design flow. The chip being
considered is a 65nm single chip System on Chip (SoC) including
EDGE RF, Power Management Unit (PMU), Audio Front End
(AFE) and FM Radio (FMR) circuits.
Keywords: eWLB; Chip&Package Co-Design; SoC integration,
cross-coupling, aggressors, victims
Organizer: A. Jerraya, CEA-LETI, FR
Moderators: G. Nicolescu, Polytechnique Montreal, CA; A. Jerraya, CEA-LETI, FR
Multicore SoCs integrate an increasing number of heterogeneous programmable units and sophisticated
communication interconnects. Unlike classic computers, the design of SoC includes the building of application
specific architecture and specific interconnect and other hardware components required to execute the software for a
well defined class of applications. In this case, the programming model hides both hardware and software interfaces
that may include sophisticated communication and synchronisation concepts to handle parallel programs running on
the processors. This embedded tutorial introduces the key technologies for the design of such complex devices.
Moderators: F. Angiolini, iNOCs; D. Atienza, Madrid Complutense U, ES
-
Latency Criticality Aware On-Chip Communication
[p. 1052]
-
Z. Li, J. Wu, L. Shang, R.P. Dick and Y. Sun
Packet-switched interconnect fabric is a promising
on-chip communication solution for many-core architectures. It
offers high throughput and excellent scalability for on-chip data
and protocol transactions. The main problem posed by this communication
fabric is the potentially-high and nondeterministic
network latency caused by router data buffering and resource
arbitration. This paper describes a new method to minimize on-chip
network latency, which is motivated by the observation that
only a small percentage of on-chip data and protocol traffic is
latency-critical. Existing work focusing on minimizing average
network latency is thus suboptimal. Such techniques expend
most of the design, area, and power overhead accelerating
latency-noncritical traffic for which there is no corresponding
application-level speedup.
We propose run-time techniques that identify latency-critical
traffic by leveraging network data-transaction and protocol
information. Latency-critical traffic is permitted to bypass router
pipeline stages and latency-noncritical traffic. These techniques
are evaluated via a router design that has been implemented using
TSMC 65nm technology. Detailed network latency simulation
and hardware characterization demonstrate that, for latency-critical
traffic, the proposed solution closely approximates the
ideal interconnect even under heavy load while preserving
throughput for both latency-critical and noncritical traffic.
-
In-Network Reorder Buffer to Improve Overall NoC Performance While Resolving the In-Order
Requirement Problem
[p. 1058]
-
W.-C. Kwon, S. Yoo, J. Um and S.-W. Jeong
Data-intensive functions on chip, e.g., codec, 3D graphics, pixel
processing, etc. need to make best use of the increased bandwidth of
multiple memories enabled by 3D die stacking via accessing
multiple memories in parallel. Parallel memory accesses with
originally in-order requirements necessitate reorder buffers to
avoid deadlock. Reorder buffers are expensive in terms of area and
power consumption. In addition, conventional reorder buffers suffer
from a problem of low resource utilization. In our work, we present
a novel idea, called in-network reorder buffer, to increase the
utilization of reorder buffer resource. In our method, we move the
reorder buffer resource and related functions from network
entry/exit points to network routers. Thus, the in-network reorder
buffers can be better utilized in two ways. First, they can be utilized
by other packets without in-order requirements while there are no
in-order packets. Second, even in-order packets can benefit from in-network
reorder buffers by receiving a larger share of reorder buffers
than before. Such an increase in reorder buffer utilization enables
NoC performance improvement while supporting the original in-order
requirements. Experimental results with an industrial
strength DTV SoC example show that the presented idea improves
the total execution cycle by 16.9%.
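The in-order requirement mentioned above can be illustrated with a minimal reorder buffer sketch. This is only a behavioural model of a conventional reorder buffer, not the in-network design of the paper; the class name, the `capacity` parameter and the sequence-number scheme are assumptions made for illustration.

```python
class ReorderBuffer:
    """Releases packets in sequence order, holding early arrivals (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed limit on packets held while waiting
        self.next_seq = 0         # next sequence number allowed to leave
        self.pending = {}         # seq -> payload for packets that arrived early

    def push(self, seq, payload):
        """Accept one packet and return the payloads that can now leave in order."""
        if seq != self.next_seq and len(self.pending) >= self.capacity:
            raise OverflowError("reorder buffer full")
        self.pending[seq] = payload
        released = []
        # Drain every packet that is now contiguous with the release point.
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released
```

Entries sit idle in `pending` until the missing packet arrives, which is the kind of low utilization the abstract describes.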
-
An Efficient Dynamic Multicast Routing Protocol for Distributing Traffic in NoCs
[p. 1064]
-
M. Ebrahimi, M. Daneshtalab, M.H. Neishaburi, S. Mohammadi, A. Afzali-Kusha, J. Plosila
and H. Tenhunen
Multicast communication is widely used in MPSoCs and NoCs
by parallel applications, for purposes such as cache
coherency in distributed shared-memory architectures, clock
synchronization, replication, and barrier synchronization. Among
the multicast schemes proposed for on-chip interconnection
networks, the path-based scheme has been shown to be
more efficient than the tree-based and unicast-based schemes. In this
paper, a low-distance path-based multicast scheme is proposed.
The proposed method takes advantage of network partitioning
and an efficient destination ordering algorithm. Results for
performance and power consumption show that the
proposed method outperforms previous on-chip path-based
multicasting algorithms.
-
Priority Based Forced Requeue to Reduce Worst Case Latencies for Bursty Traffic
[p. 1070]
-
M. Millberg and A. Jantsch
In this paper we introduce Priority Based Forced
Requeue to decrease worst-case latencies in NoCs offering best
effort services. Forced Requeue is to prematurely lift out low priority
packets from the network and requeue them outside using priority
queues. The first benefit of this approach, applicable to any NoC
offering best effort services, is that packets that have not yet entered
the network now compete with packets inside the network and
hence tighter bounds on admission times can be given. The second
benefit - which is more specific to deflective routing as in the Nostrum
NoC - is that packet "reshuffling" dramatically reduces the
latency inside the network for bursty traffic due to a lowered risk of
collisions at the exit of the network. This paper studies the Forced
Requeuing on a mesh with varying burst sizes and traffic scenarios.
The experimental results show a 50% reduction in worst-case
latency from a system perspective thanks to a reshaped latency distribution
whilst keeping the average latency the same.
Moderators: L. Fanucci, Pisa U, IT; O. Bringmann, FZI Forschungszentrum Informatik, DE
-
Optimizations of an Application-Level Protocol for Enhanced Dependability in FlexRay
[p. 1076]
-
W. Li, M. Di Natale, W. Zheng, P. Giusto, A. Sangiovanni-Vincentelli and S.A. Seshia
FlexRay [9] is an automotive standard for high-speed
and reliable communication that is being widely deployed
for next generation cars. The protocol has powerful error-detection
mechanisms, but its error-management scheme
forces a corrupted frame to be dropped without any
notification to the transmitter. In this paper, we analyze
the feasibility of and propose an optimization approach for
an application-level acknowledgement and retransmission
scheme for which transmission time is allocated on top of
an existing schedule. We formulate the problem as a Mixed
Integer Linear Program. The optimization is comprised
of two stages. The first stage optimizes a fault tolerance
metric; the second improves scheduling by minimizing the
latencies of the acknowledgement and retransmission messages.
We demonstrate the effectiveness of our approach
on a case study based on an experimental vehicle designed
at General Motors.
-
Remote Measurement of Local Oscillator Drifts in FlexRay Networks
[p. 1082]
-
E. Armengaud and A. Steininger
Distributed systems, especially time-triggered ones,
are implementing clock synchronization algorithms to provide
and maintain a common view of time among the different nodes.
Such architectures heavily rely on the nodes' local oscillators
to remain within given accuracy bounds. However, measuring
the oscillator frequencies (e.g., for maintenance or diagnosis) is
usually difficult to perform since it requires physical access to
each single node and may interfere with the running application.
Moreover, clock synchronization features tend to mask clock
deviations. In this work, we propose a non-intrusive method for
remote measurement of the individual oscillator drifts within a
distributed system. Our approach is based on a tester that sends
carefully aligned messages to stimulate the clock synchronization
service and records the resulting bus traffic for an analysis of
the nodes' synchronization behavior. This tester needs access to
the communication bus only. We focus our work on FlexRay and
validate our approach by experiments.
-
CAN+: A New Backward-Compatible Controller Area Network (CAN) Protocol with up to 16x
Higher Data Rates
[p. 1088]
-
T. Ziermann, S. Wildermann and J. Teich
As the number of electronic components in automobiles
steadily increases, the demand for higher communication
bandwidth also rises dramatically. Instead of installing new
wiring harnesses and new bus structures, it would be useful
if already available structures could be used, but driven
at higher data rates. In this paper, we a) propose an extension
of the well-known Controller Area Network (CAN)
called CAN+ with which the target rate of 1 Mbit/s can be
increased up to 16 times. Moreover, b) existing CAN hardware
and devices not dedicated to these boosted data rates
can still be used without interferences on communication.
The major idea is a change of the protocol. In particular, we
exploit the fact that data can be sent in time slots in which
CAN-conformant nodes do not listen. Finally, c) an implementation
of this type of overclocking scheme on an FPGA is
provided to prove the feasibility and the impressive throughput
gains.
-
Shock Immunity Enhancement via Resonance Damping in Gyroscopes for Automotive Applications
[p. 1094]
-
E. Marchetti, L. Fanucci, A. Rocchi and M. De Marinis
This paper presents an innovative and effective method
to improve the performance of a micro mechanical gyroscope by
damping its sensing quality factor. The
sensing quality factor is a key parameter of micro
mechanical gyroscope dynamics; in particular, a high sensing quality
factor means long settling time, high response overshoot and high
sensitivity to the external disturbances (shocks and vibrations) that
are typical of harsh automotive environments. For this reason,
micro mechanical gyroscopes employed in automotive
environments need high shock and vibration immunity. This
paper proposes a solution to reach this goal by adding a "virtual
damping" to the system with an electrostatic feedback technique.
This approach has been applied to a real automotive yaw gyro
system, and simulations performed in the Simulink™
environment show an appreciable output overshoot reduction,
with the benefit of higher vibration immunity, once the feedback
technique is implemented.
Micro mechanical gyroscope; electrostatic feedback technique;
shock immunity enhancement
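The link between quality factor, settling time and added damping can be made explicit with a standard second-order model. This is a textbook sketch, not the model used in the paper: it assumes the sense mode behaves as a mass-spring-damper and that the "virtual damping" acts as a velocity-proportional feedback force with a hypothetical gain c_v.

```latex
% Sense mode modelled as a mass-spring-damper (assumption)
m\ddot{x} + c\,\dot{x} + k\,x = F(t), \qquad
\omega_0 = \sqrt{k/m}, \qquad
Q = \frac{\sqrt{k m}}{c}

% The free response decays with the envelope e^{-\omega_0 t / (2Q)},
% so the settling time constant grows linearly with Q:
\tau = \frac{2Q}{\omega_0}

% Velocity-proportional feedback F_{fb} = -c_v \dot{x} (hypothetical gain c_v > 0)
% adds to the mechanical damping and lowers the effective quality factor:
Q_{\mathrm{eff}} = \frac{\sqrt{k m}}{c + c_v} < Q
```

Under these assumptions, a lower effective Q shortens the settling time and reduces overshoot, which is consistent with the qualitative behaviour the abstract reports.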
-
Integration of an Advanced Emergency Call Subsystem into a Car-Gateway Platform
[p. 1100]
-
N. Martínez Madrid, R. Seepold, A. Reina Nieves, J. Saez Gomez, A. los Santos Aransay,
P. Sanz Velasco, C. Rueda Morales and F. Ares
Several European research projects in the vehicular
area address the enhancement of vehicular safety. In the frame
of the Caring Cars project, an on-board car-gateway embedded
architecture for safety and wellness applications has been
designed. This paper puts forward the essentials of this modular,
dynamic and robust architecture and defines in detail the
advanced emergency call (eCall+), one of the most innovative
applications in the project. By means of the eCall+, the emergency
services will always be able to track the affected vehicle and
monitor the state of the car. The driver may also contact them
through videoconference in a critical situation. Thus, the system
can either prevent an accident or help the vehicle occupants and
the emergency services to save the occupants' lives after an
accident has occurred.
Keywords: eCall, eCall+, emergency, safety,
automotive, localization, services
Moderators: P. Ienne, EPF Lausanne, CH; R. Kastner, UC San Diego, US
-
Finite Precision Bit-Width Allocation Using SAT-Modulo Theory
[p. 1106]
-
A.B. Kinsman and N. Nicolici
This paper explores the use of SAT-Modulo Theory in
determination of bit-widths for finite precision implementation
of numerical calculations, specifically in the context of
scientific computing where division frequently occurs. Employing
SAT-Modulo Theory leads to more accurate bounds
estimation than those provided by other analytical methods,
in turn yielding smaller bit-widths.
-
HLS-L: High-Level Synthesis of High Performance Latch-Based Circuits
[p. 1112]
-
S. Paik, I. Shin and Y. Shin
An inherent performance gap between custom designs and
ASICs is one of the reasons why many designers still start their
designs from register transfer level (RTL) description rather
than from behavioral description, which can be synthesized
to RTL via high-level synthesis (HLS). Sequencing overhead
is one of the factors for this performance gap; the choice between
latch and flip-flop is not typically taken into account
during HLS, even though it affects all the steps of HLS. HLS-l
is a new design framework that employs high-performance
latches during scheduling, allocation, and controller synthesis.
Its main feature is a new scheduler that is based on a
concept of phase step (as opposed to conventional control
step), which allows scheduling at finer granularity, register
allocation that resolves the conflict of latch being read
and written at the same time, and controller synthesis that exploits
dual-edge triggered storage elements to support phase
step based scheduling. In experiments on benchmark designs
implemented in 1.2 V, 65-nm CMOS technology, HLS-l reduced
latency by 16.6% on average, with 9.5% less circuit
area, compared to the designs produced by conventional HLS.
-
Automatic Generation of Streaming Datapaths for Arbitrary Fixed Permutations
[p. 1118]
-
P.A. Milder, J.C. Hoe and M. Pueschel
This paper presents a technique to perform arbitrary
fixed permutations on streaming data. We describe a
parameterized architecture that takes as input n data points
streamed at a rate of w per cycle, performs a permutation over
all n points, and outputs the result in the same streaming format.
We describe the system and its requirements mathematically and
use this mathematical description to show that the datapaths
resulting from our technique can sustain a full throughput of
w words per cycle without stalling. Additionally, we provide an
algorithm to configure the datapath for a given permutation and
streaming width.
Using this technique, we have constructed a full synthesis
system that takes as input a permutation and a streaming width
and outputs a register-transfer level Verilog description of the
datapath. We present an evaluation of our generated designs
over varying problem sizes and streaming widths, synthesized
for a Xilinx Virtex-5 FPGA.
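The streaming format described above can be pinned down with a small functional reference model. It only captures the input/output behaviour (n words in, w per cycle; n permuted words out, w per cycle) and says nothing about the generated datapath; the function name and the convention that `perm[i]` gives the output position of input element `i` are assumptions.

```python
def stream_permute(words, perm, w):
    """Reference model: read n words w at a time, return the permuted words w at a time.

    perm[i] is the output position of input element i (assumed convention).
    """
    n = len(perm)
    if sorted(perm) != list(range(n)):
        raise ValueError("perm must be a permutation of 0..n-1")
    if len(words) != n or n % w != 0:
        raise ValueError("need exactly n words and a streaming width that divides n")
    buffer = [None] * n
    for cycle in range(n // w):          # one input chunk per cycle
        for lane in range(w):            # w words arrive in parallel
            i = cycle * w + lane
            buffer[perm[i]] = words[i]
    # Emit the permuted data in the same streaming format.
    return [buffer[c * w:(c + 1) * w] for c in range(n // w)]
```

For example, reversing eight words at a streaming width of two yields four output chunks, the first of which holds the last two input words in reverse order.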
-
SEU-Aware Resource Binding for Modular Redundancy Based Designs on FPGAs
[p. 1124]
-
S. Golshan and E. Bozorgzadeh
Although Triple Modular Redundancy (TMR) has
been widely used to mitigate single event upsets (SEUs) in
SRAM-based FPGAs, SEU-caused bridging faults between the
TMR modules mean that the correctness of a TMR design
under SEU is not guaranteed. In this paper, we present a novel approximation
algorithm for resource binding on scheduled datapaths in the
presence of TMR, which aims at containment of each SEU within
a single replica of tripled operations. The key challenges are to
avoid resource sharing between modular redundant operations
and also to reduce the possibility of TMR masking breaches in
resource allocation. We introduce the notion of vulnerability gap
during resource sharing to potentially reduce the effort for white
space allocation at the physical design stage in order to avoid
bridging faults between TMR resources. The experimental
results show that our proposed resource binding algorithm,
followed by a floorplanner, reduces the potential of TMR breaches
by 20%, on average.
Keywords: Triple modular redundancy; single event upset; high
level design; FPGA
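The first constraint named above, no resource sharing between modular redundant operations, can be checked mechanically on a given binding. The sketch below is only such a checker, not the approximation algorithm of the paper; the function name and the dictionary and tuple formats are hypothetical.

```python
def shared_replica_resources(binding, replicas):
    """Report replica groups in which two copies of one operation share a resource.

    binding:  dict mapping operation name -> resource name (hypothetical format)
    replicas: list of tuples, each holding the names of the copies of one operation
    """
    violations = []
    for group in replicas:
        used = set()
        for op in group:
            resource = binding[op]
            if resource in used:
                # A single upset in this resource could corrupt two replicas at once.
                violations.append((group, resource))
                break
            used.add(resource)
    return violations
```

An empty result means every replica of each tripled operation is bound to its own resource, so an upset in one resource stays inside one replica.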
Moderators: H. Obermeir, Infineon, DE; N. Nicolici, McMaster U, CA
-
Generation of Compact Test Sets with High Defect Coverage
[p. 1130]
-
X. Kavousianos and K. Chakrabarty
Multi-detect (N-detect) testing suffers from the drawback
that its test length grows linearly with N. We present a new
method to generate compact test sets that provide high defect
coverage. The proposed technique makes judicious use of a new
pattern-quality metric based on the concept of output deviations.
We select the most effective patterns from a large N-detect pattern
repository, and guarantee a small test set as well as complete
stuck-at coverage. Simulation results for benchmark circuits
show that with a compact, 1-detect stuck-at test set, the proposed
method provides considerably higher transition-fault coverage
and coverage ramp-up compared to another recently-published
method. Moreover, in all cases, the proposed method either outperforms
or is as effective as the competing approach in terms of
bridging-fault coverage and the surrogate BCE+ metric. In many
cases, higher transition-fault coverage is obtained than with much
larger N-detect test sets for several values of N. Finally, our results
provide the insight that, instead of using N-detect testing
with as large N as possible, it is more efficient to combine the
output deviations metric with multi-detect testing to get high-quality,
compact test sets.
-
A Scalable Method for the Generation of Small Test Sets
[p. 1136]
-
S. Remersaro, J. Rajski, S.M. Reddy and I. Pomeranz
This paper presents a scalable method to generate
close to minimal size test pattern sets for stuck-at faults in
scan based circuits. The method creates sets of potentially
compatible faults based on necessary assignments. It guides
the justification and propagation decisions to create patterns
that will accommodate most targeted faults. The technique
presented achieves close to minimal test pattern sets for
ISCAS circuits. For industrial circuits it achieves much
smaller test pattern sets than other methods in designs
sensitive to decision order used in ATPG.
-
QC-Fill: An X-Fill Method for Quick-and-Cool Scan Test
[p. 1142]
-
C.-W. Tzeng and S.-Y. Huang
In this paper, we present an X-Fill (QC-Fill)
method for not only slashing the test time but also
reducing the test power (including both capture power and
shifting power). QC-Fill, built upon the existing
multicasting scan architecture, can coexist with most
low-capture-power (LCP) X-fill methods through a
multicasting-driven X-fill method incorporating a
clique-stripping scheme. QC-Fill is independent of the
ATPG patterns and does not require any area-overhead
since it can directly operate on an existing scan
architecture incorporating test compression.
Index Terms - Scan Test, Multicasting, Test Compression,
Low-Power Scan, Low-capture-power X-fill
Moderators: P. Mosterman, The MathWorks, US ; E. Villar, Cantabria U, ES
-
Exploring Parallelizations of Applications for MPSoC Platforms Using MPA
[p. 1148]
-
R. Baert, E. Brockmeyer, S. Wuytack and T.J. Ashby
This paper presents a tool for exploring different
parallelization options for an application. It can be used to
quickly find a high-quality match between an application and
a multi-processor platform architecture. By specifying the parallelization
at a high abstraction level, and leaving the actual
source code transformations to the tool, a designer can try out
many parallelizations in a short time. A parallelization may use
either functional or data-level splits, or a combination of both.
An accompanying high-level simulator provides rapid feedback
about the expected performance of a parallelization, based on
platform parameters and profiling data of the sequential application
on the target processor. The use of the tool and simulator
is demonstrated on an MPEG-4 video encoder application and
two different platform architectures.
-
An MDE Methodology for the Development of High-Integrity Real-Time Systems
[p. 1154]
-
S. Mazzini, S. Puri and T. Vardanega
This paper reports on experience gained and lessons
learned from an intensive investigation of model-driven engineering
methodology and technology for application to high-integrity
systems. A favourable experimental context was provided by
ASSERT, a 40-month project partly funded by the EC as part
of the 6th Framework Program. The goodness of fit of the MDE
paradigm for the industrial domain of interest was critically assessed
on a small number of candidate solutions. One of the main
axes of investigation concerned HRT-UML/RCM, an advanced
method and integrated tool for the model-driven development
of embedded real-time software systems. HRT-UML/RCM drew heavily
on version 2 of the OMG UML standard and combined
it with the development of a domain-specific metamodel in the
quest to attain correctness-by-construction from the ground up.
The prototype tool developed in the project supported: (1) the
separation of functional (sequential) design from the specification
of real-time and concurrency requirements and properties to
be preserved at run time; and (2) the exploitation of a fully
generative approach to the development, equipped with support
for model-based feasibility analysis and round-trip engineering.
-
Mode-Based Reconfiguration of Critical Software Component Architectures
[p. 1160]
-
E. Borde, G. Haik and L. Pautet
Designing reconfigurable yet critical embedded and
complex systems (i.e. systems composed of different subsystems)
requires making these systems adaptable while guaranteeing
that they operate with respect to predefined safety properties.
When it comes to complex systems, component-based software
engineering methods provide solutions to master this complexity
("divide to conquer"). In addition, architecture description
languages provide solutions to design and analyze critical and
reconfigurable embedded systems. In this paper we propose a
methodology that combines the benefits of these two approaches
by leaning on both AADL and Lightweight CCM standards.
This methodology is materialized through a complete design
process and an associated framework, MyCCM-HI, dedicated
to designing reconfigurable, critical, and complex embedded
systems.
-
Towards a Formal Semantics for the AADL Behavior Annex
[p. 1166]
-
Z. Yang, K. Hu, D. Ma and L. Pi
AADL is an Architecture Description Language which
describes embedded real-time systems. The behavior annex is an
extension of the dispatch mechanism of the AADL execution model.
This paper proposes a formal semantics for the AADL behavior
annex using Timed Abstract State Machine (TASM). First, the
semantics of the AADL default execution model is given, and then we
formally define the semantics of some aspects of the behavior annex. A
prototype for real-time behavior modeling and verification is
proposed, and finally, a case study is given to validate its
feasibility.
Keywords- AADL; behavior annex; execution model; TASM
Moderators: W. Schilders, NXP Semiconductors, NL ; L. Silveira, INESC ID / IST - TU Lisbon, PT
-
On the Efficient Reduction of Complete EM Based Parametric Models
[p. 1172]
-
J. Fernandez Villena, G. Ciuprina, D. Ioan and L.M. Silveira
Due to higher integration and increasing frequency based
effects, full Electromagnetic Models (EM) are needed for accurate
prediction of the real behavior of integrated passives and interconnects.
Furthermore, these structures are subject to parametric
effects due to small variations of the geometric and physical
properties of the inherent materials and manufacturing process.
Accuracy requirements lead to huge models, which are expensive
to simulate and this cost is increased when parameters and their
effects are taken into account. This paper presents a complete
procedure for efficient reduction of realistic, hierarchy aware,
EM based parametric models. Knowledge of the structure of
the problem is explicitly exploited using domain partitioning and
novel electromagnetic connector modeling techniques to generate
a hierarchical representation. This enables the efficient use of
block parametric model order reduction techniques to generate
block-wise compressed models that satisfy overall requirements,
and provide accurate approximations of the complete EM behaviour,
which are cheap to evaluate and simulate.
-
Efficient Compression and Handling of Current Source Model Library Waveforms
[p. 1178]
-
S. Hatami, P. Feldmann, S. Abbaspour and M. Pedram
This paper describes a waveform compression
technique suitable for the efficient utilization, storage and
interchange of the emerging current source model (CSM) based
cell libraries. The technique is based on pre-processing of a
collection of voltage/current waveforms for the cells in the library
and then, constructing an orthogonal time-voltage/time-current
waveform basis using singular-value decomposition.
Compression is achieved by representing all waveforms as linear
combination coefficients of adaptive subset of the basis
waveforms. Experimental results indicate that adaptive
waveform representation results in higher compression ratios
than the waveform representation as a function of fixed set of
basis functions. Interpolation and further compression are
obtained by representing the coefficients as simple functions of
various parameters, e.g., input slew, load capacitance, supply
voltage, and temperature. The methods introduced in this paper
are tested and validated on several industrial strength libraries,
with spectacular compression results.
Keywords- Current Source Model; Adaptive Data Compression;
Parameterization; Principal Component; Pre-processing
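The basis construction can be written out as a truncated singular-value decomposition. The following is a generic sketch with a fixed basis size k; the adaptive, per-waveform choice of basis subset described in the abstract is not modelled, and the symbols W, T, N and k are introduced here for illustration only.

```latex
% N pre-processed waveforms, each sampled at T points, stacked as columns of W
W = \begin{bmatrix} w_1 & w_2 & \cdots & w_N \end{bmatrix} \in \mathbb{R}^{T \times N},
\qquad W = U \Sigma V^{\mathsf{T}}

% Keep the k leading left singular vectors u_1, \dots, u_k as the waveform basis.
% They are orthonormal, so each coefficient is a simple projection:
w_j \approx \sum_{i=1}^{k} a_{ij}\, u_i, \qquad a_{ij} = u_i^{\mathsf{T}} w_j

% Storage falls from T N samples to k basis vectors plus k coefficients per waveform:
T N \;\longrightarrow\; k\,(T + N)
```

With a fixed k, the compression ratio is TN / (k(T + N)); the abstract reports that choosing the basis subset adaptively gives higher ratios than such a fixed set.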
-
New Simulation Methodology of 3D Surface Roughness Loss for Interconnects Modeling
[p. 1184]
-
Q. Chen and N. Wong
As clock frequencies exceed the gigahertz range, the
extra power loss due to conductor surface roughness in
interconnects and packaging is more evident and thus demands
a proper accounting for accurate prediction of signal
integrity and energy consumption. Existing techniques based
on analytical approximation often suffer from a narrow valid
range, i.e., small or large limit of roughness. In this paper, we
propose a new simulation methodology for surface roughness
loss that is applicable to general surface roughness and a
wide frequency range. The method is based on 3D statistical
modeling of surface roughness and the numerical solution of
scalar wave modeling (SWM) with the method of moments
(MOM). The spectral stochastic collocation method (SSCM)
is applied in association with random surface modeling to avoid
the time-consuming Monte-Carlo (MC) simulation. Comparisons
with existing methods in their respective valid region
then verify the effectiveness of our approach.
-
An Efficient Decoupling Capacitance Optimization Using Piecewise Polynomial Models
[p. 1190]
-
X. Wang, Y. Cai, S. X.-D. Tan, X. Hong and J. Relles
This paper proposes an efficient decoupling capacitance (decap)
optimization algorithm to reduce the voltage noise
of on-chip power grid networks. The new method is based on
the efficient charge formulation of the decap allocation problem.
Unlike the existing work [12], however, the new method applies
more accurate piecewise polynomial micromodels to estimate
the voltage noise during the linear programming process. The
resulting method overcomes the over-estimation problem that
plagues the existing method. The proposed method has the best
of both worlds: the efficiency of the charge-based methods
and the accuracy of the sensitivity-based methods. Experimental
results demonstrate that the proposed method yields decap
values similar to those of the sensitivity-based methods, which give
the best reported results, and much better than those of the existing
charge-based method, while retaining efficiency similar to that
of the charge-based method.
Moderators: M. Coppola, STMicroelectronics, FR; L. Fanucci, Pisa U, IT
-
An Automated Flow For Integrating Hardware IP into the Automotive Systems Engineering
Process
[p. 1196]
-
J. Oetjens, R. Goergen, J. Gerlach and W. Nebel
This contribution shows and discusses the requirements
and constraints that an industrial engineering
process defines for the integration of hardware IP into
the system development flow. It describes the developed
strategy for automating the step of making hardware descriptions
available in a MATLAB/Simulink based system
modeling and validation environment. It also explains the
transformation technique on which that strategy is based.
An application of the strategy is shown in terms of an
industrial automotive electronic hardware IP block.
-
Model Based Design Needs High Level Synthesis
[p. 1202]
-
S. Perry
Model Based Design tools based around Simulink
from The MathWorks are a popular technology for the creation
of streaming DSP designs for FPGAs, since they offer the promise
of rapid design exploration through immediate quantitative
feedback of algorithm performance. Current tools typically use a
library of components that reflect an explicit representation of
the underlying FPGA device features. This is undesirable since
the designer is forced to mix implementation and architecture,
and leads to long design cycles and non-portable results. This
paper shows that introducing techniques of high level synthesis
allows a more elegant design at a higher level of abstraction. This
results in fewer components needed for a design which translates
into a faster design cycle, more portable designs and fewer
defects. Pushbutton clock frequencies of up to 500 MHz are
achieved without detailed knowledge of FPGA architectures.
Although the capabilities described are embodied in the DSP
Builder tool from Altera, this paper describes the technology
involved rather than the details of the tools. Four major
technologies are described: a latency-insensitive system
representation, the module level internal representation with
associated transformations, hardware retiming, and lastly a FIR
filter design tool layered on top.
Keywords- Model Based Design; High Level Synthesis; FPGAs;
Technology Mapping; Retiming; FIR Filter Design.
-
EMC-Aware Design on a Microcontroller for Automotive Applications
[p. 1208]
-
P.J. Doriol, Y. Villavicencio, C. Forzan, M. Rotigni, G. Graziosi and D. Pandini
In modern digital ICs, the increasing demand for performance
and throughput requires operating frequencies
of hundreds of megahertz, and in several cases
exceeding the gigahertz range. Following the technology
scaling trends, this request will continue to rise,
thus increasing the electromagnetic interference (EMI)
generated by electronic systems. The enforcement of
strict governmental regulations and international
standards, mainly (but not only) in the automotive domain,
is driving new efforts towards design solutions
for electromagnetic compatibility (EMC). Hence,
EMC/EMI is rapidly becoming a major concern for
high-speed circuit and package designers. The on-chip
power rail noise is one of the most detrimental sources
of electromagnetic (EM) conducted emissions, since it
propagates to the board through the power and ground
I/O pads. In this work we investigate the impact of
power rail noise on EMI, and we show that by limiting
this noise source it is possible to drastically reduce the
conducted emissions. Furthermore, we present a transistor-level
lumped-element simulation model of the
system power distribution network (PDN) that allows
chip, package, and board designers to assess the power
integrity and predict the conducted emissions at critical
chip I/O pads. The experimental results obtained
on an industrial microcontroller for automotive applications
demonstrate the effectiveness of our approach.
-
Semiformal Verification of Temporal Properties in Automotive Hardware Dependent Software
[p. 1214]
-
D. Lettnin, P.K. Nalla, J. Behrend, J. Ruf, J. Gerlach, T. Kropf, W. Rosenstiel, V. Schoenknecht
and S. Reitemeyer
The verification of embedded software has become
an important subject over the last years.
This work presents a new semiformal verification
approach called SofTPaDS. It combines assertion-based
and symbolic simulation approaches for the
verification of embedded software with hardware dependencies.
SofTPaDS proves to be more efficient
than software model checkers at tracing
deep state spaces and improves the state coverage
relative to a simulation-based verification tool. We
have successfully applied our approach to industrial
automotive embedded software.
-
On the Relationship between Stuck-At Fault Coverage and Transition Fault Coverage
[p. 1218]
-
J. Schat
The single stuck-at fault coverage is often seen as a
figure-of-merit also for scan testing according to other
fault models like transition faults, bridging faults,
crosstalk faults, etc. This paper analyzes how far this
assumption is justified.
Since the scan test infrastructure allows reaching
states not reachable in the application mode, and since
faults only detectable in such unreachable states are
not relevant in the application mode, we distinguish
those irrelevant faults from relevant faults, i.e. faults
detectable in the application mode.
We prove that every combinatorial circuit with exactly
100% stuck-at fault coverage has 100% transition
fault test coverage for those faults which are relevant
in the application.
This does not necessarily imply that combinatorial
circuits with almost 100% single-stuck-at coverage
automatically have high transition fault coverage. This
is shown in an extreme example of a circuit with
nearly 100% stuck-at coverage, but 0% transition
fault coverage.
-
System-Level Hardware-Based Protection of Memories against Soft-Errors
[p. 1222]
-
V. Gherman, S. Evain, M. Cartron, N. Seymour and Y. Bonhomme
We present a hardware-based approach to improve the
resilience of a computer system against errors occurring in
the main memory with the help of error detecting and correcting
(EDAC) codes. Checksums are placed in the same type of
memory locations and addressed in the same way as normal
data. Consequently, the checksums are accessible from the
exterior of the main memory just as normal data and this
enables implicit fault-tolerance for interconnection and solid-state
secondary storage sub-systems. A small hardware module
is used to manage the sequential retrieval of checksums
each time the integrity of the data accessed by the processor
sub-system needs to be verified. The proposed approach has
the following properties: (a) it is cost efficient since it can be
used with simple storage and interconnection sub-systems that
do not possess any inherent EDAC mechanism, (b) it allows
on-line modifications of the memory protection levels, and (c)
no modification of the application software is required.
-
A Study of the Single Event Effects Impact on Functional Mapping within Flash-Based FPGAs
[p. 1226]
-
F. Abate, L. Sterpone, M. Violante and F. Lima Kastensmidt
Flash-based FPGAs are increasingly demanded in
safety critical fields, in particular space and avionic ones, due to
their non-volatile configuration memory. Although they are
almost immune to permanent loss of the configuration data, they
are composed of floating gate based switches that can suffer
transient effects if hit by highly energetic particles, with critical
consequences for the implemented logic. This paper presents a
new way of analyzing the impact of Single Event Effects in
Flash-based FPGAs. We propose a new methodology to identify
the most critical switches inside the configuration logic block and
the most redundant and robust configuration selection for each
logic function. The experimental results achieved by fault
injection demonstrate the feasibility of the proposed method
and show that by using the most robust functional mapping it is
possible to enhance the reliability of the entire design with
respect to a non-robust one.
-
Finite Precision Processing in Wireless Applications
[p. 1230]
-
D. Novo, M. Li, B. Bougard, L. Van der Perre and F. Catthoor
Complex signal processing algorithms are often
specified in floating point precision. Thus, a type conversion is
needed when the targeted platform requires fixed-point precision.
In this work we propose a new method to evaluate the final
impact of finite precision processing in wireless applications. The
method combines analytical analysis with simulations. It extends
previous work by including the effect of the decision-making errors
resulting from quantization. This enables efficient dimensioning of
the minimum bit-widths that satisfy a given accuracy constraint.
The method is validated with two representative
case studies, namely an OFDM inner receiver and a Near-ML
MIMO (Multiple Inputs, Multiple Outputs) detector.
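As a rough illustration of the bit-width dimensioning mentioned in this abstract, the following Python sketch quantizes a test signal to signed fixed-point and searches for the smallest word length that meets a signal-to-quantization-noise target. It is only a simulation-side sketch under stated assumptions: the function names, the sine test signal, and the 40 dB target are hypothetical, and it does not reproduce the paper's analytical model or its handling of decision-making errors.

```python
import math

def quantize(x, frac_bits, total_bits):
    """Round x to signed fixed-point (total_bits wide, frac_bits fractional), saturating."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, round(x * scale)))
    return q / scale

def sqnr_db(signal, frac_bits, total_bits):
    """Signal-to-quantization-noise ratio in dB for the given format."""
    noise = sum((s - quantize(s, frac_bits, total_bits)) ** 2 for s in signal)
    power = sum(s * s for s in signal)
    return float("inf") if noise == 0 else 10 * math.log10(power / noise)

def min_width(signal, target_db, max_bits=24):
    """Smallest word length (1 sign bit, rest fractional) meeting target_db, or None."""
    for total in range(2, max_bits + 1):
        if sqnr_db(signal, total - 1, total) >= target_db:
            return total
    return None

# Hypothetical test signal: a sine wave with amplitude 0.9 in the range [-1, 1).
signal = [0.9 * math.sin(2 * math.pi * 7 * k / 256) for k in range(256)]
width = min_width(signal, target_db=40.0)
print("minimum word length for 40 dB:", width)
```

A full flow would replace the SQNR criterion with an application-level metric such as bit error rate, which is where decision errors become relevant.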
-
A Physical-Location-Aware X-Filling Method for IR-Drop Reduction in At-Speed Scan Test
[p. 1234]
-
W.-W. Hsieh, I.-S. Lin and T. Hwang
The IR-drop problem during test mode exacerbates delay
defects and results in false failures. In this paper, we take the X-filling
approach to reduce the IR-drop effect during at-speed test.
The main difference between our approach and the previous
X-filling methods [7]-[9] lies in two aspects. The first is
that we take spatial information into consideration in our
approach. The second is how X-filling is performed: we
propose a backward-propagation approach instead of the forward-propagation
approach taken in previous work. The experimental
results show a 42.81% reduction in the worst IR-drop
and a 45.71% reduction in the average IR-drop compared
to the random-fill method.
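To make the notion of X-filling concrete, the sketch below applies adjacent fill, a simple and widely known heuristic, to a test cube containing don't-care bits and counts the resulting bit transitions. This is an assumption chosen for illustration only: it is neither the backward-propagation method of the paper nor location-aware, and the function names are hypothetical.

```python
def adjacent_fill(cube):
    """Replace each 'X' with the nearest specified bit to its left.

    Leading X's take the first specified bit to their right;
    a cube with no specified bits is filled with '0'.
    """
    bits = list(cube)
    last = next((b for b in bits if b != "X"), "0")
    for i, b in enumerate(bits):
        if b == "X":
            bits[i] = last
        else:
            last = b
    return "".join(bits)

def transitions(pattern):
    """Number of adjacent bit pairs that differ (a crude switching proxy)."""
    return sum(a != b for a, b in zip(pattern, pattern[1:]))

cube = "X1XX0X1X"
filled = adjacent_fill(cube)
print(filled, transitions(filled))
```

A random fill of the same cube would typically produce more transitions, which is why fill strategy matters for switching activity and hence IR-drop.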
-
Efficient Reliability Simulation of Analog ICs Including Variability and Time-Varying Stress
[p. 1238]
-
E. Maricau and G. Gielen
Aggressive scaling to nanometer CMOS technologies
causes both analog and digital circuit parameters to degrade over
time due to die-level stress effects (e.g. NBTI, HCI, and TDDB).
In addition, failure-time dispersion increases due to increasing
process variability. In this paper an innovative methodology to
simulate analog circuit reliability is presented. Advantages over
current state of the art reliability simulators include, among
others, the possibility to estimate the impact of variability and
the ability to account for the effects of complex time-varying
stress signals. Results show that taking time-varying stress signals
into account provides circuit reliability information not visible
with classic DC-only reliability simulators. Also, variability-aware
reliability simulation results indicate a significant percentage of
early circuit failures compared to failure-time results based on
nominal design only.
-
A Generic Architecture of CCSDS Low Density Parity Check Decoder for Near-Earth Applications
[p. 1242]
-
F. Demangel, N. Fau, N. Drabik, F. Charot and C. Wolinski
Low Density Parity Check (LDPC) codes have recently
been chosen in the CCSDS standard for use in near-earth
applications. The specified code belongs to the class of
Quasi-Cyclic LDPC codes which provide very high data
rates and high reliability. Even if these codes are suited
to high data rate, the complexity of LDPC decoding is a
real challenge for hardware engineers. This paper presents
a generic architecture for a CCSDS LDPC decoder. This
architecture uses the regularity and the parallelism of the
code and a genericity based on an optimized storage of the
data. Two FPGA implementations are proposed: the first
one is low-cost oriented and the second one targets high-speed
decoding.
-
Property Analysis and Design Understanding
[p. 1246]
-
U. Kuehne, D. Grosse and R. Drechsler
Verification is a major issue in circuit and system
design. Formal methods like bounded model checking (BMC) can
guarantee a high quality of the verification. There are several
techniques that can check if a set of formal properties forms a
complete specification of a design. But, in contrast to simulation-based
methods, like random testing, formal verification requires
a detailed knowledge of the design implementation. Finding
the correct set of properties is a tedious and time consuming
process. In this paper, two techniques are presented that provide
automatic support for writing properties in a quality-driven BMC
flow. The first technique can be used to analyze properties in
order to remove redundant assumptions and to separate different
scenarios. The second technique - inverse property checking -
automatically generates valid properties for a given expected
behavior. The techniques are integrated with a coverage check for
BMC. Using the presented techniques, the number of iterations
to obtain full coverage can be reduced, saving time and effort.
-
Test Exploration and Validation Using Transaction Level Models
[p. 1250]
-
M.A. Kochte, C.G. Zoellin, M.E. Imhof, R. Salimi Khaligh, M. Radetzki, H.-J. Wunderlich, S. Di Carlo and P. Prinetto
The complexity of the test infrastructure and test
strategies in systems-on-chip approaches the complexity of the
functional design space. This paper presents test design space
exploration and validation of test strategies and schedules using
transaction level models (TLMs). Since many aspects of testing
involve the transfer of a significant amount of test stimuli and
responses, the communication-centric view of TLMs suits this
purpose exceptionally well.
Index Terms - Test of systems-on-chip, design-for-test, transaction
level modeling
Organizer: P. Van der Wolf, NXP Semiconductors, NL
Moderators: D. Lattard, CEA-LETI, FR; P. Van der Wolf, NXP Semiconductors, NL
-
Heterogeneous Multi-Core Platform for Consumer Multimedia Applications
[p. 1254]
-
P. Kollig, C. Osborne and T. Henriksson
This paper presents a multi-core SoC architecture for
consumer multimedia applications. The comprehensive
functionality of such multimedia systems is described using the
example of a hybrid TV application. The successful usage of a
heterogeneous multi-core SoC platform is presented and it is
shown how specific challenges such as inter-processor
communication and real-time performance guarantees in
physically centralized memory systems are addressed.
Keywords - multiprocessor; TV; physically
centralized memory system
-
Multi-Core for Mobile Phones
[p. 1260]
-
C.H. (K) van Berkel
High-end mobile phones support multiple radio
standards and a rich suite of applications, which involves advanced
radio, audio, video, and graphics processing. The overall
digital workload amounts to nearly 100 GOPS, from 4b integer to
24b floating-point operations. With a power budget of only 1 W
this inevitably leads to heterogeneous multi-core architectures
with aggressive power management. We review the state-of-the-art
as well as trends.
Organizers/Moderators: A. Jerraya, CEA-LETI, FR; P. Van der Wolf, NXP Semiconductors, NL
-
Strategic Directions towards Multicore Application Specific Computing
[p. 1266]
-
E. Flamand
Modern Systems on Chip rely strongly on highly complex, specialized, mixed hardware/software subsystems
to handle processing-intensive tasks: 3D graphics, imaging, video, software radio, positioning... The cost and
difficulty of super integration, lack of flexibility, and little resource sharing, combined with a new class of issues attached to
deep submicron process variability and reliability, open opportunities to revisit more regular, programmable approaches as
an alternative. Will our industry see the emergence of a new generation of standard mega cells that can be assembled
as homogeneous many-core fabrics as an alternative to today's heterogeneous SoCs? We strongly believe that the
answer is yes, and in this talk we will go through the many facets of this question.
Moderators: H. Patel, UC Berkeley, US; D. Chen, U of Illinois, Urbana Champaign, US
-
Energy-Efficient Spatially-Adaptive Clustering and Routing in Wireless Sensor Networks
[p. 1267]
-
H. Long, Y. Liu, X. Fan, R.P. Dick and H. Yang
Wireless sensor networks hold the potential to open
new domains to distributed data acquisition. However, low-cost
battery-powered nodes are often used to implement such networks,
resulting in tight energy and communication bandwidth
constraints. Cluster-based data compression and aggregation
helps to reduce communication energy consumption. However,
neglecting to adapt cluster sizes to local network conditions has
limited the efficiency of previous clustering schemes. We have
found that sensor node distances and densities are key factors in
clustering. To the best of our knowledge, this is the first work
taking these factors into consideration when adaptively forming
data aggregation clusters. Compared with previous uniform-size
clustering techniques, the proposed algorithm achieves up to 24%
communication energy savings in uniform density networks and
36% savings in non-uniform density networks.
-
Online Adaptation Policy Design for Grid Sensor Networks with Reconfigurable Embedded Nodes
[p. 1273]
-
V. Subramanian, M. Gilberti and A. Doboli
This paper presents a systematic methodology for
designing the adaptation policies of reconfigurable sensor networks.
The work is motivated by the need to provide efficient
sensing, processing, and networking capabilities under tight
hardware, bandwidth, and energy constraints. The design flow
includes two main steps: generation of alternative design points
representing different performance-cost trade-offs, and finding
the switching rates between the points to achieve effective
adaptation. Experiments studied the scaling of the methods with
the size of the networks, and the effectiveness of the produced
policies with respect to data loss, latency, power consumption,
and buffer space.
-
Defect-Aware Logic Mapping for Nanowire-Based Programmable Logic Arrays via Satisfiability
[p. 1279]
-
Y. Zheng and C. Huang
Programmable logic arrays (PLAs) using self-assembled
nanowire crossbars have shown promising potential for
future nano-scale circuit design. However, due to the density and
size factors of nanowires and molecular switches, the fabrication
fault densities are much higher than those of the conventional
silicon technology, and hence pose greater design challenges.
In this paper, we propose a novel defect-aware logic mapping
framework via Boolean satisfiability (SAT). Compared with the
prior works, our technique considers PLA defects on both input
and output planes at the same time. This synergistic approach
can help solve logic mapping problems with higher defect rates.
The proposed method is universally suitable for various nanoscale
PLAs, including AND/OR, NOR/NOR structures, etc. The
experimental results have shown that it can efficiently solve large
mapping problems at a total defect rate of 20% or even higher.
We further investigate the impact of different defects on PLA
mapping, which helps set up an initial contribution for yield
estimation and utilization of partially-defective PLAs.
-
Debugging of Toffoli Networks
[p. 1284]
-
R. Wille, D. Grosse, S. Frehse, G.W. Dueck and R. Drechsler
Intensive research is performed to find post-CMOS
technologies. Quantum computers, which are based on reversible
logic, are a very promising direction. While synthesis, testing, and
verification have been investigated in the domain of reversible logic,
debugging of reversible circuits has not yet been considered. The
goal of debugging is to determine gates of an erroneous circuit
that explain the observed incorrect behavior.
In this paper we propose the first approach for automatic
debugging of reversible Toffoli networks. Our method uses
a formulation for the debugging problem based on Boolean
satisfiability. We show the differences to classical (irreversible)
debugging and present theoretical results. These are used to
speed-up the debugging approach as well as to improve the
resulting quality. Our method is able to find and to correct single
errors automatically.
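The debugging problem described in this abstract can be illustrated with a small Python sketch that simulates a Toffoli network and searches, by brute force, for single-gate replacements that make an erroneous circuit match its specification. The exhaustive search is a stand-in chosen for brevity; the paper formulates the problem with Boolean satisfiability, and the gate encoding `(controls, target)` and all function names here are assumptions.

```python
from itertools import combinations, product

def apply_gate(state, gate):
    """Apply one Toffoli gate: flip the target line if all control lines are 1."""
    controls, target = gate
    if all(state[c] for c in controls):
        flipped = list(state)
        flipped[target] ^= 1
        return tuple(flipped)
    return state

def simulate(circuit, state):
    for gate in circuit:
        state = apply_gate(state, gate)
    return state

def truth_table(circuit, n):
    return {s: simulate(circuit, s) for s in product((0, 1), repeat=n)}

def all_gates(n):
    """Every Toffoli gate on n lines (any subset of other lines as controls)."""
    for target in range(n):
        others = [line for line in range(n) if line != target]
        for k in range(len(others) + 1):
            for controls in combinations(others, k):
                yield (controls, target)

def single_gate_fixes(circuit, spec, n):
    """Return (index, replacement) pairs that make the circuit match spec."""
    fixes = []
    for i in range(len(circuit)):
        for gate in all_gates(n):
            candidate = circuit[:i] + [gate] + circuit[i + 1:]
            if truth_table(candidate, n) == spec:
                fixes.append((i, gate))
    return fixes

good = [((0, 1), 2), ((0,), 1)]   # intended circuit
bad = [((0, 1), 2), ((2,), 1)]    # second gate has a wrong control line
spec = truth_table(good, 3)
fixes = single_gate_fixes(bad, spec, 3)
print(fixes)
```

Because every Toffoli gate is its own inverse, the truth table of any such network is a permutation of the input space, one of the properties that distinguishes reversible from classical debugging.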
-
Cross-Contamination Avoidance for Droplet Routing in Digital Microfluidic Biochips
[p. 1290]
-
Y. Zhao and K. Chakrabarty
Recent advances in droplet-based digital microfluidics
have enabled biochip devices for DNA sequencing, immunoassays,
clinical chemistry, and protein crystallization. Since cross-contamination
between droplets of different biomolecules can lead
to erroneous outcomes for bioassays, the avoidance of cross-contamination
during droplet routing is a key design challenge
for biochips. We propose a droplet-routing method that avoids
cross-contamination in the optimization of droplet flow paths. The
proposed approach targets disjoint droplet routes and minimizes
the number of cells used for droplet routing. We also minimize the
number of wash operations that must be used between successive
routing steps that share unit cells in the microfluidic array. Two
real-life biochemical applications are used to evaluate the proposed
droplet-routing methods.
Moderators: A. Baghdadi, Telecom Bretagne, FR; W. Eberle, IMEC, BE
-
Error Correction in Single-Hop Wireless Sensor Networks - A Case Study
[p. 1296]
-
D. Schmidt, M. Berning and N. Wehn
Energy efficient communication is a key issue in
wireless sensor networks. Common belief is that a multi-hop
configuration is the only viable energy efficient technique. In
this paper we show that the use of forward error correction
techniques in combination with ARQ is a promising alternative.
Exploiting the asymmetry between lightweight sensor nodes and
a more powerful base station even advanced techniques known
from cellular networks can be efficiently applied to sensor
networks. Our investigations are based on realistic power models
and real measurements and, thus, consider all side-effects. This
is to the best of our knowledge the first investigation of advanced
forward error correction techniques in sensor networks which is
based on real experiments.
-
Design of an Application-Specific Instruction Set Processor for High-Throughput and Scalable FFT
[p. 1302]
-
X. Guan, H. Lin and Y. Fei
Various Orthogonal Frequency Division Multiplexing
(OFDM)-based wireless communication standards
have raised more stringent requirements on throughput
and flexibility of the Fast Fourier Transform (FFT), a
kernel data transformation task in communication systems.
The application-specific instruction set processor (ASIP) has
emerged as a promising solution to meet these requirements.
In this paper, we propose a novel ASIP design tailored for
FFT computation. We reconstruct the FFT computation flow
into a scalable array structure based on an 8-point butterfly
unit (BU). Any-point FFT computation can be carried out in
the array structure which can easily expand along both the
horizontal and vertical dimensions. We incorporate custom
register files to reduce memory access. The data address for
custom registers in each FFT stage is changed accordingly,
and we derive a regular address changing (AC) rule. With the
microarchitecture modifications, we extend the instruction
set with three custom instructions correspondingly. Our FFT
ASIP implementation achieves a large performance improvement
over the standard FFT software implementation, a
TI DSP processor, and a commercial Xtensa ASIP, with
data throughput improvements of 866.5X, 5.9X, and 2.3X,
respectively. Meanwhile, the area and power consumption
overhead of the custom hardware is negligible.
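For readers unfamiliar with butterfly-based FFT computation, the following Python sketch shows a plain recursive radix-2 decimation-in-time FFT and checks it against a direct DFT. It is a minimal illustration under stated assumptions: it uses 2-point butterflies rather than the 8-point butterfly unit of the proposed ASIP, and it does not model the array structure, the custom register files, or the address changing rule.

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine one even and one odd term with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def dft(x):
    """Direct O(n^2) DFT used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

samples = [1, 2, 3, 4, 0, 0, 0, 0]
spectrum = fft(samples)
max_err = max(abs(a - b) for a, b in zip(spectrum, dft(samples)))
print("max deviation from direct DFT:", max_err)
```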
-
A Novel LDPC Decoder for DVB-S2 IP
[p. 1308]
-
S. Mueller, M. Schreger, M. Kabutz, M. Alles, F. Kienle and N. Wehn
In this paper a programmable Forward Error Correction
(FEC) IP for a DVB-S2 receiver is presented. It is composed
of a Low-Density Parity Check (LDPC) decoder, a Bose-Chaudhuri-Hocquenghem
(BCH) decoder, and pre- and postprocessing units.
Special emphasis is put on LDPC decoding, since it accounts for
the most complexity of the IP core by far.
We propose a highly efficient LDPC decoder which applies
Gauss-Seidel decoding. In contrast to previous publications,
we show in detail how to solve the well known problem of
superpositions of permutation matrices. The enhanced convergence
speed of Gauss-Seidel decoding is used to reduce area
and power consumption. Furthermore, we propose a modified
version of the λ-Min algorithm which further decreases
the memory requirements of the decoder by compressing the
extrinsic information.
Compared to the latest published DVB-S2 LDPC decoders,
we could reduce the clock frequency by 40% and the memory
consumption by 16%, yielding large energy and area savings
while offering the same throughput.
Index Terms - Forward Error Correction, Soft Decision Decoding,
LDPC, DVB-S2, Check Node approximation.
-
A Flexible Floating-Point Wavelet Transform and Wavelet Packet Processor
[p. 1314]
-
A. Guntoro and M. Glesner
The richness of wavelet transformation is known in
many fields. There exist different classes of wavelet filters that can
be used depending on the application. In this paper, we propose
an IEEE 754 floating-point lifting-based wavelet processor that
can perform various forward and inverse Discrete Wavelet
Transforms (DWTs) and Discrete Wavelet Packets (DWPs). Our
architecture is based on processing elements that can perform
either prediction or update on a continuous data stream in
every two clock cycles. We also consider the normalization
step that takes place at the end of the forward DWT/DWP
or at the beginning of the inverse DWT/DWP. To cope with
different wavelet filters, we feature a multi-context configuration
to select among various DWTs/DWPs. Different memory sizes
and multi-level transformations are supported. For the 32-bit
implementation, the estimated area of the proposed processor
with 2x512 words memory and 8 PEs in a 0.18-μm process is
3.7 mm square and the estimated operating speed is 353 MHz.
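The predict and update steps that the processing elements perform can be sketched with the simplest lifting instance, the Haar wavelet. This Python example is an illustrative assumption: the abstract does not say which filters the processor supports, and the normalization step it mentions is omitted here.

```python
def haar_forward(x):
    """One level of a lifting-based Haar transform on an even-length list."""
    even, odd = x[0::2], x[1::2]
    detail = [o - e for e, o in zip(even, odd)]          # predict step
    approx = [e + d / 2 for e, d in zip(even, detail)]   # update step
    return approx, detail

def haar_inverse(approx, detail):
    """Invert the lifting steps in reverse order with opposite signs."""
    even = [s - d / 2 for s, d in zip(approx, detail)]   # undo update
    odd = [d + e for e, d in zip(even, detail)]          # undo predict
    x = [0.0] * (2 * len(even))
    x[0::2], x[1::2] = even, odd
    return x

data = [2.0, 4.0, 6.0, 6.0]
approx, detail = haar_forward(data)
restored = haar_inverse(approx, detail)
print(approx, detail, restored)
```

Longer filters are built from the same pattern by chaining more predict and update stages, and applying the forward step again to `approx` gives a multi-level transform.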
Moderators: F. Fummi, Verona U, IT; M. Zwolinski, Southampton U, UK
-
On Hierarchical Statistical Static Timing Analysis
[p. 1320]
-
B. Li, N. Chen, M. Schmidt, W. Schneider and U. Schlichtmann
Statistical static timing analysis deals with the
increasing variations in manufacturing processes to reduce the
pessimism in the worst case timing analysis. Because of the
correlation between delays of circuit components, timing model
generation and hierarchical timing analysis face more challenges
than in static timing analysis. In this paper, a novel method
to generate timing models for combinational circuits considering
variations is proposed. The resulting timing models have accurate
input-output delays and are about 80% smaller than the original
circuits. Additionally, an accurate hierarchical timing analysis
method at design level using pre-characterized timing models
is proposed. This method incorporates the correlation between
modules by replacing independent random variables to improve
timing accuracy. Experimental results show that the correlation
between modules strongly affects the delay distribution of the
hierarchical design and the proposed method has good accuracy
compared with Monte Carlo simulation, but is faster by three
orders of magnitude.
-
Increasing the Accuracy of SAT-Based Debugging
[p. 1326]
-
A. Suelflow, G. Fey, C. Braunsteine, U. Kuehne and R. Drechsler
Equivalence checking and property checking are
powerful techniques to detect error traces. Debugging these traces
is a time consuming design task where automation provides help.
In particular, debugging based on Boolean Satisfiability (SAT)
has been shown to be quite efficient. Given some error traces,
the algorithm returns fault candidates. But using random error
traces cannot ensure that a fault candidate is sufficient to explain
all erroneous behaviors.
Our approach provides a more accurate diagnosis by iterating
the generation of counterexamples and debugging. This increases
the accuracy of the debugging result and yields more valuable
counterexamples. As a consequence, fewer time-consuming manual
iterations between verification and debugging are required, and thus
debugging productivity increases.
-
GCS: High-Performance Gate-Level Simulation with GP-GPUs
[p. 1332]
-
D. Chatterjee, A. DeOrio and V. Bertacco
In recent years, the verification of digital designs has become one of
the most challenging, time consuming and critical tasks in the entire
hardware development process. Within this area, the vast majority
of the verification effort in industry relies on logic simulation
tools. However, logic simulators deliver limited performance when
faced with vastly complex modern systems, especially synthesized
netlists. The consequences are poor design coverage, delayed product
releases and bugs that escape into silicon. Thus, we developed
a novel GPU-accelerated logic simulator, called GCS, optimized
for large structural netlists. By leveraging the vast parallelism offered
by GP-GPUs and a novel netlist balancing algorithm tuned for
the target architecture, we can attain an order-of-magnitude performance
improvement on average over commercial logic simulators,
and simulate large industrial-size designs, such as the OpenSPARC
processor core design.
-
Trace Signal Selection for Visibility Enhancement in Post-Silicon Validation
[p. 1338]
-
X. Liu and Q. Xu
Today's complex integrated circuit designs increasingly rely on
post-silicon validation to eliminate bugs that escape from pre-silicon
verification. One effective silicon debug technique is to
monitor and trace the behaviors of the circuit during its normal
operation. However, designers can only afford to trace a small
number of signals in the design due to the associated overhead.
Selecting which signals to trace is therefore a crucial issue for
the effectiveness of this technique. This paper proposes an automated
trace signal selection strategy that is able to dramatically
enhance the visibility in post-silicon validation. Experimental
results on benchmark circuits show that the proposed technique
is more effective than existing solutions.
Moderators: J. Schloeffel, Mentor Graphics, DE; G. Dintale, LIRMM, FR
-
A New Design-for-Test Technique for SRAM Core-Cell Stability Faults
[p. 1344]
-
A. Ney, L. Dilillo, P. Girard, S. Pravossoudovitch, A. Virazel, M. Bastian and V. Gouin
Core-cell stability represents the ability of the core-cell
to keep the stored data. With the rapid development of
semiconductor memories, their test is becoming a major concern
in VDSM technologies. It provides information about the SRAM
design reliability, and its effectiveness is therefore mandatory for
safety applications. Existing core-cell stability Design-for-Test
(DfT) techniques consist in controlling the voltage levels of bit
lines to apply a weak write stress on the core-cell under test. If
the core-cell is weak, the weak write stress induces the faulty
swap of the core-cell. However, these solutions are costly in terms
of area and test application time, and generally require
modifications of critical parts of the SRAM (core-cell array
and/or the structure generating the internal auto-timing). In this
paper, we present a new DfT technique for stability fault
detection. It consists in modulating the word line activation in
order to perform an adjustable weak write stress on the targeted
core-cell for stability fault detection. Compared to existing DfT
solutions, the proposed technique offers many advantages:
programmability, low area overhead, low test application time.
Moreover, it does not require any modification of critical parts of
the SRAM.
-
Test Cost Reduction for Multiple-Voltage Designs with Bridge Defects through Gate-Sizing
[p. 1349]
-
S. Khursheed, B.M. Al-Hashimi and P. Harrod
Multiple-voltage is an effective dynamic power reduction
design technique. Recent research has shown that testing
for resistive bridging faults in such designs requires more than
one voltage setting for 100% defect coverage; however switching
between several supply voltage settings has a detrimental impact
on the overall cost of test. This paper proposes an effective
Gate Sizing technique for reducing test cost of multi-Vdd designs
with bridge defects. Using synthesized ISCAS benchmarks and a
parametric fault model, experimental results show that for all the
circuits, the proposed technique achieves 100% defect coverage
at a single Vdd setting; in addition it has a lower overhead than
the recently proposed Test Point Insertion technique in terms of
timing, area and power.
Index Terms - Gate Sizing, Test Cost, Resistive Bridging Faults,
Multiple-Vdd designs, Design for Testability
-
A Diagnosis Algorithm for Extreme Space Compaction
[p. 1355]
-
S. Holst and H.-J. Wunderlich
During volume testing, test application time, test data
volume and high performance automatic test equipment (ATE)
are the major cost factors. Embedded testing including built-in
self-test (BIST) and multi-site testing are quite effective cost
reduction techniques, but they may make diagnosis more complex.
This paper presents a test response compaction scheme and a
corresponding diagnosis algorithm which are especially suited
for BIST and multi-site testing. The experimental results on
industrial designs show that test time and response data volume
are reduced significantly and that the diagnostic resolution even improves
with this scheme. A comparison with X-Compact indicates that
simple parity information provides higher diagnostic resolution
per response data bit than more complex signatures.
Keywords - Diagnosis, Embedded diagnosis, Multi-site test, Compaction,
Design-for-test
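As a rough picture of what extreme space compaction by parity means, the following Python sketch XORs the outputs of all scan chains into one bit per shift cycle and lists the cycles whose parity deviates from the expected value. The data layout and function names are assumptions made for this illustration; the abstract does not specify the compactor structure, and the paper's diagnosis algorithm is not reproduced here.

```python
from functools import reduce
from operator import xor

def parity_compact(responses):
    """responses[c][k] is the bit from scan chain k in shift cycle c."""
    return [reduce(xor, cycle, 0) for cycle in responses]

def failing_cycles(observed, expected):
    """Shift cycles in which the compacted parity bit differs."""
    pairs = zip(parity_compact(observed), parity_compact(expected))
    return [c for c, (o, e) in enumerate(pairs) if o != e]

expected = [[0, 1, 1], [1, 0, 0], [1, 1, 1]]
single_error = [[0, 1, 1], [1, 1, 0], [1, 1, 1]]   # chain 1 flipped in cycle 1
double_error = [[1, 0, 1], [1, 0, 0], [1, 1, 1]]   # two flips in cycle 0

print(failing_cycles(single_error, expected))
print(failing_cycles(double_error, expected))
```

The second case shows the usual limitation of a single parity bit: an even number of erroneous bits in the same cycle cancels out, so a diagnosis procedure has to reason over many patterns rather than one cycle in isolation.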
Moderators: C. Haubelt, Erlangen-Nuremberg U, DE; D. Gajski, UC Irvine, US
-
Thermal-Aware Memory Mapping in 3D Designs
[p. 1361]
-
A.-C. Hsieh and T. Hwang
DRAM is usually used as main memory for program
execution. The thermal behavior of a memory block in a 3D SIP is affected
not only by the power behavior but also the heat dissipating ability of
that block. The power behavior of a block is related to the applications
run on the system, while the heat dissipating ability is determined
by the number of tiers and the position at which the block is located. Therefore,
a thermal-aware memory allocator should consider the following two
points: first, not only the power behavior
of a memory block but also its physical location during memory
mapping; second, the changing temperature of a physical block during
the execution of programs. In this paper, we propose a memory mapping
algorithm that takes these two points into consideration. Our
technique can be classified as static thermal management to be applied
to embedded software designs. Experiments show that our method can
reduce the temperature of the memory system by 17.2°C compared to a
straightforward mapping in the best case, and by 13.4°C on average.
-
Static Analysis to Mitigate Soft Errors in Register Files
[p. 1367]
-
J. Lee and A. Shrivastava
With continuous technology scaling, soft errors are
becoming an increasingly important design concern even for
earth-bound applications. While compiler approaches have the
potential to mitigate the effect of soft errors with minimal
runtime overheads, static vulnerability estimation - an essential
part of compiler approaches - is lacking due to its inherent
complexity. This paper presents a static analysis approach for
Register File (RF) vulnerability estimation. We decompose the
vulnerability of a register into intrinsic and conditional basic-block
vulnerabilities. This decomposition allows us to develop
a fast, yet reasonably accurate, linear equation-based RF vulnerability
estimation mechanism. We demonstrate its practical
application to compiler optimizations. Our experimental results
on benchmarks from the MiBench suite indicate not only that our
static RF vulnerability estimation is fast and accurate, but also
that compiler optimizations enabled by our static estimation can
achieve very cost-effective protection of register files against soft
errors.
-
Using Dynamic Compilation for Continuing Execution under Reduced Memory Availability
[p. 1373]
-
O. Ozturk and M. Kandemir
This paper explores the use of dynamic compilation
for continuing execution even if one or more of the memory
banks used by an application become temporarily unavailable
(but their contents are preserved), that is, the number of
memory banks available to the application varies at runtime.
We implemented the proposed dynamic compilation approach
using a code instrumentation system and performed experiments
with 12 embedded benchmark codes. The results collected so
far are very encouraging and indicate that, even when all
the overheads incurred by dynamic compilation are included,
the proposed approach still brings significant benefits over an
alternate approach that suspends application execution when
there is a reduction in memory bank availability and resumes
later when all the banks are up and running.
Moderators: M. Ortmanns, Ulm U, DE; C. Grimm, TU Vienna, AT
-
A Design Methodology for Fully Reconfigurable Delta-Sigma Data Converters
[p. 1379]
-
Y. Ke, J. Craninkx and G. Gielen
This paper presents a design methodology for fully
reconfigurable low-voltage Delta-Sigma converters such as those
used in next-generation wireless applications. The design
methodology first finds the power-optimized noise transfer
functions for the different standards at system level and then
translates them into optimal granularities of programmability
and circuit parameters such as resistance and capacitance values
for the integrators. Reconfiguration is done in the passive
component arrays, modulator orders, number of quantizer bits
and transconductance for optimal power consumption. This gives
the design the best trade-off between power and performance for
every configuration mode.
-
Optimal Sizing of Configurable Devices to Reduce Variability in Integrated Circuits
[p. 1385]
-
P. Wilson and R. Wilcock
This paper describes a systematic approach that
facilitates yield improvement of integrated circuits at the
post-manufacture stage. A new Configurable Analogue
Transistor (CAT) structure is presented that allows the
adjustment of devices after manufacture. The technique
enables both performance and yield to be improved as
part of the normal test process. The optimal sizing of the
inserted CAT devices is crucial to ensure the greatest
improvement in yield and this paper considers this
challenge in detail. An analysis and description of the
underlying theory of the sizing problem is given along
with examples of incorrect sizing. Guidelines to achieve
optimal CAT sizing are proposed, and results are
provided to demonstrate the overall effectiveness of the
CAT approach.
-
An Automated Design Flow for Vibration-Based Energy Harvester Systems
[p. 1391]
-
L. Wang, T.J. Kazmierski, B.M. Al-Hashimi, S.P. Beeby and D. Zhu
This paper proposes, for the first time, an automated
energy harvester design flow which is based on a single HDL
software platform that can be used to model, simulate, configure
and optimise energy harvester systems. A demonstrator prototype
incorporating an electromagnetic mechanical-vibration-based
micro-generator and a limited number of library models
has been developed and a design case study has been carried
out. Experimental measurements have validated the simulation
results which show that the outcome from the design flow can
improve the energy harvesting efficiency by 75%.
-
Enhanced Design of Filterless Class-D Audio Amplifier
[p. 1397]
-
C.W. Lin, B.-S. Hsieh and Y.C. Lin
In this work, we propose an enhanced design method
for filterless class-D audio amplifiers based on a multilevel
architecture. The multilevel technique consists of a
multilevel converter and a time division adder followed by a
modulator. In this method, the modulated signal is
arranged into several time divisions and then
integrated into a binary number. After that, the binary
number is encoded into a set of parallel control signals
for the multilevel converter. The multilevel converter
delivers a multilevel signal to the loudspeaker instead of the
conventional two-level signal. Consequently, the
total harmonic distortion (THD) and signal-to-noise ratio
(SNR) improve significantly without sacrificing power efficiency.
Moreover, the proposed method can be applied to many
class-D amplifier designs by simply inserting a time division
adder behind the modulator and replacing the output stage with a
multilevel converter.
Organizer: A. Jerraya, CEA-LETI, FR
Moderator: R. Ernst, TU Braunschweig, DE
Panelists: N. Topham, D. Pulley, M. Harrand, J. Goodacre, G. Martin and Y. Tanurhan
-
-
Multicore solutions are needed, under both technology and market constraints, to
replace a large part of FPGA and ASIC products in the embedded system market.
Many solutions featuring a variety of ad-hoc hardware and software multicore
architectures have been developed, mainly by startups. No clear winning solution has
yet emerged to conquer a significant part of this fast-growing market. The panel will
present the most promising products and solutions and discuss the winning strategy to
market.
Moderators: T. Ishihara, Kyushu U, JP ; B. Mishra, Southampton U, UK
-
Effectiveness of Adaptive Supply Voltage and Body Bias as Post-Silicon Variability Compensation
Techniques for Full-Swing and Low-Swing On-Chip Communication Channels
[p. 1404]
-
G. Paci, D. Bertozzi and L. Benini
Adaptive body bias (ABB) and adaptive supply voltage (ASV)
have been shown to be effective methods for post-silicon tuning of
circuit properties to reduce variability. While their properties have
been compared on generic combinational circuits or microprocessor
circuit sub-blocks, the advent of multi-core systems is bringing
a new application domain to the forefront. Global interconnects are
evolving to complex communication channels with drivers and receivers,
in an attempt to mitigate the effects of reverse scaling and
reduce power. The characterization of the performance spread
of these links and the exploration of effective and power-aware
compensation techniques for them is becoming a key design issue.
This work compares the variability compensation efficiency
of ABB vs ASV when put to work in two representative link architectures
of today's ICs: a traditional full-swing interconnect and
a low-swing signaling scheme for low-power communication. We
provide guidelines for the post-silicon variability compensation of
these communication channels.
-
Dynamic Thermal Management in 3D Multicore Architectures
[p. 1410]
-
A.K. Coskun, J. Ayala, D. Atienza, T. Simunic-Rosing and J. Leblebici
Technology scaling has caused the feature sizes to shrink continuously,
whereas interconnects, unlike transistors, have not followed the same
trend. Designing 3D stack architectures is a recently proposed approach
to overcome the power consumption and delay problems associated with
the interconnects by reducing the length of the wires going across the
chip. However, 3D integration introduces serious thermal challenges due
to the high power density resulting from placing computational units
on top of each other. In this work, we first investigate how the existing
thermal management, power management and job scheduling policies
affect the thermal behavior in 3D chips. We then propose a dynamic
thermally-aware job scheduling technique for 3D systems to reduce the
thermal problems at very low performance cost. Our approach can
also be integrated with power management policies to reduce energy
consumption while avoiding the thermal hot spots and large temperature
variations.
-
Energy Minimization for Real-Time Systems with Non-Convex and Discrete Operation Modes
[p. 1416]
-
F. Dabiri, A. Vahdatpour, M. Potkonjak and M. Sarrafzadeh
We present an optimal methodology for the dynamic
voltage scheduling problem in the presence of realistic assumptions
such as leakage power and intra-task overheads. Our contribution
is an optimal algorithm for energy minimization that
concurrently assumes the presence of (1) non-convex energy-speed
models as opposed to previously studied convex models, (2) a discrete
set of operational modes (voltages) and (3) intra-task energy
and delay overhead. We tested our algorithm on MediaBench
and task sets used in previous papers. Our simulation results
show an average of 22% improvement in energy reduction in
comparison with optimal algorithms for convex models without
switching overhead and an average of 24% with consideration for
energy and delay overheads. This analysis lays the groundwork
for improving functionality in CAD design through non-convex
techniques for discrete models.
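As a toy illustration of the problem setting only (not the authors' algorithm), the sketch below exhaustively assigns one discrete mode to each task and keeps the cheapest assignment that meets a deadline. The mode table, the overhead values and the function name `best_schedule` are all hypothetical, and the overhead is charged per switch between consecutive tasks, which simplifies the intra-task overhead considered in the paper.

```python
from itertools import product

# Hypothetical discrete modes: (time per unit of work, energy per unit of work).
# Energy is deliberately non-convex in speed: the slowest mode costs more
# than the middle one, as leakage could cause.
MODES = [(1.0, 5.0), (2.0, 2.0), (4.0, 3.0)]
SWITCH_TIME = 0.5    # assumed delay overhead per mode switch
SWITCH_ENERGY = 0.5  # assumed energy overhead per mode switch


def best_schedule(work, deadline):
    """Exhaustively search mode assignments; return (energy, modes) or None."""
    best = None
    for assignment in product(range(len(MODES)), repeat=len(work)):
        time = sum(w * MODES[m][0] for w, m in zip(work, assignment))
        energy = sum(w * MODES[m][1] for w, m in zip(work, assignment))
        # Charge the overhead each time consecutive tasks use different modes.
        switches = sum(a != b for a, b in zip(assignment, assignment[1:]))
        time += switches * SWITCH_TIME
        energy += switches * SWITCH_ENERGY
        if time <= deadline and (best is None or energy < best[0]):
            best = (energy, assignment)
    return best
```

Exhaustive search is exponential in the number of tasks, so it only serves to make the objective and constraints concrete.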
-
Exploiting Narrow-Width Values for Thermal-Aware Register File Designs
[p. 1422]
-
S. Wang, J. Hu, S.G. Ziavras and S. W. Chung
Localized heating-up creates thermal hotspots
across the chip, with the integer register file ranked as the hottest
unit in high-performance microprocessors. In this paper, we
perform a detailed study on the thermal behavior of a low-power
value-aware register file (VARF) that is subjected to internal
fine-grain hotspots. To further optimize its thermal behavior,
we propose and evaluate three thermal-aware control schemes,
thermal sensor (TS), access counter (AC), and register-id (ID)
based, to balance the access activity and thus the temperature
across different partitions in the VARF. The simulation results
using SPEC CINT2000 benchmarks show that the register-id
controlled VARF (ID-VARF) scheme achieves optimized thermal
behavior at minimum cost as compared to the other schemes. We
further evaluate the performance impact of the thermal-aware
VARF design with the dynamic thermal management (DTM). The
experimental results show that the ID-VARF can improve the
performance by 26.1% and 7.2% over the conventional register
file and the original VARF design, respectively.
Moderators: K. Goossens, NXP Semiconductors and TU Delft, NL; C. Bouganis, Imperial College London, UK
-
Visual Quality Analysis for Dynamic Backlight Scaling in LCD Systems
[p. 1428]
-
A. Bartolini, M. Ruggiero and L. Benini
With the trend toward high-quality large form
factor displays on high-end handhelds, LCD backlight accounts
for a significant and increasing percentage of the total energy
budget. Substantial energy savings can be achieved by dynamically
adapting backlight intensity levels while compensating
for the ensuing visual quality degradation with image pixel
transformations. Several compensation techniques have been
recently developed for this purpose, but none of them has been
fully characterized in terms of quality losses considering jointly
the non-idealities present in a real embedded video chain and
the peculiar characteristics of the human visual system (HVS).
We have developed a quality analysis framework based on an
accurate embedded visualization system model and HVS-aware
metrics. We use it to assess the visual quality performance of
existing dynamic backlight scaling (DBS) solutions. Experimental
results show that none of the DBS techniques available today is
fully capable of keeping quality loss under control, and that there
is significant room for improvement in this direction.
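The basic idea of compensating a dimmed backlight with a pixel transformation can be sketched as follows. This is a minimal sketch assuming an idealized linear model in which perceived luminance is proportional to pixel value times backlight level; it ignores gamma and the other non-idealities the abstract refers to, and the function name `compensate` is hypothetical.

```python
def compensate(pixels, backlight):
    """Brighten 8-bit pixel values to offset a dimmed backlight.

    `backlight` is the dimming factor, with 0 < backlight <= 1.
    Returns the compensated pixels and the number of clipped values.
    """
    out = []
    clipped = 0
    for p in pixels:
        v = round(p / backlight)
        if v > 255:
            # Saturated pixels cannot be fully compensated: this is
            # where quality loss comes from in the idealized model.
            v = 255
            clipped += 1
        out.append(v)
    return out, clipped
```

With a backlight factor of 0.5, the pixels 0, 100 and 200 become 0, 200 and 255, and the last one is clipped.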
-
A Parallel Approach for High Performance Hardware Design of Intra Prediction in H.264/AVC
Video Codec
[p. 1434]
-
M. Shafique, L. Bauer and J. Henkel
The H.264/AVC Intra Frame Codec (i.e. all frames are
coded as I-frames) targets high-resolution/high-end encoding applications
(e.g. digital cinema and high quality archiving),
providing much better compression efficiency at lower computational
complexity compared to MJPEG2000. Moreover, in the case of
video coding of very high motion scenes, the number of Intra
Macroblocks is dominant. Intra Prediction is a compute-intensive
and memory-critical part that consumes 80% of the computation
time of the entire Intra Compression process when executing the
H.264 encoder on a MIPS processor [13]. We therefore present a
novel hardware design for H.264 Intra Prediction that processes all the
prediction modes in parallel inside one integrated module (i.e.
mode-level parallelism) enabling us to exploit the full space of optimization.
It exhibits a group-based write-back scheme to reduce
the memory transfers in order to facilitate the fast mode-decision
schemes. Our Luma 4x4 hardware is 3.6x, 5.2x, and 5.5x faster
than state-of-the-art approaches [13], QS0 [14], and [15], respectively.
Our results show that processing Luma 16x16, Chroma
8x8, and Luma 4x4 with the proposed approach is 7.2x, 6.5x, and
1.8x faster (while giving an energy saving of 60%, 80%, and
74%) when compared with Dedicated Module Approach [13]
(each prediction mode is processed with its independent hardware
module i.e. a typical ASIC style for Intra Prediction). We
get an area saving of 58% for Luma 4x4 hardware.
-
Efficient Constant-Time Entropy Decoding for H.264
[p. 1440]
-
N. Iqbal and J. Henkel
Diverse approaches to parallel implementation of
H.264 have been proposed; however, they all share a common
problem. The entropy decoder in H.264 remains mapped on a
single processing element (PE). Due to the inherently sequential
and context-adaptive nature of the entropy decoder, it cannot be
parallelized. This creates a bottleneck for the performance of the
entire decoding process. Depending on the type of the processing
core and the video bit-rate, the performance of the entire decoding
process is subject to the process of entropy decoding. It is,
therefore, necessary to research and implement new algorithmic
solutions to compensate for this bottleneck, and thereby make
optimal use of parallel implementations of the H.264 decoder on
mainstream multi-core systems.
This paper presents a new CAVLC decoding method which is
derived by constructing custom CAVLC decoding tables using
"table grouping". Compared to the conventional "sequential
table look-up" method [5], which requires multiple memory accesses,
our proposed method accesses the custom tables only
once for the decoding of any symbol. Moreover, in our proposed
method, the symbol decoding time does not depend on the symbol
length and is constant for each symbol, resulting in a nearly
linear increase in computational complexity with increasing
video fidelity, as compared to a nonlinear increase in earlier
proposed methods. Experimental results show that our proposed
algorithm features up to 7x higher performance and 83% fewer
memory accesses compared to conventional methods. We compare
against three commonly used, state-of-the-art CAVLC algorithms:
table look-up by sequential search [5], table
look-up by binary search [9], and "Moon's method" [16].
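A generic way to decode a variable-length symbol with a single table access can be sketched as below. This shows the well-known flat look-up technique on a small hypothetical prefix code. It does not use the real CAVLC tables and is not the authors' "table grouping" construction. The names `CODE`, `build_table` and `decode` are illustrative.

```python
# Hypothetical prefix code (not the real CAVLC tables).
CODE = {"1": "A", "01": "B", "001": "C", "000": "D"}
MAX_LEN = max(len(c) for c in CODE)


def build_table(code, max_len):
    """Map every max_len-bit window to (symbol, code length)."""
    table = [None] * (1 << max_len)
    for bits, sym in code.items():
        pad = max_len - len(bits)
        base = int(bits, 2) << pad
        # Every window that starts with this codeword decodes to it.
        for fill in range(1 << pad):
            table[base + fill] = (sym, len(bits))
    return table


TABLE = build_table(CODE, MAX_LEN)


def decode(bitstring):
    """Decode with exactly one table access per symbol."""
    out = []
    pos = 0
    padded = bitstring + "0" * MAX_LEN  # padding so the last window is full
    while pos < len(bitstring):
        window = int(padded[pos:pos + MAX_LEN], 2)
        sym, length = TABLE[window]
        out.append(sym)
        pos += length
    return out
```

The table size grows exponentially with the longest codeword, which is one reason practical decoders group or split their tables rather than use a single flat one.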
-
Predictive Models for Multimedia Applications Power Consumption Based on Use-Case and OS Level
Analysis
[p. 1446]
-
P. Bellasi, W. Fornaciari and D. Siorpaes
Power management at any abstraction level is a key
issue for many mobile multimedia and embedded applications.
In this paper a design workflow to generate system-level power
models will be presented, tailored to support quantitative runtime
power optimization policies to be implemented within
an operating system. The approach we followed to derive
power models is strongly use-case oriented. Starting from a
comprehensive, general and accurate model of a representative
architecture for embedded applications (including a multi-core
MPSoC, accelerators, interfaces and peripherals), a methodology
to derive compact models is presented, based upon the distinctive
characteristics of the selected use cases. The methodology to
generate such models, whose exploitation is foreseen within a
power manager working at the OS level, is the focus of the paper.
The value and accuracy of the approach are quantitatively and
statistically justified through extensive experiments carried out
on a development board designed for multimedia applications.
Moderators: S. Nowick, Columbia U, US ; F. Fummi, Verona U, IT
-
Algebraic Techniques to Enhance Common Sub-Expression Elimination for Polynomial System
Synthesis
[p. 1452]
-
S. Gopalakrishnan and P. Kalla
Common sub-expression elimination (CSE)
serves as a useful optimization technique in the synthesis of
arithmetic datapaths described at RTL. However, CSE has
a limited potential for optimization when many common
sub-expressions are not exposed. Given a suitable transformation
of the polynomial system representation, which exposes many
common sub-expressions, subsequent CSE can
offer a higher degree of optimization. The objective of this
paper is to develop algebraic techniques that perform such
a transformation, and present a methodology to integrate
it with CSE to further enhance the potential for optimization.
In our experiments, we show that this integrated approach
outperforms conventional methods in deriving area-efficient
hardware implementations of polynomial systems.
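A small worked example of why a transformation can expose sharing is sketched below. It is illustrative only and does not reproduce the paper's algebraic techniques. As written, the two polynomials share no obvious sub-expression. Rewritten as powers of (x + y), they share almost everything. The function names are hypothetical.

```python
def direct(x, y):
    """Two polynomials evaluated as written; little sharing is visible."""
    p1 = x * x + 2 * x * y + y * y
    p2 = x * x * x + 3 * x * x * y + 3 * x * y * y + y * y * y
    return p1, p2


def transformed(x, y):
    """The same polynomials after rewriting them as powers of (x + y)."""
    s = x + y    # common sub-expression exposed by the transformation
    p1 = s * s   # (x + y)^2, reused below
    p2 = p1 * s  # (x + y)^3
    return p1, p2
```

The transformed form needs one addition and two multiplications in total, and both versions agree on all integer inputs.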
-
Sequential Logic Synthesis Using Symbolic Bi-Decomposition
[p. 1458]
-
V. Kravets and A. Mishchenko
This paper uses an under-approximation of the unreachable states
of a design to derive an incomplete specification of combinational
logic. The resulting incompletely-specified functions
are decomposed to enhance the quality of technology-dependent
synthesis. The decomposition choices are computed
implicitly using a novel formulation of symbolic bi-decomposition
that is applied recursively to decompose
logic in terms of simple primitives. The ability of BDDs to
compactly represent certain exponentially large combinatorial
sets helps us to implicitly enumerate and explore a variety
of decomposition choices, improving the quality of synthesized
circuits. Benefits of the symbolic technique are demonstrated
in sequential synthesis of publicly available benchmarks
as well as on realistic industrial designs.
-
On Decomposing Boolean Functions via Extended Cofactoring
[p. 1464]
-
A. Bernasconi, V. Ciriani, G. Trucco and T. Villa
We investigate restructuring techniques based on
decomposition/factorization, with the objective to move critical
signals toward the output while minimizing area. A specific
application is synthesis for minimum switching activity (or high
performance), with minimum area penalty, where decompositions
with respect to specific critical variables are needed (the ones of
highest switching activity for example). In this paper we describe
new types of factorization that extend Shannon cofactoring and
are based on projection functions that change the Hamming
distance of the original minterms and on appropriate don't care
sets, to favor logic minimization of the component blocks. We
define two new general forms of decomposition that are special
cases of the pattern F = G(H(X),Y). The related implementations,
called P-Circuits, show experimentally promising results in area
with respect to Shannon cofactoring.
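For reference, the classical Shannon cofactoring that serves as the baseline here, f = x f|x=1 + x' f|x=0, can be checked on a truth table as in the sketch below. It shows only the baseline expansion, not the extended P-Circuit decompositions, and the helper names are hypothetical.

```python
from itertools import product


def cofactor(f, index, value):
    """Return f with input number `index` fixed to `value`."""
    def g(*bits):
        fixed = list(bits)
        fixed[index] = value
        return f(*fixed)
    return g


def shannon(f, index):
    """Rebuild f as x*f1 + x'*f0 around input number `index`."""
    f1, f0 = cofactor(f, index, 1), cofactor(f, index, 0)

    def g(*bits):
        x = bits[index]
        return (x & f1(*bits)) | ((1 - x) & f0(*bits))
    return g


def majority(a, b, c):
    """Example function: 3-input majority."""
    return (a & b) | (a & c) | (b & c)
```

Expanding `majority` around any of its three inputs reproduces the original function on all eight input combinations.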
-
Register Placement for High-Performance Circuits
[p. 1470]
-
M.-F. Chiang, T. Okamoto and T. Yoshimura
In modern sub-micron design, achieving low-skew
clock distribution is a challenge for high-performance
circuits. Symmetric global clock distribution and clock tree
synthesis (CTS) for local clock optimization have been used so far,
but new methodologies are necessary as the technology node
advances. In this paper, we study the register placement problem,
which is a key component of local clock optimization for
high-performance circuit design along with local clock distribution.
We formulate it as a minimum weighted maximum independent
set problem on a weighted conflict graph and propose a novel
efficient two-stage heuristic to solve it. To reduce the graph size,
techniques based on register flipping and Manhattan circle are
also presented. Experiments show that our heuristic can place all
registers without overlaps and achieve significant improvement
on the total and maximal register movement.
Moderators: J. Vial, Infineon, FR; T. Yoneda, Nara Institute of Science and Technology, JP
-
Scalable Adaptive Scan (SAS)
[p. 1476]
-
A. Chandra, R. Kapur and Y. Kanzawa
Scan compression has emerged as the most successful
solution to solve the problem of rising manufacturing test
cost. Compression technology is not hierarchical in
nature. Hierarchical implementations need test access
mechanisms that keep the isolation between the different
tests applied through the different compressors and
decompressors. In this paper we discuss a test access
mechanism for Adaptive Scan that addresses the problem
of reducing test data and test application time in a
hierarchical and low pin count environment. An active test
access mechanism is used that becomes part of the
compression schemes and unifies the test data for multiple
CODEC implementations, thus allowing for hierarchical
DFT implementations with flat ATPG.
-
LFSR-Based Test-Data Compression with Self-Stoppable Seeds
[p. 1482]
-
M. Koutsoupia, E. Kalligeros, X. Kavousianos and D. Nikolos
The main disadvantage of LFSR-based compression is
that it usually has to be combined with a constrained ATPG
process, and, as a result, it cannot be effectively applied to IP
cores of unknown structure. In this paper, a new LFSR-based
compression approach that overcomes this problem is proposed.
The proposed method allows each LFSR seed to encode as many
slices as possible. For achieving this, a special purpose slice,
called stop-slice, that indicates the end of a seed's usage is encoded
as the last slice of each seed. Thus, the seeds include by
construction the information of where they should stop and, for
that reason, we call them self-stoppable. A stop-slice generation
procedure is proposed that exploits the inherent test set characteristics
and generates stop slices which impose minimum compression
overhead. Moreover, the architecture for implementing
the proposed technique requires negligible additional hardware
overhead compared to the standard LFSR-based architecture.
The proposed technique is also accompanied by a seed calculation
algorithm that tries to minimize the number of calculated seeds.
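The notion of a seed that carries its own stopping point can be sketched with a tiny software LFSR. This is a minimal sketch under stated assumptions: a 4-bit register, feedback taps on bits 0 and 1, 3-bit slices and the function name `expand_seed` are all hypothetical choices, and the paper's stop-slice generation and seed calculation procedures are not reproduced.

```python
def expand_seed(seed, stop_slice, width=3, max_slices=32):
    """Run a 4-bit LFSR from `seed`; return the slices before the stop-slice."""
    state = seed
    data = []
    for _ in range(max_slices):
        value = 0
        for _ in range(width):
            out = state & 1
            feedback = (state ^ (state >> 1)) & 1  # taps on bits 0 and 1 (assumed)
            state = (state >> 1) | (feedback << 3)
            value = (value << 1) | out
        if value == stop_slice:
            return data  # the stop-slice marks the end of this seed's usage
        data.append(value)
    raise ValueError("stop-slice never produced")
```

Starting from seed 0b1001 with stop-slice 0b111, the register produces the slices 4, 6 and 5 before the stop-slice appears, so the decompressor would load the next seed at that point.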
-
Seed Selection in LFSR-Reseeding-Based Test Compression for the Detection of Small-Delay Defects
[p. 1488]
-
M. Yilmaz and K. Chakrabarty
Test data volume and test application time are major concerns
for large industrial circuits. In recent years, many compression
techniques have been proposed and evaluated using industrial designs.
However, these methods do not target sequence- or timing-dependent
failures while compressing the test patterns. Timing-related
failures in high-performance integrated circuits are now
increasingly dominated by small-delay defects (SDDs). We present
an SDD-aware seed-selection technique for LFSR-reseeding-based
test compression. Experimental results show that significant
test-pattern-quality increase can be achieved when seeds are selected
to target SDDs.
-
A Generic Framework for Scan Capture Power Reduction in Fixed-Length Symbol-Based Test
Compression Environment
[p. 1494]
-
X. Liu and Q. Xu
Growing test data volume and overtesting caused by excessive
scan capture power are two of the major concerns for the industry
when testing large integrated circuits. Various test data
compression (TDC) schemes and low-power X-filling techniques
were proposed to address the above problems. These
methods, however, exploit the very same "don't-care" bits in
the test cubes to achieve different objectives and hence may
contradict each other. In this work, we propose a generic
framework for reducing scan capture power in a test compression
environment. Using the entropy of the test set to measure
the impact of capture power-aware X-filling on the potential
test compression ratio, the proposed holistic solution is able to
keep capture power under a safe limit with little compression
ratio loss for any fixed-length symbol-based TDC method. Experimental
results on benchmark circuits demonstrate the efficacy
of the proposed approach.
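The entropy measure mentioned above can be illustrated with a short calculation. This is a minimal sketch assuming the test set is given as a fully specified bit string that is cut into fixed-length symbols. The function name `symbol_entropy` is hypothetical, and the capture-power-aware X-filling itself is not modeled.

```python
from collections import Counter
from math import log2


def symbol_entropy(bits, symbol_len):
    """Shannon entropy, in bits per symbol, of fixed-length symbols."""
    symbols = [bits[i:i + symbol_len] for i in range(0, len(bits), symbol_len)]
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * log2(c / total) for c in counts.values())
```

A filling that makes every 4-bit symbol identical has entropy 0, while one that yields two equally frequent symbols has entropy 1 bit per symbol. Lower entropy leaves more room for a symbol-based code to compress, which is why it can serve as a proxy for the compression ratio lost to a given X-filling.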
Moderators: A. Gerstlauer, U of Texas at Austin, US; D. Borrione, TIMA Laboratory, FR
-
Correct-by-Construction Generation of Device Drivers Based on RTL Testbenches
[p. 1500]
-
N. Bombieri, F. Fummi, G. Pravadelli and S. Vinco
The generation of device drivers is a very time consuming
and error prone activity. All the strategies proposed up
to now to simplify this operation require a manual, even formal,
specification of the device driver functionalities. In system-level
design, IP functionalities are tested by using testbenches,
implemented to contain the communication protocols to correctly
interact with the device. The aim of this paper is to present a
methodology to automatically generate device drivers from the
testbench of any RTL IP. The only manual step required is to
tag the states corresponding to the different device functionalities.
The Extended Finite State Machines (EFSMs) are then used to
create a correct-by-construction two-level device driver: the lower
level deals with architectural choices, while the higher one is
derived from the EFSMs and it implements the communication
protocols. The effectiveness of this methodology has been proved
by applying it to a platform provided by STMicroelectronics.
-
Buffer Minimization of Real-Time Streaming Applications Scheduling on Hybrid CPU/FPGA
Architectures
[p. 1506]
-
J. Zhu, I. Sander and A. Jantsch
We address the problem of real-time streaming applications
scheduling on hybrid CPU/FPGA architectures.
The main contribution is a two-step approach to minimize
the buffer requirement for streaming applications
with throughput guarantees. A novel declarative way of
constraint based scheduling for real-time hybrid SW/HW
systems is proposed, while the application throughput is
guaranteed by periodic phases in execution. We use a
voice-band modem application to exemplify the scheduling
capabilities of our method. The experimental results
show the advantages of our techniques in terms of both lower buffer
requirements and higher throughput guarantees compared
to the traditional PAPS method.
-
A Formal Approach for Specification-Driven AMS Behavioral Model Generation
[p. 1512]
-
S. Mukherjee, A. Ain, S.K. Panda, R. Mukhopadhyay and P. Dasgupta
Behavioral models for analog and mixed signal
(AMS) designs are developed at various levels of abstraction,
using various types of languages, to cater to a wide variety of requirements,
ranging from verification, design space exploration, and
test generation to application demonstration. In this paper
we present a high-level formalism for capturing the AMS design
intent from the specification and present techniques for automatic
generation of AMS behavioral models. The proposed formalism
is a language independent one, yet the design intent is modeled
at a level of abstraction which enables easy translation into
common modeling standards. We demonstrate the translation
into VerilogA and SPICE, which are fundamentally different
standards for behavioral modeling. The proposed approach is
demonstrated using a family of Low Dropout Regulators (LDO)
as the reference.
-
SC-DEVS: An Efficient SystemC Extension for the DEVS Model of Computation
[p. 1518]
-
F. Madlener, H.G. Molter and S.A. Huss
This paper describes a systematic approach to integrate
the Discrete Event Specified System (DEVS) methodology
into SystemC. It thus combines Model of Computation
(MoC) specific properties and the features of an
advanced SystemC environment. The execution of abstract
system level DEVS models is comparable to pure SystemC
models and is significantly faster compared to other DEVS
environments. Thus, system level models based on abstract
MoCs may easily be executed in a SystemC environment.
The proposed integration is realized as a non-introspective
extension to the SystemC 2.2 kernel. The DEVS models
are implemented on an additional software layer above
the SystemC simulation kernel. Our approach may be
used simultaneously with other layered extensions, e.g.,
SystemC-AMS or TLM.
Moderators: P. Lysaght, Xilinx, US; K. Bertels, TU Delft, NL
-
Exploiting Clock Skew Scheduling for FPGA
[p. 1524]
-
S. Bae, P. Mangalagiri and N. Vijaykrishnan
Clock skew scheduling (CSS) is an effective
technique to optimize the clock period of sequential designs.
However, it is not effective in the
presence of certain design structural constraints that limit
the CSS. In this paper, we present an analysis of several
design structural constraints that affect the CSS and
propose techniques to resolve these constraints.
Furthermore, we propose a CSS FPGA architecture and
a novel clock-period optimization (CPO) flow that tackles
some of these constraints by exploiting the reconfigurability
of FPGAs. Experimental results
demonstrate that the proposed FPGA architecture with
the CPO flow achieves an average performance
improvement of 24.4%, which is an average
improvement of 10.7% over the CPO flow
that does not consider the constraints.
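The basic benefit of clock skew scheduling can be seen on a ring of registers. The sketch below is a simplified textbook-style model, not the paper's CPO flow. It considers only setup constraints of the form t[i] + d[i] <= t[i+1] + T and ignores hold constraints, setup times and FPGA routing. The function name `ring_periods` is hypothetical.

```python
def ring_periods(delays):
    """Clock periods for a ring of registers, without and with skew.

    delays[i] is the longest path delay from register i to register
    (i + 1) mod n. Returns (zero-skew period, skewed period, arrivals).
    """
    n = len(delays)
    zero_skew = max(delays)   # every stage must fit in one period
    skewed = sum(delays) / n  # summing the constraints around the ring
    # Clock arrival times that meet t[i] + delays[i] <= t[i+1] + skewed.
    arrivals = [0.0]
    for d in delays[:-1]:
        arrivals.append(arrivals[-1] + d - skewed)
    return zero_skew, skewed, arrivals
```

For stage delays of 10 and 4, the zero-skew period is 10, while delaying the second register's clock by 3 allows a period of 7.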
-
Accelerating FPGA-Based Emulation of Quasi-Cyclic LDPC Codes with Vector Processing
[p. 1530]
-
X. Chen, J. Kang, S. Lin and V. Akella
FPGAs are widely used for evaluating the error-floor
performance of LDPC (low-density parity check) codes. We propose
a scalable vector decoder for FPGA-based implementation
of quasi-cyclic (QC) LDPC codes that takes advantage of the
high bandwidth of the embedded memory blocks (called Block
RAMs in a Xilinx FPGA) by packing multiple messages into the
same word. We describe a vectorized overlapped message passing
algorithm that results in a 3.5X to 5.5X speedup over state-of-the-art
FPGA implementations in the literature.
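Packing several short messages into one wide memory word, so that a single access serves several of them, can be sketched as follows. The message width, the ordering (first message in the lowest bits) and the function names are assumptions for illustration. The paper's actual memory layout is not given in the abstract.

```python
def pack(messages, bits):
    """Pack small unsigned messages into one integer word."""
    word = 0
    for i, m in enumerate(messages):
        assert 0 <= m < (1 << bits), "message does not fit in the field"
        word |= m << (i * bits)  # first message lands in the lowest bits
    return word


def unpack(word, bits, count):
    """Recover `count` messages of `bits` bits each from a packed word."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]
```

With 4-bit messages, four of them fit in a 16-bit word, so one read returns all four.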
-
Runtime Reconfiguration of Custom Instructions for Real-Time Embedded Systems
[p. 1536]
-
H.P. Huynh and T. Mitra
This paper explores runtime reconfiguration of
custom instructions in the context of multi-tasking real-time
embedded systems. We propose a pseudo-polynomial time algorithm
that minimizes processor utilization through customization
and runtime reconfiguration, while satisfying all the timing
constraints. Our experimental infrastructure consists of a Stretch
customizable processor supporting runtime reconfiguration as
the hardware platform and realistic embedded benchmarks as
applications. We observe that runtime reconfiguration of custom
instructions can help to reduce the processor utilization by up
to 64%. The experimental results also demonstrate that our
algorithm is highly scalable and achieves optimal or near optimal
(3% difference) processor utilization.
Organizer/Moderator: M. Dietrich, Fraunhofer IIS/EAS Dresden, DE
-
Digital Design at a Crossroads - How to Make Statistical Design Industrially Relevant [p. 1542]
-
U. Schlichtmann, M. Schmidt, M. Pronath, V. Glöckel, H. Kinzelbach, M. Dietrich, U. Eichler
and J. Haase
Statistical analysis is generally seen as the next EDA
technology for timing and power sign-off. Research in this field
saw significant activity starting about five years ago.
Recently, interest appears to have fallen off somewhat. Also,
while a lot of focus has been put on research fundamentals,
extremely few applications in industry have been reported so far.
Therefore, a group including Infineon Technologies as a leading
semiconductor IDM and various universities and research
institutes, as well as an EDA provider, has tackled key challenges
to enable statistical design in industry in a publicly funded
project called "Sigma65". Sigma65 strives to provide key
foundations to allow a change from traditional deterministic
design methods to future design methods driven by statistical
considerations. The project starts with statistical modeling and
optimization of library components and ranges to statistical
techniques for designing ICs on gate level and higher levels. In
this paper, we present some results of this project, demonstrating
how the interaction between industrial perspective, research
institutions and EDA provider enables solutions which are
applicable already in the near future. After an overview of the
industrial perspective on the current situation in dealing with
variations, recent results on both statistical timing and power
analysis will be given. In addition, recent research advances in
fast yield estimation concerning parametric timing yield will be
presented.
Keywords: Simulation, digital IC design, statistical timing
analysis, statistical power analysis
-
Performance Optimal Speed Control of Multi-Core Processors under Thermal Constraints
[p. 1548]
-
V. Hanumaiah, S. Vrudhula and K. Chatha
Advances in chip-multiprocessor processing capabilities
have led to increased power consumption and temperature
hotspots. Maintaining the on-chip temperature is important
for power reduction and reliability. Achieving the
highest performance while maintaining the temperature
constraint is a challenge. We develop analytical solutions for
the optimal control of frequencies for each core in a chip-multiprocessor.
The objective is to reduce the makespan or the
latest task completion time of all tasks. We show that the optimal
frequency policy is bang-bang when the temperature constraint
is not active and is exponential when the temperature constraint
is active. We show that there is a significant improvement in
overall throughput with our proposed solution and yet all cores
operate under the thermal maximum.
-
Scalable Compile-Time Scheduler for Multi-Core Architectures
[p. 1552]
-
M. Pelcat, P. Menuet, S. Aridhi and J.-F. Nezan
As the number of cores continues to grow in both digital
signal and general purpose processors, tools which perform
automatic scheduling from model-based designs are
of increasing interest. This scheduling consists of statically
distributing the tasks that constitute an application between
available cores in a multi-core architecture in order to minimize
the final latency. This problem has been proven to
be NP-complete. A static scheduling algorithm is usually
described as a monolithic process, and carries out two distinct
functionalities: choosing the core to execute a specific
function and evaluating the cost of the generated solutions.
This paper describes a scheduling module which splits
these functionalities into two sub-modules. This division
improves scalability in terms of schedule
quality and computation time, and also separates the heuristic
complexity from the architecture model precision.
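The separation between choosing a core and evaluating the cost of that choice can be sketched as a mapper that takes the cost model as a parameter. This is a minimal sketch with hypothetical names. It ignores task dependencies and communication, which a real static scheduler would have to handle, and it is not the authors' module.

```python
def schedule(tasks, n_cores, cost):
    """Greedy mapper: place each task on the core the cost model ranks best.

    `tasks` is a list of (name, duration) pairs. `cost` is any function
    cost(loads, core, duration) -> number. Returns (mapping, makespan).
    """
    loads = [0.0] * n_cores
    mapping = {}
    for name, duration in tasks:
        core = min(range(n_cores), key=lambda c: cost(loads, c, duration))
        loads[core] += duration
        mapping[name] = core
    return mapping, max(loads)


def finish_time(loads, core, duration):
    """Simple cost model: completion time if the task runs on `core`."""
    return loads[core] + duration
```

Swapping `finish_time` for a more detailed architecture model changes the precision of the evaluation without touching the mapping heuristic.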
-
Distributed Peak Power Management for Many-Core Architectures
[p. 1556]
-
J. Sartori and R. Kumar
Recently proposed techniques for peak power management
[4] involve centralized decision-making and assume
quick evaluation of the various power management states.
These techniques do not prevent instantaneous power from
exceeding the peak power budget, but instead trigger corrective
action when the budget has been exceeded. Similarly,
they are not suitable for many-core architectures (processors
with tens or possibly hundreds of cores on the same
die) due to an exponential explosion in the number of global
power management states.
In this paper, we look at a hierarchical and a gradient
ascent-based technique for decentralized peak power management
for many-core architectures. The proposed techniques
prevent power from exceeding the peak power budget
and enable the placement of several more cores on a die
than what the power budget would normally allow. We show
up to 47% (33% on average) improvements in throughput
for a given power budget. Our techniques outperform the
static oracle by 22%.
-
Generating the Trace Qualification Configuration for MCDS from a High Level Language
[p. 1560]
-
J. Braunes and R.G. Spallek
This paper introduces a high level trace qualification
language and compiler which enable the user to define analysis
tasks efficiently and to fully utilize the powerful features of Infineon's
Multi-Core Debug Solution (MCDS) without the need to
get into the internals. The language and the compiler are
already in industrial use where software development is based
on MCDS-enabled SoCs, supporting developers in achieving
better product quality and shorter product development cycles.
-
Dynamic and Distributed Frequency Assignment for Energy and Latency Constrained MP-SoC
[p. 1564]
-
D. Puschini, F. Clermidy, P. Benoit, G. Sassatelli and L. Torres
In this paper we present an adaptive technique to
locally adjust the frequency of processing elements on MP-SoC.
The proposed method, based on Game Theory, optimizes the
system while fulfilling dynamic constraints. A telecom test-case
has been used to demonstrate the effectiveness of our technique.
For the evaluated scenario, the proposed technique has obtained
up to 20% of latency gain and 38% of energy gain.
-
A MILP-Based Approach to Path Sensitization of Embedded Software
[p. 1568]
-
J.C. Costa and J.C. Monteiro
We propose a new methodology based on Mixed
Integer Linear Programming (MILP) for determining the input
values that will exercise a specified execution path in a program.
In order to seamlessly handle variable values, pointers and
arrays, and variable aliasing, our method uses memory addresses
for data references. This implies a dynamic methodology where
all decisions are taken as the program executes. During execution,
we gather constraints for the MILP problem, whose solution will
directly yield the input values for the desired path. We present
results that demonstrate the effectiveness of this approach. This
methodology was implemented into a fully functional tool that
is capable of handling medium sized real programs specified in
the C language. Our work is motivated by the complexity of
validating embedded systems and uses a similar approach to an
existing HDL functional vector generation. The joint solution of
the MILP problems will provide a hardware/software co-validation
tool.
-
An Efficient and Deterministic Multi-Tasking Run-Time Environment for Ada and the
Ravenscar Profile on Atmel AVR®32 UC3 Microcontroller [p. 1572]
-
K. Nyborg Gregertsen and A. Skavhaug
This paper describes how an efficient and deterministic
multitasking run-time environment supporting
the Ravenscar tasking model of Ada 2005 was implemented
on the Atmel AVR32 UC3A microcontroller.
The open source GNU Ada Compiler (GNAT GPL
2007) was also ported to AVR32 as a part of this
work, making a working Ada development environment
available on the architecture for the first time.
-
Toward a Runtime System for Reconfigurable Computers: A Virtualization Approach
[p. 1576]
-
M. Sabeghi and K. Bertels
In this paper we propose a virtualization layer to
handle the program execution on reconfigurable computers
in order to address one of their biggest problems which is the
management of the reconfigurable hardware in a multitasking
environment. The virtualization layer is responsible
for allocating the hardware at run-time based on the status
of the system. Furthermore, it provides a consistent and low
overhead interface to decouple the process of software
development from hardware design which will result in the
software to be independent of the underlying reconfigurable
hardware. This paper discusses the virtual layer's
specification and components. Our preliminary results for a
prototype simulated on Molen hardware organization show
a competitive performance compared with an optimal
hardware allocation.
Keywords: run-time support, reconfigurable
computers, virtualization
-
Separate Compilation and Execution of Imperative Synchronous Modules
[p. 1580]
-
E. Vecchie, J.-P. Talpin and K. Schneider
The compilation of imperative synchronous languages like
Esterel has been widely studied; the separate compilation of synchronous
modules has not, and remains a challenge. We propose a new compilation
method inspired by traditional sequential code generation techniques
to produce coroutines whose hierarchical structure reflects the control
flow of the original source code. A minimalistic runtime system executes
separately compiled modules.
Organizer: R. Leupers, RWTH Aachen U, DE
Moderator: M. de Lange, ACE, NL
-
Programming MPSoC Platforms: Road Works Ahead!
[p. 1584]
-
R. Leupers, S. Ha, A. Vajda, R. Doemer, M. Bekooij and A. Nohl
This paper summarizes a special session on multicore/
multi-processor system-on-chip (MPSoC) programming
challenges. The current trend towards MPSoC platforms in most
computing domains does not only mean a radical change in
computer architecture. Even more important from a SW
developer's viewpoint, at the same time the classical sequential
von Neumann programming model needs to be overcome.
Efficient utilization of the MPSoC HW resources demands
radically new models and corresponding SW development tools,
capable of exploiting the available parallelism and guaranteeing
bug-free parallel SW. While several standards are established in
the high-performance computing domain (e.g. OpenMP), it is
clear that more innovations are required for successful
deployment of heterogeneous embedded MPSoC. On the other
hand, at least for coming years, the freedom for disruptive
programming technologies is limited by the huge amount of
certified sequential code that demands a more pragmatic,
gradual tool and code replacement strategy.
Moderators: J. Baumgartner, IBM Corporation, US ; G. Cabodi, Politecnico di Torino, IT
-
Faster SAT Solving with Better CNF Generation
[p. 1590]
-
B. Chambers, P. Manolios and D. Vroon
Boolean satisfiability (SAT) solving has become an enabling
technology with wide-ranging applications in numerous
disciplines. These applications tend to be most naturally
encoded using arbitrary Boolean expressions, but to
use modern SAT solvers, one has to generate expressions in
Conjunctive Normal Form (CNF). This process can significantly
affect SAT solving times. In this paper, we introduce a
new linear-time CNF generation algorithm. We have implemented
our algorithm and have conducted extensive experiments,
which show that our algorithm leads to faster SAT
solving times and smaller CNF than existing approaches.
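For context, the classic Tseitin encoding is a well-known way to turn an arbitrary Boolean expression into CNF in linear time. The sketch below shows that baseline only; it is not the algorithm introduced in the paper. The tuple-based expression format and the function names are assumptions made for illustration.

```python
from itertools import count, product


def tseitin(expr):
    """Tseitin-encode expr into CNF.

    expr is a variable name (str) or a tuple: ("not", e), ("and", e1, e2),
    ("or", e1, e2). Returns (clauses, input variable ids, number of variables);
    clauses are lists of signed integers in DIMACS style.
    """
    ids, clauses, fresh = {}, [], count(1)

    def lit(e):
        if isinstance(e, str):              # input variable
            if e not in ids:
                ids[e] = next(fresh)
            return ids[e]
        op, *args = e
        if op == "not":                     # negation needs no new variable
            return -lit(args[0])
        a, b = lit(args[0]), lit(args[1])
        g = next(fresh)                     # fresh variable naming this gate
        if op == "and":                     # g <-> (a and b)
            clauses.extend([[-g, a], [-g, b], [g, -a, -b]])
        elif op == "or":                    # g <-> (a or b)
            clauses.extend([[g, -a], [g, -b], [-g, a, b]])
        else:
            raise ValueError(f"unknown operator: {op}")
        return g

    root = lit(expr)
    clauses.append([root])                  # assert that the formula holds
    return clauses, ids, next(fresh) - 1


def satisfiable(clauses, n_vars):
    """Brute-force check, only meant for tiny examples."""
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return True
    return False


clauses, ids, n_vars = tseitin(("and", ("or", "x", "y"), ("not", "x")))
```

Each binary gate contributes one fresh variable and three clauses, so the output grows linearly with the size of the input expression.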
-
Exploiting Structure in an AIG Based QBF Solver
[p. 1596]
-
F. Pigorsch and C. Scholl
In this paper we present a procedure for solving
quantified boolean formulas (QBF), which uses And-Inverter
Graphs (AIGs) as the core data-structure. We make extensive
use of structural information extracted from the input formula
such as functional definitions of variables and non-linear quantifier
structures. We show how this information can directly
be exploited by the symbolic, AIG based representation. We
implemented a prototype QBF solver based on our ideas and
performed a number of experiments proving the effectiveness of
our approach, and moreover, showing that our method is able
to solve QBF instances on which state-of-the-art QBF solvers
known from literature fail.
-
An Efficient Path-Oriented Bitvector Encoding Width Computation Algorithm for Bit-Precise
Verification
[p. 1602]
-
N. He and M.S. Hsiao
Bit-precise verification with variables modeled as bitvectors
has recently drawn much interest. However, a huge
search space usually results after bit-blasting. To accelerate
the verification of bit-vector formulae, we propose an efficient
algorithm to discover non-uniform encoding widths W_e
of variables in the verification model, which may be smaller
than their original modeling widths but sufficient to find a
counterexample. Different from existing approaches, our algorithm
is path-oriented, in that it takes advantage of the
controllability and observability values in the structure of
the model to guide the computation of the paths, their encoding
widths and the effective adjustment of these widths
in subsequent steps. For path selection, a subset of single-bit
path-controlling variables is set to constant values. This
can restrict the search from those paths deemed less favorable
or that have been checked in previous steps, thus simplifying
the problem. Experiments show that our algorithm can
significantly speed up the search by focusing first on those
promising, easy paths for verifying those path-intensive models,
with reduced, non-uniform bitwidth encoding.
Moderators: F. Kienle, TU Kaiserslautern, DE; W. Eberle, IMEC, BE
-
Algorithm-Architecture Co-Design of Soft-Output ML MIMO Detector for Parallel Application
Specific Instruction Set Processors
[p. 1608]
-
M. Li, R. Fasthuber, D. Novo, B. Bougard, L. Van Der Perre and F. Catthoor
Emerging SDR baseband platforms are usually based
on multiple DLP+ILP processors with massive parallelism
[10]. Although these platforms would theoretically enable
advanced SDR signal processing, existing work implemented
basic systems and simple algorithms. Importantly,
MIMO is not fully supported in most implementations
[7][9][11]. [1] implemented MIMO but with a simple
linear detector. Our work explores the feasibility for SDR
implementations of soft-output ML MIMO detectors, which
bring 6-12 dB SNR gains when compared to popular linear
detectors. Although soft-output ML MIMO detectors
are considered to be challenging even for ASICs [3][4], we
combine architecture-friendly algorithms, application specific
instructions, code transformations and ILP/DLP explorations
to make SDR implementations feasible. In our work,
a 2x4 ADRES based ASIP with 16-way SIMD can deliver
193Mbps for 2x2 64QAM, and 368Mbps for 2x2 16QAM
transmissions. To the best of our knowledge, this is the first
work exploring SDR based soft-output ML MIMO detectors.
-
A Low-Power ASIP for IEEE 802.15.4a Ultra-Wideband Impulse Radio Baseband Processing
[p. 1614]
-
C. Bachmann, A. Genser, J. Hulzink, M. Berekovic and C. Steger
The IEEE 802.15.4a amendment has introduced
ultra-wideband impulse radio (UWB IR) as a promising physical
layer for energy-efficient, low data rate communications. A
critical part of the UWB IR receiver design is the low-power
implementation of the digital baseband processing required for
synchronization and data decoding. In this paper we present the
development of an application-specific instruction-set processor
(ASIP) that is tailored to the requirements defined by the
baseband algorithms. We report a number of optimizations
applied to the algorithms as well as to the hardware architecture.
This enables performance increases up to a factor of 122x and
energy consumption decreases up to 90x as compared to a 16-bit
baseline architecture. Furthermore, this ASIP offers greater
flexibility due to programmability as compared to an ASIC
implementation.
-
ASIP-Based Flexible MMSE-IC Linear Equalizer for MIMO Turbo-Equalization Applications
[p. 1620]
-
A.R. Jafri, D. Karakolah, A. Baghdadi and M. Jezequel
A novel 16-bit flexible Application-Specific Instruction-set
Processor for an MMSE-IC Linear Equalizer, used in
an iterative turbo receiver, is presented in this paper. The
proposed ASIP has an SIMD architecture with a specialized
instruction-set and 7-stage pipeline control. It supports
diverse requirements of MIMO-OFDM wireless standards
such as use of QPSK, 16-QAM and 64-QAM modulation
in 2x2 and 4x4 spatially multiplexed MIMO-OFDM
environments. For these various operational modes, analysis
of MMSE-IC LE equations and corresponding complex
data representations was conducted. Efficient computational
and storage resource sharing is proposed through:
(1) Matrix Register Banks (MRB) multiplexing, (2) 16-bit
Complex Arithmetic Unit (CAU) comprised of 4 combined
complex adder/subtractor/multiplier units, 2 real multipliers,
5 complex adders, and 2 complex subtractors, and (3)
flexible 32-bit to 16-bit data conversion at multipliers' output.
With this architecture, the designed ASIP ensures, along
with flexibility, high performance in terms of throughput
and area. Logic synthesis results reveal a maximum clock
frequency of 546 MHz and a total area of 0.37 mm²
using 90 nm technology. For a 2x2 spatially multiplexed MIMO
system, the proposed ASIP achieves a throughput of 273
MSymbol/Sec.
-
Implementation of a Reduced-Lattice MIMO Detector for OFDM Systems
[p. 1626]
-
J. Soler-Garrido, H. Vetter, M. Sandell, D. Milford and A. Lillie
This paper presents a novel VLSI implementation of
a MIMO detector for OFDM systems. The proposed architecture
is able to perform both linear MMSE and reduced lattice-aided
MIMO detection, making it possible to adjust the balance
between performance and power consumption. In order to
facilitate real-time detection in reduced lattice mode of operation,
a novel fixed-complexity version of the LLL lattice reduction
algorithm has been developed, allowing for strict practical timing
requirements, such as those specified for new generation IEEE
802.11n wireless LAN systems, to be met. An implementation of
the MIMO detector for a system employing up to 4 transmit and
receive antennas is described and its complexity and performance
are evaluated.
Moderators: I. Harris, UC Irvine, US; V. Bertacco, U of Michigan, US
-
Increased Accuracy through Noise Injection in RTOS Simulation
[p. 1632]
-
H. Zabel and W. Mueller
Today, mobile and embedded real-time systems have
to cope with the migration and allocation of multiple software
tasks running on top of a real-time operating system (RTOS)
residing on one or multiple system processors. RTOS
simulation and timing analysis are applied for fast and early
estimation, to configure the RTOS towards the individual needs of the
application and environment. In this context, a high accuracy of
the simulation compared to an instruction set simulation (ISS)
is of key importance. In this paper, we investigate the accuracy
of abstract RTOS simulation and compare it to ISS and the
behavior of the physical system. We show that we can reach an
increased accuracy of the simulation when we inject noise into
the time model. Our results indicate that it is sufficient to inject
uniformly distributed random time values to the RTOS real-time
clock.
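A minimal sketch of the idea of perturbing a simulated real-time clock with uniformly distributed noise follows. The symmetric jitter range, the tick period and the function name `noisy_ticks` are assumptions for illustration; the abstract does not specify how the noise is parameterised.

```python
import random


def noisy_ticks(n_ticks, period_us, max_jitter_us, seed=0):
    """Return timestamps of an abstract RTOS clock with injected noise.

    Each tick advances time by the nominal period plus a value drawn
    uniformly from [-max_jitter_us, +max_jitter_us] (assumed range).
    """
    rng = random.Random(seed)   # seeded so that simulation runs are repeatable
    now, stamps = 0.0, []
    for _ in range(n_ticks):
        now += period_us + rng.uniform(-max_jitter_us, max_jitter_us)
        stamps.append(now)
    return stamps


stamps = noisy_ticks(n_ticks=100, period_us=1000.0, max_jitter_us=50.0)
```

Seeding the generator keeps the perturbed time model deterministic across runs, which makes comparisons against an ISS trace reproducible.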
-
Flexible Energy-Aware Simulation of Heterogeneous Wireless Sensor Networks
[p. 1638]
-
F. Fummi, G. Perbellini, D. Quaglia and A. Acquaviva
This paper presents an accurate and scalable implementation of an energy-aware
simulator for wireless sensor networks (WSNs). Scalability and accuracy have been
achieved through an energy-aware instrumentation of the Instruction Set
Simulator of the node's microcontroller and a functional SystemC TLM model of the
radio module implementing the IEEE 802.15.4 protocol. The framework makes it possible
to execute actual software and to accurately evaluate its effect on the network
lifetime. We first prototype of a wireless sensor node. The methodology,
compared against state-of-the-art simulators such as NS-2, represents a
flexible and scalable solution for fast and accurate prototyping of WSN
software.
-
Selective State Retention Design Using Symbolic Simulation
[p. 1644]
-
A. Darbari, B.M. Al-Hashimi, D. Flynn and J. Biggs
Addressing both standby and active power is a major
challenge in developing System-on-Chip designs for battery-powered
products. Powering off sections of logic or memories
loses internal register and RAM states so designers have to weigh
up the benefits and costs of implementing state retention on
some or all of the power gated subsystems where state recovery
has significant real-time or energy cost, compared to resetting
the subsystem and re-acquiring state from scratch. Library
IP and EDA tools can support state retention in hardware
synthesized from standard RTL, but due to the silicon area costs
there is strong interest in retaining only certain selective state,
for example the "architectural state" of a CPU, to implement
sleep modes. Currently there is no known rigorous technique
for checking the integrity of selective state retention, and this
is due to the complexity of checking that the correctness of
the design is not compromised in any way. The complexity is
exacerbated due to the interaction between the retained and the
non-retained state, and exhaustive simulation rapidly becomes
infeasible. This paper presents a case study based on symbolic
simulation for assisting the designers to design and implement
selective retention correctly. The main finding of our study is
that the programmer visible state or the architectural state
of the CPU needs to be implemented using retention registers
whilst other micro-architectural enhancements such as pipeline
registers, TLBs and caches can be implemented using normal
registers without retention. This has a profound impact on power
and area savings for chip design. By selectively retaining the
state of the programmer's "architectural" model and not the
increasing proportion of extra state, one can incorporate
energy-efficient sleep modes. To the best of our knowledge this is the
first study in the area of rigorous design and implementation
of selective state retention.
Moderators: J. Machado da Silva, INESC, PT; C. Wegener, Infineon Technologies, DE
-
A Loopback-Based INL Test Method for D/A and A/D Converters Employing a Stimulus
Identification Technique
[p. 1650]
-
E. Korhonen and J. Kostamovaara
We propose a new method for the integral nonlinearity
(INL) and differential nonlinearity (DNL) testing of
D/A - A/D converter pairs employing the recently developed
stimulus identification method. This allows both converters to be
measured independently but simultaneously without significant
fault masking problems. Simulations show that the INL and DNL
estimation errors for 12-b A/D and D/A converters are less than
0.5 least significant bit (LSB) units, and experimental tests give
similar results.
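For readers unfamiliar with the two figures of merit, the sketch below computes DNL and INL in LSB units from a list of measured converter output levels, using the standard endpoint-fit definitions. It does not reproduce the stimulus identification method of the paper; the function name and the sample data are hypothetical.

```python
def dnl_inl(levels):
    """Compute DNL and INL (in LSB) from measured output levels.

    levels[k] is the measured output for code k. The LSB size is taken
    from the endpoint fit, i.e. the line through the first and last level.
    """
    n = len(levels)
    lsb = (levels[-1] - levels[0]) / (n - 1)
    # DNL: deviation of each step from the ideal step of one LSB.
    dnl = [(levels[k + 1] - levels[k]) / lsb - 1.0 for k in range(n - 1)]
    # INL: deviation of each level from the endpoint-fit line.
    inl = [(levels[k] - levels[0]) / lsb - k for k in range(n)]
    return dnl, inl


# Made-up levels for a 5-code converter, with one misplaced level.
dnl, inl = dnl_inl([0.0, 1.0, 2.5, 3.0, 4.0])
```

With these definitions, the INL at code k equals the running sum of the DNL values below k, so a single misplaced level appears as a positive and a negative DNL pair and one INL excursion.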
-
A Novel Self-Healing Methodology for RF Amplifier Circuits Based on Oscillation Principles
[p. 1656]
-
A. Goyal, M. Swaminathan and A. Chatterjee
This paper proposes a novel self-healing methodology
for embedded RF Amplifiers (LNAs) in RF sub-systems. The
proposed methodology is based on oscillation principles in which
the Device-under-Test (DUT) itself generates the output test
signature with the help of additional circuitry. The self-generated
test signature from the DUT is analyzed by using on-chip
resources for testing the LNA and controlling its calibration
knobs to compensate for multi-parameter variations in the LNA
manufacturing process. Thus, the proposed methodology enables
self-test and self-calibration of RF circuits without the need for
external test stimulus. The proposed methodology is
demonstrated through simulations as well as measurements
performed on an RF LNA.
-
An Approach to Linear Model-Based Testing and Diagnosis for Nonlinear Cascaded
Mixed-Signal Systems
[p. 1662]
-
R. Mueller, C. Wegener, H.-J. Jentschel, S. Sattler and H. Mattes
Linear Model-based Test and Diagnosis (MbT&D)
has been successfully applied to single-block modules like
Digital-to-Analog Converters (DACs) with a static non-linear transfer
characteristic. For multi-block modules, a diagnosis methodology
is needed that can deal with cascades of several linear and nonlinear
blocks.
In contrast to non-linear methods, linear MbT&D methods
only require matrix operations associated with relatively low
computational effort. A modification of the linear MbT&D in
combination with Volterra series is presented that can be applied
to cascaded non-linear systems, for example, a DAC followed
by a low-pass filter. This enables the simultaneous identification of numerous
frequency-domain Volterra kernels and, thus, testing of
compliance with data sheet specifications.
-
Enrichment of Limited Training Sets in Machine-Learning-Based Analog/RF Test
[p. 1668]
-
H.-G. Stratigopoulos, S. Mir and Y. Makris
This paper discusses the generation of information-rich,
arbitrarily-large synthetic data sets which can be used
to (a) efficiently learn tests that correlate a set of low-cost
measurements to a set of device performances and (b) grade
such tests with parts per million (PPM) accuracy. This is achieved
by sampling a non-parametric estimate of the joint probability
density function of measurements and performances. Our case
study is an ultra-high frequency receiver front-end and the
focus of the paper is to learn the mapping between a low-cost
test measurement pattern and a single pass/fail test decision
which reflects compliance to all performances. The small fraction
of devices for which such a test decision is prone to error
are identified and retested through standard specification-based
test. The mapping can be set to explore thoroughly the tradeoff
between test escapes, yield loss, and percentage of retested
devices.
Moderators: J. Marques-Silva, Southampton U, UK; R. Bloem, TU Graz, AT
-
Speculative Reduction-Based Scalable Redundancy Identification
[p. 1674]
-
H. Mony, J. Baumgartner, A. Mishchenko and R. Brayton
The process of sequential redundancy identification is
the cornerstone of sequential synthesis and equivalence
checking frameworks. The scalability of the proof obligations
inherent in redundancy identification hinges not only
upon the ability to cross-assume those redundancies, but
also upon the way in which these assumptions are leveraged.
In this paper, we study the technique of speculative
reduction for efficiently modeling redundancy assumptions.
We provide theoretical and experimental evidence
to demonstrate that speculative reduction is fundamental to
the scalability of the redundancy identification process under
various proof techniques. We also propose several techniques
to speed up induction-based redundancy identification. Experiments
demonstrate the effectiveness of our techniques in enabling
substantially faster redundancy identification, up to six orders
of magnitude on large designs.
-
Scalable Liveness Checking via Property-Preserving Transformations
[p. 1680]
-
J. Baumgartner and H. Mony
The ability of logic transformations to enhance safety
property checking has been well-established, and many
industrial-strength verification solutions accordingly rely
upon a variety of synthesis and abstraction techniques for
speed and scalability. However, little prior work has addressed
the applicability of such transformations in the domain
of liveness checking. In this paper, we provide the
theoretical foundation to enable the efficient use of a variety
of (possibly customized) transformations in a liveness-checking
framework. We demonstrate the practical utility of
this theory on a variety of complex verification problems.
-
Speeding up Model Checking by Exploiting Explicit and Hidden Verification Constraints
[p. 1686]
-
G. Cabodi, P. Camurati, L. Garcia, M. Murciano, S. Nocco and S. Quer
Constraints represent a key component of state-of-the-art
verification tools based on compositional approaches
and assume-guarantee reasoning. In recent years, most of the
research efforts on verification constraints have focused on
defining formats and techniques to encode, or to synthesize,
constraints starting from the specification of the design.
In this paper, we analyze the impact of constraints on the
performance of model checking tools, and we discuss how to
effectively exploit them. We also introduce an approach to
explicitly derive verification constraints hidden in the design
and/or in the property under verification. Such constraints may
simply come from true design constraints, embedded within the
properties, or may be generated in the general effort to reduce or
partition the state space. Experimental results show that, in both
cases, we can reap benefits for the overall verification process
in several hard-to-solve designs, where we obtain speed-ups of
more than one order of magnitude.
-
Strengthening Properties Using Abstraction Refinement
[p. 1692]
-
M. Purandare, T. Wahl and D. Kroening
Model Checking is an automated formal method for
verifying whether a finite-state system satisfies a user-supplied
specification. The usefulness of the verification result depends
on how well the specification distinguishes intended from non-intended
system behavior. Vacuity is a notion that helps formalize
this distinction in order to improve the user's understanding of
why a property is satisfied. The goal of this paper is to expose
vacuity in a property in a way that increases our knowledge of the
design. Our approach, based on abstraction refinement, computes
a maximal set of atomic subformula occurrences that can be
strengthened without compromising satisfaction. The result is a
shorter and stronger and thus, generally, more valuable property.
We quantify the benefits of our technique on a substantial set of
circuit benchmarks.
Moderators: M. Fujita, Tokyo U, JP; V. Kravets, IBM, US
-
Sequential Logic Rectifications with Approximate SPFDs
[p. 1698]
-
Y.-S. Yang, S. Sinha, A. Veneris, R.K. Brayton and D. Smith
In the digital VLSI cycle, logic transformations are often required to
modify the design to meet different synthesis and optimization goals.
Logic transformations on sequential circuits are hard to perform due to
the vast underlying solution space. This paper proposes an SPFD-based
sequential logic transformation methodology to tackle the problem with
no sacrifice on performance. It first presents an efficient approach to
construct approximate SPFDs (aSPFDs) for sequential circuits. Then,
it demonstrates an algorithm using aSPFDs to perform the desirable sequential
logic transformations using both combinational and sequential
don't cares. Experimental results show the effectiveness and robustness
of the approach.
-
Variable-Latency Design by Function Speculation
[p. 1704]
-
D. Baneres, J. Cortadella and M. Kishinevsky
Variable-latency designs may improve the performance of
those circuits in which the worst-case delay paths are infrequently activated.
Telescopic units emerged as a scheme to automatically synthesize
variable-latency circuits. In this paper, a novel approach is proposed that
brings three main contributions with regard to the methods used for
telescopic units: first, no multi-cycle timing analysis is required to ensure
the correctness of the circuit; second, the method can be applied to large
circuits; third, the circuit can be optimized for the most frequent input
patterns. The approach is based on finding approximations of critical
nodes in the netlist that substitute the exact behavior. Two cycles are
required when the approximations are not correct. These approximations
can be obtained by the simulation of traces applied to the circuit.
Experimental results on selected examples show a tangible speed-up
(15%) with a small area overhead (3%).
-
Fixed Points for Multi-Cycle Path Detection
[p. 1710]
-
V. D'Silva and D. Kroening
Accurate timing analysis is crucial for obtaining
the optimal clock frequency, and for other design stages such
as power analysis. Most methods for estimating propagation
delay identify multi-cycle paths (MCPs), which allow timing to
be relaxed, but ignore the set of reachable states, achieving
scalability at the cost of a severe lack of precision. Even
simple circuits contain paths affecting timing that can only
be detected if the set of reachable states is considered. We
examine the theoretical foundations of MCP identification and
characterise the MCPs in a circuit by a fixed point equation. The
optimal solution to this equation can be computed iteratively
and yields the largest set of MCPs in a circuit. Further, we
define conservative approximations of this set, show how different
MCP identification methods in the literature compare in terms
of precision, and show one method to be unsound. The practical
application of these results is a new method to detect multi-cycle
paths using techniques for computing invariants in a circuit. Our
implementation performs well on several benchmarks, including
an exponential improvement on circuits analysed in the literature.
|