Call for Papers: DATE 2014
Sensors add intelligence to a broad class of devices that incorporate functionalities such as sensing, actuation, and control. They are the core of smart components and subsystems; the challenge in realizing such smart systems therefore goes beyond the design of the individual components and subsystems, and consists of accommodating a multitude of functionalities, technologies, and materials that play a key role in augmenting our daily life.
Modern society's dependence on information and communication infrastructure (ICI) is so deeply entrenched that it should be treated on par with other critical lifelines of our existence, such as water and electricity. As is the case with any true lifeline, ICI must be reliable, affordable, and sustainable. Meeting these requirements (especially sustainability) is a continued critical challenge, which will be the focus of my talk. More precisely, I will provide an overview of information and communication technology trends in light of various societal and environmental mandates followed by a review of technologies, systems, and hardware/software solutions required to create a sustainable ICI.
Parallel Discrete Event Simulation (PDES) enables efficient validation of ESL models on multi-core simulation hosts. Out-of-order PDES is an advanced scheduling technique which allows multiple threads to run in parallel even in different simulation cycles. To maintain simulation semantics and timing accuracy, the compiler performs complex static conflict analysis so that the scheduler can make quick and safe decisions at run time and issue threads early. Often, however, out-of-order scheduling is prevented because of the unknown future behavior of the threads. In this paper, we extend the analysis in order to predict the future of candidate threads. Looking ahead of the current simulation state allows the scheduler to issue more threads in parallel, resulting in significantly reduced simulator run time. Our experimental results show simulation speedup up to 1.92x with only negligible increase in compile time.
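The scheduling idea described above can be illustrated with a small sketch. This is a hypothetical, much-simplified model (thread segments as read/write sets, conflict analysis reduced to set intersection over a fixed lookahead window), not the paper's actual compiler analysis:

```python
# Sketch of out-of-order PDES issue with lookahead (hypothetical simplified model).
# Each thread is a list of segments; each segment records the variables it
# reads and writes. The static conflict analysis is reduced here to set
# intersection over the next `lookahead` predicted segments.

def conflicts(seg_a, seg_b):
    """Two segments conflict if one writes what the other reads or writes."""
    return bool(seg_a["writes"] & (seg_b["reads"] | seg_b["writes"]) or
                seg_b["writes"] & seg_a["reads"])

def can_issue_early(candidate, running, lookahead=2):
    """Issue a candidate thread ahead of its simulation cycle only if none of
    its next `lookahead` segments conflicts with a running thread's segments."""
    for seg in candidate[:lookahead]:
        for other in running:
            for oseg in other[:lookahead]:
                if conflicts(seg, oseg):
                    return False
    return True

t1 = [{"reads": {"a"}, "writes": {"x"}}, {"reads": {"x"}, "writes": {"y"}}]
t2 = [{"reads": {"b"}, "writes": {"z"}}]   # independent of t1
t3 = [{"reads": {"y"}, "writes": {"b"}}]   # reads y, which t1 will write later

print(can_issue_early(t2, [t1]))  # True: no shared variables
print(can_issue_early(t3, [t1]))  # False: a future segment of t1 writes y
```

The second query shows why predicting the future matters: looking only at the current segments of t1 and t3 would reveal no conflict.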
The SystemC/TLM technologies are widely accepted in industry for fast system-level simulation. An important limitation of SystemC regarding performance is that the reference implementation is sequential, and the official semantics makes parallel execution difficult. As the number of cores in computers increases quickly, the ability to take advantage of host parallelism during a simulation is becoming a major concern. Most existing work on the parallelization of SystemC targets cycle-accurate simulation and would be inefficient on loosely timed systems, since such approaches cannot run in parallel processes that do not execute simultaneously. We propose an approach that explicitly targets loosely timed systems and offers the user a set of primitives to express tasks with duration, as opposed to the notion of time in SystemC, which allows only instantaneous computations and time elapses without computation. Our tool exploits this notion of duration to run the simulation in parallel. It runs on top of any (unmodified) SystemC implementation, which lets legacy SystemC code continue running as is. This allows the user to focus on the performance-critical parts of the program that need to be parallelized.
Fixed-point format is essential to most efficient Digital Signal Processing (DSP) implementations. The conversion of an algorithm specification to fixed-point precision targets the minimization of the implementation cost while guaranteeing a minimal processing accuracy. However, measuring such processing accuracy can be extremely time consuming and lead to long design cycles. In this paper, we study reference approaches to measure fixed-point errors of Linear Time-Invariant (LTI) systems without feedback. Unsurprisingly, we find the existing analytical approach significantly faster than a straightforward simulation-based estimation. However, we also show that such an analytical approach can incur high estimation errors for some particular bitwidth configurations. Accordingly, we propose a new hybrid approach, which is able to reduce the error of the analytical estimation by up to 4 times, while still being more than 10 times faster than the simulation-based estimation.
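The two reference approaches the abstract contrasts can be sketched for a feedback-free LTI system. The sketch below compares a simulation-based estimate against the standard uniform quantization-noise model (output noise power (q²/12)·Σh_k²); the filter taps, bitwidth, and input distribution are illustrative assumptions, not the paper's benchmarks:

```python
# Sketch: simulation-based vs. analytical fixed-point error estimation for a
# feedback-free LTI system (a 3-tap FIR). Assumes the usual uniform
# quantization-noise model; all numbers are illustrative.
import random

h = [0.5, 0.3, 0.2]            # FIR impulse response (no feedback)
frac_bits = 8
q = 2.0 ** -frac_bits          # quantization step

def quantize(v):
    return round(v / q) * q    # round-to-nearest fixed-point

def simulate_noise_power(n=20000, seed=1):
    rng = random.Random(seed)
    x = [rng.uniform(-1, 1) for _ in range(n)]
    err2 = 0.0
    for i in range(len(h) - 1, n):
        exact = sum(h[k] * x[i - k] for k in range(len(h)))
        fixed = sum(h[k] * quantize(x[i - k]) for k in range(len(h)))
        err2 += (fixed - exact) ** 2
    return err2 / (n - len(h) + 1)

# Analytical model: each quantized input contributes uniform noise of power
# q^2/12, shaped by the filter: P_out = (q^2 / 12) * sum(h_k^2).
analytical = (q ** 2 / 12) * sum(c * c for c in h)
simulated = simulate_noise_power()
print(analytical, simulated)
```

The analytical value is computed in constant time, while the simulated one needs thousands of filtered samples, which is the speed gap the abstract refers to.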
Because of the complexity of analog circuits, their verification presents many challenges. We propose a runtime verification algorithm to verify design properties of nonlinear analog circuits. Our algorithm is based on performing exploratory simulations in the state-time space using the Time-augmented Rapidly Exploring Random Tree (TRRT) algorithm. The proposed runtime verification methodology consists of i) incremental construction of the TRRT to explore the state-time space and ii) use of an incremental online monitoring algorithm to check whether the incremented TRRT satisfies or violates specification properties at each iteration. In comparison to Monte Carlo simulation, we require only a logarithmic order of memory and time to provide the same state-space coverage.
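A toy version of this exploration loop can convey the structure: grow a tree of (time, state) nodes and check a property monitor on every extension. The dynamics, property bound, and sampling ranges below are illustrative stand-ins, not the paper's TRRT algorithm:

```python
# Minimal sketch of time-augmented random-tree exploration with an online
# monitor (a much-simplified stand-in for the paper's TRRT). The toy
# dynamics, the property |x| <= 2, and all bounds are assumptions.
import random

def step(x, dt=0.1):
    return x + dt * (-0.5 * x)         # toy stable dynamics: dx/dt = -0.5x

def trrt_explore(x0, horizon=5.0, iters=200, seed=0):
    rng = random.Random(seed)
    tree = [(0.0, x0)]                 # nodes are (time, state) pairs
    for _ in range(iters):
        t_s, x_s = rng.uniform(0, horizon), rng.uniform(-2, 2)
        # nearest node in the combined (time, state) space
        t, x = min(tree, key=lambda n: abs(n[0] - t_s) + abs(n[1] - x_s))
        if t < horizon:
            new = (t + 0.1, step(x))
            tree.append(new)
            if abs(new[1]) > 2.0:      # online monitor checks each increment
                return tree, False     # property violated
    return tree, True

tree, ok = trrt_explore(1.5)
print(len(tree), ok)
```

Each iteration extends the tree by one node and immediately monitors it, mirroring the paper's incremental construct-then-check loop.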
Simulation of complex embedded and cyber-physical systems requires exploiting the computation power of available parallel architectures. Current simulation environments either do not address this parallelism or use separate models for parallel simulation and for analysis and synthesis, which might lead to model mismatches. We extend a formal modeling framework targeting heterogeneous systems with elements that enable parallel simulation. An automated flow is then proposed that, starting from a serial executable specification, generates an efficient MPI-based parallel simulation model using a constraint-based method. The proposed flow generates parallel models with acceptable speedups for a representative example.
Mutation testing is an established technique for evaluating validation thoroughness, but its adoption has been limited by the manual effort required to analyze the results. This paper describes the use of coverage discounting for mutation analysis, where undetected mutants are explained in terms of functional coverpoints, simplifying their analysis and saving effort. Two benchmarks are shown to compare this improved flow against regular mutation analysis. We also propose a confidence metric and simulation ordering algorithm optimized for coverage discounting, potentially reducing overall simulation time.
SystemC and Transaction Level Modeling (TLM) have become the de-facto standard for Electronic System Level (ESL) design. For the costly task of verification at ESL, simulation is the most widely used and scalable approach. Besides the Design Under Test (DUT), the TLM verification environment typically consists of stimuli generators and checkers, where the latter are responsible for detecting errors. However, in case of an error, the subsequent debugging process is still very time-consuming. In this paper, we present a scalable fault localization approach for SystemC TLM designs. The approach targets the described standard TLM verification environment and can be easily integrated into one. Our approach is inspired by software diagnosis techniques. We extend the concept of execution profiles of software programs, also known as program spectra, to handle the TLM simulation. The whole simulation consists of several runs; each run corresponds to the request-DUT-response path. During simulation our approach individually collects spectra for each run. Then, based on analyzing the differences between passed and failed runs, we determine possible fault locations. We demonstrate the quality of our approach by several experiments including TLM-2.0 designs. As shown in the experiments, the fault locations are identified accurately and very quickly.
It is projected that increasing on-chip integration with technology scaling will lead to the so-called dark silicon era in which more transistors are available on a chip than can be simultaneously powered on. It is conventionally assumed that the dark silicon will be provisioned with heterogeneous resources, for example dedicated hardware accelerators. In this paper we challenge the conventional assumption and build a case for homogeneous dark silicon CMPs that exploit the inherent variations in process parameters that exist in scaled technologies to offer increased performance. Since process variations result in core-to-core variations in power and frequency, the idea is to cherry pick the best subset of cores for an application so as to maximize performance within the power budget. To this end, we propose a polynomial time algorithm for optimal core selection, thread mapping and frequency assignment for a large class of multi-threaded applications. Our experimental results based on the Sniper multi-core simulator show that up to 22% and 30% performance improvement is observed for homogeneous CMPs with 33% and 50% dark silicon, respectively.
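The cherry-picking step can be sketched concretely. The brute-force subset search below is a stand-in for the paper's polynomial-time algorithm (which also handles thread mapping and frequency assignment), and the per-core frequency/power values are illustrative variation data:

```python
# Sketch of variation-aware core selection: choose the subset of cores that
# maximizes aggregate frequency for a given thread count without exceeding
# the power budget. Brute force here stands in for the paper's polynomial-time
# algorithm; the core table is illustrative post-variation data.
from itertools import combinations

cores = [  # (frequency GHz, power W) of each core after process variation
    (3.2, 9.0), (3.0, 7.5), (2.8, 6.0), (2.6, 5.0), (2.4, 4.0), (2.2, 3.5),
]

def select_cores(threads, power_budget):
    best, best_perf = None, -1.0
    for subset in combinations(range(len(cores)), threads):
        power = sum(cores[i][1] for i in subset)
        perf = sum(cores[i][0] for i in subset)
        if power <= power_budget and perf > best_perf:
            best, best_perf = subset, perf
    return best, best_perf

subset, perf = select_cores(threads=3, power_budget=18.0)
print(subset, perf)
```

Note that the best subset is not simply the three fastest cores: the two fastest together already consume most of the budget, so a slower, cooler core is picked instead.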
Pipelined computing is a promising paradigm for embedded system design; designing the scheduling policy for a pipelined system is, however, more involved. In this paper, we study the problem of energy minimization for coarse-grained pipelined systems under hard real-time constraints and propose a method based on an inverse use of the pay-burst-only-once principle. We formulate the problem by means of the resource demands of individual pipeline stages and solve it by quadratic programming. Our approach is scalable w.r.t. the number of pipeline stages. Simulation results using real-life applications as well as commercial processors are presented to demonstrate the effectiveness of our method.
We present a self-adaptive, hybrid Dynamic Power Management (DPM) scheme for many-core systems that targets concurrently executing applications with what we call "expanding" and "shrinking" resource allocations. To avoid frequent allocation and de-allocation, it enables applications to temporarily reserve their resources and to perform local power management decisions. The expand-to-shrink time periods and resource demands are predicted on-the-fly based on application-specific knowledge and the monitored system information. Experimental results demonstrate up to 15%-40% Energy-Delay² Product reduction of our scheme compared to state-of-the-art power management schemes. Self-adaptive local power-management decisions make our scheme scalable for large-scale many-core systems, as illustrated by numerous experiments.
Power efficiency is increasingly critical to battery-powered smartphones. Given that the user experience is what users value most, we propose that power optimization should directly respect the user experience. We conduct a statistical sample survey and study the correlation among the user experience, the system runtime activities, and the minimal required frequency of the application processor. This study motivates an intelligent self-adaptive scheme, SmartCap, which automatically identifies the most power-efficient state of the application processor according to system activities. Compared to prior Linux power adaptation schemes, SmartCap can save between 11% and 84% of power, depending on the application, with little decline in user experience.
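The core idea of capping the application processor's frequency can be illustrated with a minimal sketch. The frequency table, the cycle-demand input, and the utilization threshold are all illustrative assumptions, not SmartCap's actual model:

```python
# Sketch of frequency capping: pick the lowest CPU frequency that still keeps
# predicted utilization under a responsiveness threshold. The frequency bins,
# demand metric, and 0.8 cap are illustrative assumptions, not SmartCap's model.

FREQS_MHZ = [300, 600, 900, 1200, 1500]

def pick_frequency(cycle_demand_mcycles_per_s, util_cap=0.8):
    """Return the lowest frequency f such that demand / f <= util_cap."""
    for f in FREQS_MHZ:
        if cycle_demand_mcycles_per_s / f <= util_cap:
            return f
    return FREQS_MHZ[-1]               # saturate at the highest bin

print(pick_frequency(200))   # light load -> low frequency
print(pick_frequency(1000))  # heavy load -> high frequency
```

A real scheme would derive the demand estimate from monitored system activities and correlate the threshold with measured user experience, which is where the abstract's survey comes in.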
Modeling and estimating the power consumption of OLED displays is necessary to understand the energy behavior of emerging mobile devices. Although previous studies exist that model and estimate the power consumption of stationary display images, to the best of our knowledge no prior work deals with the runtime power behavior of an OLED display running real applications. This paper proposes a runtime power estimation scheme for OLED displays that involves monitoring kernel activities that capture the screen change events of running applications. The experimental results show that the proposed scheme estimates the display energy consumption of running applications with reasonable accuracy.
Index Terms - Power, energy, modeling, estimation
Ever-scaling process technology increases variations in transistors. These process variations cause large fluctuations in the access times of SRAM cells. Caches made of such SRAM cells cannot be accessed within the target clock cycle time, which reduces processor yield. To combat these access time failures in caches, many schemes have been proposed; they are, however, limited in their coverage and do not scale well at high failure rates. We propose a new L1 cache architecture (AVICA) employing asymmetric pipelining and pseudo multi-banking. Asymmetric pipelining eliminates all access time failures in L1 caches. Pseudo multi-banking minimizes the performance impact of asymmetric pipelining. For further performance improvement, additional architectural techniques are proposed. Our experimental results show that the proposed L1 cache architecture incurs less than a 1% performance hit compared to a conventional cache architecture with no access time failures. The proposed architecture is not sensitive to access time failure rates and has low overheads compared to previously proposed competitive schemes.
Cache performance is an important factor in modern computing systems due to large memory access latency. To exploit the principle of spatial locality, a requested data set and its adjacent data sets are often loaded from memory to a cache block simultaneously. However, the definition of adjacent data sets is strongly correlated with the memory organization. Commodity memory is a two-dimensional structure with two (row and column) access phases to locate the requested data set. Therefore, the adjacent data sets are neighbors of the requested data set in a linear order. In this paper, we propose a novel memory organization with dual-addressing modes as well as orthogonal memory access mechanisms. Our dual-addressing memory can be efficiently applied to two-dimensional memory access patterns. Furthermore, we propose a cache coherence protocol to tackle the cache coherence issue due to synonym data set of the dual-addressing memory. For benchmark kernels with two-dimensional memory access patterns, the dual-addressing memory achieves 60% performance improvement as compared to conventional memory. Both cache hit rate and cache utilization are improved after removing two-dimensional memory access patterns from conventional memory.
On-chip DRAM caches may alleviate the memory bandwidth problem in future multi-core architectures through reducing off-chip accesses via increased cache capacity. For memory intensive applications, recent research has demonstrated the benefits of introducing high capacity on-chip L4-DRAM as Last-Level-Cache between L3-SRAM and off-chip memory. These multi-core cache hierarchies attempt to exploit the latency benefits of L3-SRAM and capacity benefits of L4-DRAM caches. However, not taking into consideration the cache access patterns of complex applications can cause inter-core DRAM interference and inter-core cache contention. In this paper, we propose to re-architect existing cache hierarchies with a hybrid cache architecture, where the Last-Level-Cache is a combination of SRAM and DRAM caches. We propose an adaptive DRAM placement policy in response to the diverse requirements of complex applications with different cache access behaviors. It reduces inter-core DRAM interference and inter-core cache contention in SRAM/DRAM-based hybrid cache architectures, increasing the harmonic mean instruction-per-cycle throughput by 23.3% (max. 56%) and 13.3% (max. 35.1%) compared to the state of the art.
Low-power modes in modern microprocessors rely on low frequencies and low voltages to reduce the energy budget. Nevertheless, manufacturing-induced parameter variations can make SRAM cells unreliable, producing hard errors at supply voltages below Vccmin. Recent proposals provide rather low fault coverage due to the fault coverage/overhead trade-off. We propose a new fault-tolerant L1 cache, which combines SRAM and eDRAM cells in L1 data caches to provide 100% SRAM hard-error fault coverage. Results show that, compared to a conventional cache and assuming a 50% failure probability in low-power mode, leakage and dynamic energy savings are 85% and 62%, respectively, with minimal impact on performance.
Die-Stacked DRAM caches offer the promise of improved performance and reduced energy by capturing a larger fraction of an application's working set than on-die SRAM caches. However, given that their latency is only 50% lower than that of main memory, DRAM caches considerably increase latency for misses. They also incur a significant energy overhead for remote lookups in snoop-based multi-socket systems. Ideally, it would be possible to detect in advance that a request will miss in the DRAM cache and thus selectively bypass it. This work proposes a "dual grain filter" which successfully predicts whether an access is a hit or a miss in most cases. Experimental results with commercial and scientific workloads show that a 158KB dual-grain filter can correctly predict data block residency for 85% of all accesses to a 256MB DRAM cache. As a result, average off-die latency with our filter is within 8% of that possible with a perfectly accurate filter, which is impractical to implement.
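The two granularities of the filter can be sketched as a small table of coarse regions, each holding a fine-grain block bitmap. This is a generic illustration of the dual-grain idea; the region size, table organization, and update policy are assumptions, not the paper's exact 158KB design:

```python
# Sketch of a two-granularity residency filter in the spirit of the paper's
# dual-grain filter: coarse regions are tracked in a small table, and each
# tracked region keeps a fine-grain block-presence bitmap. Sizes and the
# update policy are illustrative assumptions.

REGION_BLOCKS = 16                     # fine-grain blocks per coarse region

class DualGrainFilter:
    def __init__(self):
        self.regions = {}              # region id -> block-presence bitmap

    def predict_hit(self, block_addr):
        region, off = divmod(block_addr, REGION_BLOCKS)
        bitmap = self.regions.get(region)
        return bitmap is not None and bool(bitmap >> off & 1)

    def on_fill(self, block_addr):     # the DRAM cache inserted this block
        region, off = divmod(block_addr, REGION_BLOCKS)
        self.regions[region] = self.regions.get(region, 0) | (1 << off)

f = DualGrainFilter()
f.on_fill(0x130)
print(f.predict_hit(0x130))  # True: block tracked as resident
print(f.predict_hit(0x131))  # False: same region, different block
print(f.predict_hit(0x999))  # False: untracked region -> predicted miss
```

A predicted miss lets the request bypass the DRAM cache and go straight to memory, which is the latency and remote-lookup saving the abstract targets.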
Phase Change Memory (PCM) is currently postulated as the best alternative for replacing Dynamic Random Access Memory (DRAM) as the technology for implementing main memories, thanks to significant advantages such as good scalability and low leakage. However, PCM also presents some drawbacks compared to DRAM, such as its lower endurance. This work presents a behavioral analysis of conventional cache replacement policies in terms of the amount of writes to main memory. In addition, new last-level cache (LLC) replacement algorithms are presented, aimed at reducing the number of writes to PCM and hence increasing its lifetime, without significantly degrading system performance.
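A write-aware replacement policy of this kind can be sketched in a few lines. The clean-first-within-a-window heuristic below is a generic illustration of the direction the abstract describes, not the paper's specific algorithms:

```python
# Sketch of a write-aware LLC replacement policy: prefer evicting a clean
# block (no PCM write-back needed) among the least-recently-used candidates.
# The window heuristic is a generic illustration, not the paper's policy.

def pick_victim(cache_set, window=2):
    """cache_set is ordered LRU-first; search a small LRU window for a clean
    block before falling back to the strict LRU victim."""
    for line in cache_set[:window]:
        if not line["dirty"]:
            return line["tag"]         # evicting it costs no PCM write
    return cache_set[0]["tag"]         # strict LRU fallback

s = [{"tag": "A", "dirty": True},      # LRU
     {"tag": "B", "dirty": False},
     {"tag": "C", "dirty": True}]      # MRU
print(pick_victim(s))  # "B": a clean block inside the window beats dirty LRU
```

Keeping the window small bounds how far the policy deviates from LRU, which is how such schemes limit the performance cost of saving writes.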
In this paper we propose a QR-decomposition hardware implementation that processes complex calculations in the logarithmic number system. Thus, low-complexity numeric format converters are installed, using nonuniform piecewise and multiplier-less function approximation. The proposed algorithm is simulated with several different configurations in a downlink precoding environment for 4x4 and 8x8 multi-antenna wireless communication systems. In addition, the results are compared to default CORDIC-based architectures. In a second step, HDL implementation as well as logical and physical CMOS synthesis are performed. The comparison to current references highlights our approach as highly efficient in terms of hardware complexity and accuracy.
Index Terms - QR-Decomposition, Nonuniform function approximation, LNS
This work proposes a low-power methodology for video framebuffers to preserve perceptual quality while reducing SRAM power. Bank-wise voltage scaling combined with error-masking circuitry is proposed, where voltage domains are separated according to the importance of the luminance and color channels. The implementation may be applied to standard embedded memory cores without redesigning specialized hardware within the SRAM bank. The simulation results show that the proposed channel protection technique produces a better energy-quality trade-off than conventional higher-order-bit protection for uncompressed as well as compressed motion video.
Keywords - parametric failure, color image protection, low power circuit, static random access memory (SRAM).
Emerging wireless digital communication standards specify a large variety of channel coding options, each suitable for specific application needs. In this context, several recent efforts have proposed flexible channel decoder implementations. However, the need for optimal solutions in terms of performance, area, and power consumption is increasing and cannot be sacrificed for flexibility. In this paper we present a novel parameterized architecture for multi-standard Turbo decoding which illustrates how flexibility, architecture efficiency, and rapid design time can be combined. The proposed architecture supports both the single-binary Turbo codes (SBTC) of 3GPP-LTE and the double-binary Turbo codes (DBTC) of the WiMAX and DVB-RCS standards. It achieves, in both modes, a high architecture efficiency of 4.37 bits/cycle/iteration/mm2. A major contribution of this work concerns the rapid design time allowed by the well-established design concepts and tools of application-specific instruction-set processors (ASIPs). Using such tools, the paper illustrates the possibility of designing application-specific parameterized cores, removing the need for the program memory and the related instruction decoder.
Video applications are moving from Full-HD capability (1920x1080) to even higher resolutions such as Quad-FullHD (3840x2160). The H.264 Intra-mode can be used by embedded devices to trade off the better encoding efficiency of H.264 temporal prediction (Inter-mode) against savings in area and power, as well as to avoid the massive computational overhead of sub-pixel motion estimation by using only spatial prediction (Intra-mode). Still, the H.264 Intra-mode requires a large computational effort and imposes severe challenges when targeting Quad-FullHD 25 fps real-time video encoding at moderate operating frequencies (we target 150 MHz) and a limited area budget. Therefore, in this work we address the strong sequential data dependencies within H.264 Intra-mode that restrict the parallelism and inhibit high resolution encoding by a) decoupling the DC and AC transform paths and b) cycle-budget-aware mode prediction scheduling, while c) remaining area efficient. Using our proposed techniques, Quad-FullHD (3840x2160) 28 fps video encoding is achieved at 150 MHz, making our architecture applicable for high definition recording.
This paper presents an ASP (application-specific processor) with a 512-bit SIMD (Single Instruction Multiple Data) and 192-bit VLIW (Very Long Instruction Word) architecture optimized for wireless baseband processing. It employs an optimized architecture and address generation unit to accelerate the kernel algorithms. Based on the ASP, a multi-core baseband processor is developed which can work at a 2x2 MIMO and 20 MHz physical bandwidth configuration for the LTE inner receiver and meets the requirements of Category 3 User Equipment (CAT3 UE). Furthermore, a silicon implementation of the baseband processor in 130nm CMOS technology is presented. Experimental results show that the baseband processor provides 100 GOPS of computing capability at 117.6 MHz.
Keywords - Application Specific Processor; VLIW; AGU; Baseband processor; LTE
High Efficiency Video Coding (HEVC/H.265) is an emerging standard for video compression that provides almost double the compression efficiency at the cost of a major increase in computational complexity compared to the current industry-standard Advanced Video Coding (AVC/H.264). This work proposes a collaborative hardware and software scheme for complexity reduction in an HEVC Intra encoding system, with run-time adaptivity. Our scheme leverages video content properties which drive the complexity management layer (software) to generate a highly probable coding configuration. The intra prediction size and direction are estimated for the prediction unit, which reduces computational complexity. At the hardware layer, specialized coprocessors with enhanced reusability are employed as accelerators. Additionally, depending upon the video properties, the software layer administers the energy management of the hardware coprocessors. Experimental results show that a complexity reduction of up to 60% and an energy reduction of up to 42% are achieved.
Forthcoming technology nodes are posing major challenges for the manufacturing of reliable (real-time) systems: process variations, accelerated degradation and aging, as well as external and internal noise are key examples. This paper focuses on real-time systems reliability and analyzes the state of the art and the emerging reliability bottlenecks from three different perspectives: technology, circuit/IP, and full system.
Keywords - Circuit reliability, embedded real-time systems, dependable computing
Response time analysis, which determines whether timing guarantees are satisfied for a given system, has matured to industrial practice and is able to consider even complex activation patterns modelled through arrival curves or minimum distance functions. On the other hand, sensitivity analysis, which determines bounds on parameter variations under which constraints are still satisfied, is largely restricted to variation of single-valued parameters such as task periods. In this paper we provide a sensitivity analysis to determine the bounds on the admissible activation pattern of a task, modelled through a minimum distance function. In an evaluation on a set of synthetic test cases we show that the proposed algorithm provides significantly tighter bounds than previous exact analyses that determine allowable parametrizations of activation patterns.
Mixed-Criticality Scheduling (MCS) is an effective approach to addressing diverse certification requirements of safety-critical systems that integrate multiple subsystems with different levels of criticality. Preemption Threshold Scheduling (PTS) is a well-known technique for controlling the degree of preemption, ranging from fully-preemptive to fully-non-preemptive scheduling. We present schedulability analysis algorithms to enable integration of PTS with MCS, in order to bring the rich benefits of PTS into MCS, including minimizing the application stack space requirement, reducing the number of runtime task preemptions, and improving schedulability.
To address the problem of abrupt service degradation for low-criticality tasks in existing mixed-criticality scheduling algorithms, we study an Elastic Mixed-Criticality (E-MC) task model, whose key idea is to have variable periods (i.e., service intervals) for low-criticality tasks. The minimum service requirement of a low-criticality task is ensured by its largest period. At runtime, however, low-criticality tasks can be released early, by exploiting the slack time generated from the over-provisioned execution time of high-criticality tasks, to reduce their service intervals and thus improve their service levels. We propose an Early-Release EDF (ER-EDF) scheduling algorithm, which can judiciously manage the early release of low-criticality tasks without affecting the timeliness of high-criticality tasks. Compared to the state-of-the-art EDF-VD scheduling algorithm, our simulation results show that ER-EDF can successfully schedule many more task sets. Moreover, the achieved execution frequencies of low-criticality tasks are also significantly improved under ER-EDF.
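The early-release decision at the heart of this model can be sketched as a simple predicate. This is a simplified illustration of the idea (slack must cover the extra work before the guaranteed period expires), not the full ER-EDF schedulability machinery; the periods, WCET, and slack values are hypothetical:

```python
# Sketch of the early-release idea in an elastic mixed-criticality model:
# a low-criticality job may be released before its largest (guaranteed)
# period expires if accumulated slack covers its worst-case demand.
# Simplified illustration with hypothetical numbers, not full ER-EDF.

def may_release_early(now, last_release, min_period, max_period, wcet, slack):
    """Early release is allowed between the minimum desired period and the
    guaranteed maximum period, and only if slack absorbs the extra work."""
    elapsed = now - last_release
    if elapsed >= max_period:
        return True                    # normal release: guarantee point reached
    return elapsed >= min_period and slack >= wcet

# Low-criticality task: would like a 20 ms period, is guaranteed 50 ms.
print(may_release_early(now=25, last_release=0, min_period=20,
                        max_period=50, wcet=3, slack=5))   # True
print(may_release_early(now=25, last_release=0, min_period=20,
                        max_period=50, wcet=3, slack=1))   # False
```

Because early releases consume only slack generated by high-criticality over-provisioning, the guaranteed maximum-period releases remain unaffected, matching the abstract's timeliness claim.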
For more than a decade, researchers have considered Ethernet a natural replacement for legacy fieldbuses in modern distributed applications. However, Ethernet components require special modifications and hardware support to provide strict timing guarantees. In general, the high cost of deploying hardware components limits the experimental validation of proposed solutions in real-world applications. Despite the vast literature, only a few solutions report real implementations, and they are all closed to the research community, hindering further development for constantly evolving applications. This paper introduces Atacama, an ongoing effort to deploy the first hardware-accelerated, open-source framework for mixed-criticality communication on multi-hop networks. Specialized modules exploit the principles of traditional fieldbus systems to coordinate communication tasks on real-time stations, and can be easily integrated into and coexist with Commercial Off-The-Shelf (COTS) devices operating with best-effort traffic. Experimental characterization of the implemented prototypes reports minimal jitter on 1 Gbps links and shows that the real-time guarantees are resilient to injected best-effort traffic. The framework is available as an open-source project, enabling researchers to verify the results and to explore, test, and deploy new networking solutions for modern distributed systems in real-world scenarios.
We explore the potential of subsystem-based design to reduce cost and time-to-market in the design of advanced Systems-on-Chip (SoCs) while retaining low-power and high-performance processing. Using a concrete audio subsystem as an example, we illustrate the benefits of modular SoC integration with subsystems and identify challenges to be addressed. Well-designed subsystems pre-integrate hardware and software modules to implement complete system functions and offer high-level hardware and software interfaces for easy SoC integration. Configurability of subsystems enables reuse across SoCs. Subsystems can offer software plug-ins to support integration into a software stack on a host processor while making core crossings transparent for the application programmer. We conclude that subsystems can indeed be the next reuse paradigm for efficient SoC integration.
Keywords - System-on-Chip (SoC); subsystem; audio
Configurability in IP subsystems has two major motivations. The first is the requirements of the IP subsystem itself; the second is the particular customer requirements, as every customer has unique things they want to change in a subsystem. Configurability manifests itself at two levels: the first at the level of individual components, such as processors (ideally configurable), memories, and hardware blocks for specialized processing; the second at the subsystem level, where component choices, interconnect, and interfaces may all vary considerably. This paper discusses these concepts applied to practical, real baseband subsystems for wireless communications. Configurability allows both scalability of a reference IP subsystem - e.g., to handle a variety of standards and use cases - and differentiation, so that customers get the optimal IP subsystem for their unique needs. This is illustrated with existing product-ready systems and cores, and with future subsystem concepts that will allow even better scalability, performance, and adaptability for the next generation.
Keywords - IP, IP Subsystems, configurable processors, configurable subsystems, baseband
The availability of protocol features with iterative configurability is central to the successful adoption of reusable IP in SoC development. However, the promise of ultimately shrinking SoC development TTM while also allowing greater resourcing efficiency can only be realized with a comprehensive approach to delivering the software, digital, and analog components of the protocol to the SoC top-level integration as IP subsystems with the correct integration views. This talk will discuss quantitatively how the combination of configurability, quality, and integration at the IO protocol level can systematically reduce the SoC development and resource plan. It will be demonstrated with examples for the DDR and PCIe IO protocols, as well as examples from application-specific SoCs.
Within today's SoCs, functionality such as video, audio, graphics, and imaging is increasingly integrated through IP blocks, which are subsystems in their own right. Integration of IP blocks within SoCs has always brought software integration aspects with it. However, since these subsystems increasingly consist of programmable processors, many more layers of firmware and software need to be integrated. This is particularly true in the imaging domain. Imaging subsystems are typically highly heterogeneous, with high levels of parallelism. The construction of their firmware requires target-specific optimization, yet needs to take interoperability with sensor input systems and graphics/display subsystems into account. Hard real-time scheduling within the subsystem needs to cooperate with less stringent image analytics and SoC-level (OS) scheduling; in many of today's systems, the latter often only supports soft scheduling deadlines. At the HW level, IP subsystems need to be integrated such that they can efficiently exchange both short-latency control signals and high-bandwidth data-plane blocks. Solutions exist, but need to be properly configured. At the SW level, however, currently no support exists that provides (i) efficient programmability, (ii) SW abstraction of all the different HW features of these blocks, and (iii) interoperability of these blocks. Starting points could be languages and libraries such as OpenCL and OpenCV, which do provide some abstractions but are not yet sufficiently versatile.
Thirty-two years ago, Electronics Magazine honored Carver Mead and Lynn Conway with its Achievement Award for their contributions to VLSI chip design. The "Mead & Conway methods" were being taught at 100+ universities all over the world, and "not only have helped spawn a common design culture so necessary in the VLSI era, but have greatly increased interaction between university and industry so as to stimulate research by both." Concepts such as simplified design methods, new electronic representations of digital design data, scalable design rules, "clean" formalized digital interfaces between design and manufacturing, and widely accessible silicon foundries suddenly enabled many thousands of chip designers to create many tens of thousands of chip designs. Today, as Moore's Law - a term coined by Carver Mead - has brought us from 10 microns to 10 nanometers, what is the heritage of Mead & Conway? UCB Professor Alberto Sangiovanni-Vincentelli will moderate an industry and research panel to discuss what has remained the same, what was missed, what has changed, and what lies ahead.
As integrated circuits continue to scale, process variation plays an increasingly significant role in system design and in the economic return of semiconductor products. In this paper, we explore the potential for profit improvement under inherent semiconductor variability based on the speed-binning technique. We first propose a set of high-level synthesis techniques, covering allocation, scheduling, and resource binding, that construct designs maximizing the number of chips that can be sold at the most advantageous price, thereby maximizing overall profit. We subsequently explore the optimal bin placement strategy for further profit improvement. Experimental results confirm the superiority of the high-level synthesis results and the associated improvement in profit margins.
We propose a novel custom instruction (CI) selection technique for process variation and transistor aging aware instruction-set architecture synthesis. For aggressive clocking, we select CIs based on statistical static timing analysis (SSTA), which achieves efficient speedup during target lifetime while mitigating degradation of timing yield (i.e., probability of satisfying the timing). Furthermore, we consider process variation and aging on not only CIs but also basic instructions (BIs). Even if basic functional units (BFUs), e.g., ALU, get slower due to aging, only a few BIs with critical propagation delay may violate the timing, whereas the other BIs running on the same BFU can still satisfy the timing. We then introduce "customized BFUs", which execute only such aging-critical BIs. The customized BFUs, used as spare BFUs of the aging-critical BIs, can extend lifetime of the system. Combining the two approaches enables speedup as well as lifetime extension with no or negligibly small area/power overhead. Experiments demonstrate that our work outperforms conventional worst-case work (by an average speedup of about 49%) and existing SSTA-based work (16x or more lifetime extension with comparable speedup).
Multispeculative Functional Units (MSFUs) are arithmetic functional units that operate using several predictors for the carry signal. The carry prediction helps to shorten the critical path of the functional unit, and the average performance of these units is determined by the hit rate of the prediction. By utilizing more than one predictor, in the majority of cases no additional cycle, or only one, is needed to produce the correct result. In this paper we present multispeculation as a way of increasing the performance of tree structures with a negligible area penalty. By judiciously introducing these structures into computation trees, prediction is only necessary at certain selected nodes, thus minimizing the number of operations that can potentially mispredict. Hence, the average latency is diminished and performance is increased. Our experiments show that it is possible to improve execution time by 24% and 38% on average, for logarithmic and linear modules, respectively.
Index Terms - Speculation, operation trees, High-Level Synthesis.
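The split-carry idea can be sketched numerically. The following toy model is an illustration only, not the paper's MSFU design: the 16-bit width, the single always-zero carry predictor, and the one-cycle repair penalty are all assumptions made for the sketch.

```python
MASK8 = 0xFF

def msfu_add(a, b, predicted_carry=0):
    """Toy multispeculative 16-bit adder split into two 8-bit halves.

    Returns (sum, cycles): one cycle when the carry into the high half
    was predicted correctly, two when the high half must be recomputed.
    """
    lo = (a & MASK8) + (b & MASK8)
    carry = lo >> 8                              # true carry into the high half
    hi = ((a >> 8) + (b >> 8) + carry) & MASK8
    result = (hi << 8) | (lo & MASK8)
    cycles = 1 if carry == predicted_carry else 2  # misprediction costs a repair cycle
    return result, cycles
```

With an always-zero predictor, random operands hit whenever the low halves do not generate a carry, which is what makes the average latency depend on the hit rate.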
Resource sharing is a classic high-level synthesis (HLS) optimization that saves area by mapping multiple operations to a single functional unit. With resource sharing, only operations scheduled in separate cycles can be assigned to shared hardware, which can result in longer schedules. In this paper, we propose a new approach to resource sharing that allows multiple operations to be performed by a single functional unit in one clock cycle. Our approach is based on multi-pumping, which operates functional units at a higher frequency than the surrounding system logic, typically 2x, allowing multiple computations to complete in a single system cycle. Our approach is particularly effective for DSP blocks on an FPGA, which are used to perform multiply and/or accumulate operations. Our results show that resource sharing using multi-pumping is comparable to traditional resource sharing in terms of area saved, but provides significant performance advantages. Specifically, when targeting a 50% reduction in DSP blocks, traditional resource sharing decreases circuit speed by 80%, on average, whereas multi-pumping decreases circuit speed by just 5%. Multi-pumping is a viable approach to achieve the area reductions of resource sharing, with considerably less negative impact on circuit performance.
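A back-of-the-envelope cycle count illustrates the trade-off described above. This is a sketch under the assumptions of fully independent operations and an exact 2x pump rate, not a model of any particular FPGA.

```python
import math

def system_cycles(num_ops, num_units, pump_factor=1):
    """System cycles to finish independent multiplies when each shared
    unit completes pump_factor operations per system cycle (a DSP block
    clocked at 2x the system clock gives pump_factor = 2)."""
    return math.ceil(num_ops / (num_units * pump_factor))

# 8 multiplies, targeting a 50% DSP reduction (4 -> 2 blocks):
baseline  = system_cycles(8, 4)       # no sharing pressure
shared    = system_cycles(8, 2)       # classic resource sharing: schedule stretches
multipump = system_cycles(8, 2, 2)    # multi-pumping: half the DSPs, same schedule
```

Here the multi-pumped design matches the unshared schedule length with half the DSP blocks, which is the effect the abstract quantifies.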
In this work, we study the problem of optimizing the datapath under resource constraints in the high-level synthesis of Application-Specific Instruction Processors (ASIPs). We propose a two-level dynamic programming (DP) based heuristic algorithm. At the inner level of the proposed algorithm, the instructions are sorted in topological order, and then a DP algorithm is applied to optimize the topological order of the datapath. At the outer level, the space of the topological order of each instruction is explored to iteratively improve the solution. Compared with an optimal brute-force algorithm, the proposed algorithm achieves a near-optimal solution, with only 3% more performance overhead on average but a significant reduction in runtime. Compared with a greedy algorithm which replaces the DP inner level with a greedy heuristic approach, the proposed algorithm achieves a 48% reduction in performance overhead.
As semiconductor fabrics scale closer to fundamental physical limits, their reliability is decreasing due to process variation, noise margin effects, aging effects, and increased susceptibility to soft errors. Reliability can be regained through redundancy, error checking with recovery, voltage scaling and other means, but these techniques impose area/energy costs. Since some applications (e.g. media) can tolerate limited computation errors and still provide useful results, error-tolerant computation models have been explored, with both the application and computation fabric having stochastic characteristics. Stochastic computation has, however, largely focused on application-specific hardware solutions, and is not general enough to handle arbitrary bit errors that impact memory addressing or control in processors. In response, this paper addresses requirements for error-tolerant execution by proposing and evaluating techniques for running error-tolerant software on a general-purpose processor built from an unreliable fabric. We study the minimum error-protection required, from a microarchitecture perspective, to still produce useful results at the application output. Even with random errors as frequent as every 250μs, our proposed design allows JPEG and MP3 benchmarks to sustain good output quality - 14dB and 7dB respectively. Overall, this work establishes the potential for error-tolerant single-threaded execution, and details its required hardware/system support.
Voltage emergencies have become a major challenge for multi-core processors because core-to-core resonance may endanger all cores and jeopardize system reliability. We observe that applications following the SPMD (Single Program, Multiple Data) programming model tend to spark domain-wide voltage resonance because multiple threads sharing the same function body exhibit similar power activity. When threads are judiciously relocated among the cores, the voltage droops can be greatly reduced. We propose "Orchestrator", a sensor-free, non-intrusive scheme for multi-core architectures to smooth voltage droops. Orchestrator focuses on inter-core voltage interactions and maximally leverages thread diversity to avoid synergistic voltage droops among cores. Experimental results show that Orchestrator can reduce voltage emergencies by up to 64% on average while also improving performance.
This work introduces Check-on-Write: a memory array error protection approach that enables a trade-off between a memory array's fault coverage and energy. The presented approach checks for errors in a value stored in an array before the value is overwritten, rather than when it is read (check-on-read), as is currently done. This aims at reducing the number and energy of error code checks. This lazy protection approach can be used for caches in systems that support failure-atomicity to recover from state corrupted by a fault. The paper proposes and evaluates an adaptive memory protection scheme that is capable of both check-on-read and check-on-write and switches between the two protection modes depending on the energy to be saved and the fault coverage requirements. Experimental analysis shows that our technique reduces the average dynamic energy of the L1 instruction cache tag and data arrays by 18.6% and 17.7%, respectively. For the L1 data cache, the savings are 17.2% and 2.9%, and they are 13.4% for the L2 tag array. The paper also quantifies the implications of the proposed scheme on fault coverage by analyzing the mean-time-to-failure as a function of the transient failure rate.
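The read-versus-write check trade-off can be mimicked with a toy protected entry. This is a sketch: the single parity bit, the `ProtectedEntry` class, and the access pattern are illustrative assumptions, not the paper's error-code scheme.

```python
def parity(v):
    """Even-parity bit of an integer value."""
    return bin(v).count("1") & 1

class ProtectedEntry:
    """One memory-array entry with a parity code, checked either on
    every read ('read' policy) or lazily just before it is overwritten
    ('write' policy, the check-on-write idea)."""

    def __init__(self, policy):
        self.policy = policy
        self.value, self.par = 0, 0
        self.checks = 0          # number of error-code checks performed

    def _check(self):
        self.checks += 1
        return parity(self.value) == self.par

    def read(self):
        if self.policy == "read":
            assert self._check(), "fault detected on read"
        return self.value

    def write(self, v):
        if self.policy == "write":
            assert self._check(), "fault detected before overwrite"
        self.value, self.par = v, parity(v)
```

For a read-dominated trace (one write followed by four reads and a second write), check-on-read pays four checks while check-on-write pays two, which is where the energy saving comes from; the cost is that a corrupted value may sit unchecked until its overwrite.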
Reliability is an essential concern for processor designers due to increasing transient and permanent fault rates. Executing instruction streams redundantly in chip multiprocessors (CMPs) provides high reliability since it can detect both transient and permanent faults. Additionally, it also minimizes the Silent Data Corruption rate. However, comparing the results of the instruction streams, checkpointing the entire system, and recovering from detected errors can lead to substantial performance degradation. In this study we propose FaulTM, an error detection and recovery scheme utilizing Hardware Transactional Memory (HTM) to reduce this performance degradation. We show how a minimally modified HTM featuring lazy conflict detection and lazy data versioning can provide low-cost reliability in addition to HTM's intended purpose of supporting optimistic concurrency. Compared with lockstepping, FaulTM reduces the performance degradation by 2.5X for the SPEC2006 benchmarks.
On a Multi-Level Cell (MLC) flash memory, a flash block that is becoming unreliable to store multiple bits per cell can be "revived" by storing only a single bit per cell. While the revived-block capacity is halved, its lifetime is significantly extended without jeopardizing the stored data. We present Phoenix, a technique that benefits from this feature to extend a device lifetime, and we evaluate its potential through detailed trace simulation on realistic benchmarks. Phoenix shows systematic lifetime extensions ranging from 3% up to 17%, without extra memory requirements or performance loss.
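The revival idea admits a simple endurance arithmetic. The model below is a toy, not Phoenix's trace-driven simulator: the endurance figures, the ideal wear-leveling, and the two-unit/one-unit capacity accounting are assumptions, and real extensions (the paper's 3-17%) are far smaller because workloads must still fit in the halved capacity.

```python
def total_pe_cycles(num_blocks, mlc_endurance, slc_endurance, required_units,
                    units_mlc=2, units_slc=1):
    """Aggregate program/erase cycles a device absorbs when worn-out MLC
    blocks (2 capacity units each) are revived as SLC (1 unit each),
    assuming ideal wear-leveling spreads writes evenly over live blocks."""
    cycles = num_blocks * mlc_endurance            # phase 1: all blocks in MLC mode
    if num_blocks * units_slc >= required_units:   # enough capacity after revival?
        cycles += num_blocks * slc_endurance       # phase 2: revived SLC blocks
    return cycles
```

Usage: with 100 blocks, an assumed 3K-cycle MLC endurance and 30K-cycle SLC endurance, the device absorbs 3.3M cycles if 80 capacity units suffice after revival, but only 0.3M if 150 units are still required.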
Compact thermal models and modeling strategies are today a cornerstone of advanced power management to counteract the emerging thermal crisis in many-core systems-on-chip. System identification techniques make it possible to extract models directly from the thermal response of the target device. Unfortunately, standard Least Squares techniques cannot effectively cope with both the model approximation and the measurement noise typical of real systems. In this work, we present a novel distributed identification strategy capable of coping with real-life temperature sensor noise and effectively extracting a set of low-order predictive thermal models for the tiles of Intel's Single-chip Cloud Computer (SCC) many-core prototype.
JEDEC recently introduced its new standard for 3D-stacked Wide I/O DRAM memories, which defines their architecture, design, features and timing behavior. With improved performance/power trade-offs over previous generation DRAMs, Wide I/O DRAMs provide an extremely energy-efficient green memory solution required for next-generation embedded and high-performance computing systems. With both industry and academia pushing to evaluate and employ these highly anticipated memories, there is an urgent need for an accurate power model targeting Wide I/O DRAMs that enables their efficient integration and energy management in DRAM stacked SoC architectures. In this paper, we present the first system-level power model of 3D-stacked Wide I/O DRAM memories that is almost as accurate as detailed circuit-level power models of 3D-DRAMs. To verify its accuracy, we experimentally compare its power and energy estimates for different memory workloads and operations against those of a circuit-level 3D-DRAM power model and show less than 2% difference between the two sets of estimates.
A case study exploring multi-frequency design is presented for a low-energy, high-performance FFT circuit implementation. An FFT architecture with concurrent data-stream computation is selected. Asynchronous and synchronous implementations of a 16-point and a 64-point FFT circuit were designed and compared for energy, performance, and area. Both versions are structurally similar and are generated using similar ASIC CAD tools and flows. The asynchronous design shows a benefit of 2.4x, 2.4x, and 3.2x in terms of area, energy, and performance, respectively, over its synchronous counterpart. The circuit is further compared with other published designs and shows 0.4x, 4.8x, and 32.4x benefit with respect to area, energy, and performance, respectively.
Index Terms - Asynchronous circuits, FFT, synthesis, timing analysis, low power digital, low energy digital, synchronous circuits, high performance
The increasing demand for fast and accurate product pricing and risk computation, together with high energy costs, is currently making finance and insurance institutes rethink their IT infrastructure. Heterogeneous systems including specialized accelerator devices are a promising alternative to current CPU and GPU clusters on the path towards hardware-accelerated computing. Previous work has already shown that complex state-of-the-art computations that must be performed very frequently can be sped up by FPGA accelerators in a highly efficient way in this domain. A very common task is the pricing of credit derivatives, in particular options, under realistic market models. Monte Carlo methods are typically employed for complex or path-dependent products. It has been shown that the multi-level Monte Carlo method can provide much better convergence behavior than standard single-level methods. In this work we present the first hardware architecture for pricing European barrier options in the Heston model based on the advanced multi-level Monte Carlo method. The presented architecture uses industry-standard AXI4-Stream flow control, is constructed in a modular way, and can easily be extended to more products. We show that it computes around 100 million steps per second with a total power consumption of 3.58 W on a Xilinx Virtex-6 FPGA.
In the past few years, many techniques have been introduced that try to utilize the excessive timing margins of a processor. However, these techniques have limitations for one of the following reasons: first, they are not suitable for high-performance processor designs due to the power and design overhead they impose; second, they are not accurate enough to effectively exploit the timing margins, requiring a substantial safety margin to guarantee correct operation of the processor. In this paper, we introduce an alternative, more effective technique that is suitable for high-performance processor designs, in which a processor predicts timing errors in the critical paths and undertakes preventive steps to avoid the errors whenever the timing margins fall below a critical level. This technique allows a processor to exploit timing margins while requiring only the minimum safety margin. Our simulation results show that the proposed technique yields 12% and 6% improvements in energy and Energy-Delay Product (EDP), respectively, over a Razor-based speculative method.
We propose a methodology to design user-aware streaming strategies for energy-efficient smartphone video playback applications (e.g., YouTube). Our goal is to manage the streaming process so as to minimize the sleep and wake penalty of the cellular module while at the same time avoiding the energy waste of excessive downloading. The problem is modeled as a stochastic inventory system, where the actual length of video playback requested by the smartphone user is treated as a demand that follows a stochastic process. Through user behavior analysis, a Gaussian Mixture Model (GMM) is constructed to predict the user demand in video playback, and an energy-efficient video downloading strategy is then determined progressively during the playback process. Experimental results show that, compared to a static downloading strategy optimized by exhaustive trial, our method can reduce the wasted energy by 10% on average.
Key words: smartphone, video download, 3G, energy, Inventory Theory, Gaussian Mixture Model
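One way such a fitted GMM could drive the download decision is sketched below. This is a minimal illustration, not the paper's algorithm: the mixture parameters, the two cost terms, and the newsvendor-style critical-ratio quantile are all assumptions.

```python
import math

def gmm_cdf(x, comps):
    """CDF of a 1-D Gaussian mixture; comps = [(weight, mean, std), ...]."""
    return sum(w * 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))
               for w, m, s in comps)

def download_target(comps, wake_cost, waste_cost, lo=0.0, hi=3600.0):
    """Seconds of video to prefetch before letting the radio sleep:
    the critical-ratio quantile of the predicted watch-length mixture,
    trading the cost of waking the radio again (under-download) against
    energy wasted on video the user never watches (over-download)."""
    q = wake_cost / (wake_cost + waste_cost)   # critical ratio in (0, 1)
    for _ in range(60):                        # bisect for the q-quantile
        mid = (lo + hi) / 2.0
        if gmm_cdf(mid, comps) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

With a single mixture component centered at 300 s and symmetric costs, the prefetch target sits at the predicted mean; skewed costs or multi-modal user behavior shift it accordingly.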
Distributed computing resources in a cloud computing environment provide an opportunity to reduce energy and its cost by shifting loads in response to the dynamically varying availability of energy. This variation in electrical power availability is reflected in its dynamically changing price, which can be used to drive workload deferral against performance requirements. But such deferral may cause user dissatisfaction. In this paper, we quantify the impact of deferral on user satisfaction and utilize the flexibility for deferral in service level agreements (SLAs) to adapt to dynamic price variation. We differentiate among jobs based on their requirements for responsiveness and schedule them for energy saving while meeting deadlines and user satisfaction. Representing utility as decaying functions alongside workload deferral, we strike a balance between loss of user satisfaction and energy efficiency. We model delay with decaying utility functions, guarantee that no job violates its maximum deadline, and minimize the overall energy cost. Our simulation on MapReduce traces shows that energy consumption can be reduced by ~15% with such utility-aware deferred load balancing. We also find that treating utility as a decaying function gives better cost reduction than load balancing with a fixed deadline.
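The deferral trade-off can be sketched with a toy slot scheduler. This is an illustration under assumed forms, not the paper's model: the exponential utility decay, the price vector, and the single-job objective are hypothetical.

```python
import math

def schedule_job(prices, energy, u0, decay, deadline):
    """Pick the start slot (never past the hard deadline) that minimizes
    energy cost minus an exponentially decaying user utility.

    prices[t] is the electricity price in slot t; u0 * exp(-decay * t)
    is the utility of starting the job after a deferral of t slots.
    """
    best_t, best_obj = 0, float("inf")
    for t in range(min(deadline, len(prices) - 1) + 1):
        obj = prices[t] * energy - u0 * math.exp(-decay * t)
        if obj < best_obj:
            best_t, best_obj = t, obj
    return best_t
```

A patient job (slow decay) drifts toward cheap slots, while an interactive job (fast decay, high initial utility) runs immediately despite the price, which is the differentiation by responsiveness the abstract describes.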
Reducing the energy consumption for computation and cooling in servers is a major challenge considering the data center energy costs today. To ensure energy-efficient operation of servers in data centers, the relationship among computational power, temperature, leakage, and cooling power needs to be analyzed. By means of an innovative setup that enables monitoring and controlling the computing and cooling power consumption separately on a commercial enterprise server, this paper studies temperature-leakage-energy tradeoffs, obtaining an empirical model for the leakage component. Using this model, we design a controller that continuously seeks and settles at the optimal fan speed to minimize the energy consumption for a given workload. We run a customized dynamic load-synthesis tool to stress the system. Our proposed cooling controller achieves up to 9% energy savings and 30W reduction in peak power in comparison to the default cooling control scheme.
As CMOS technologies enter nanometer scales, microprocessors become more vulnerable to transistor aging, mainly due to Bias Temperature Instability and Hot Carrier Injection. These phenomena lead to increasing device delays over the operational lifetime, which results in increasing pipeline stage delays. However, the aging rates of different stages differ. Hence, a previously delay-balanced pipeline becomes increasingly imbalanced, resulting in a non-optimized design in terms of Mean Time to Failure (MTTF), frequency, area, and power consumption. In this paper, we propose an MTTF-balanced pipeline design, in which the pipeline stage delays are balanced at the desired lifetime rather than at design time. This can lead to significant MTTF (lifetime) improvements as well as additional performance, area, and power benefits. Our experimental results show that the MTTF of the FabScalar microprocessor can be improved by 2x (or its frequency by 3%) while achieving an additional 4% power and 1% area optimization.
Increasing parameter variations, caused by variations in process, temperature, power supply, and wear-out, have emerged as one of the most important challenges in semiconductor manufacturing and test. As a consequence for gate delay testing, a single test vector pair is no longer sufficient to provide the required low test escape probabilities for a single delay fault. Recently proposed statistical test generation methods are therefore guided by a metric, which defines the probability of detecting a delay fault with a given test set. However, since runtime and accuracy are dominated by the large number of required metric evaluations, more efficient approximation methods are mandatory for any practical application. In this work, a new statistical dynamic timing analysis algorithm is introduced to tackle this problem. The associated approximation error is very small and predominantly caused by the impact of delay variations on path sensitization and hazards. The experimental results show a large speedup compared to classical Monte Carlo simulations.
In situ monitoring is an accurate way to monitor circuit delay or timing slack, but usually incurs significant overhead. We observe that most existing slack monitoring methods exclusively focus on monitoring path ending registers, which is not cost efficient from power and area perspectives. In this paper, we propose SlackProbe methodology, which inserts timing slack monitors like "probes" at a selected set of nets, including intermediate nets along critical paths. SlackProbe can significantly reduce the total number of monitors required at the cost of some additional delay margin. It can be used to detect impending delay failures due to various reasons (process variations, ambient fluctuations, circuit aging, etc.) and can be used with various preventive actions (e.g. voltage/frequency scaling, clock stretching/time borrowing, etc.). Though we focus on monitor selection in this work, we give an example of using SlackProbe with adaptive voltage scaling. Experimental results on commercial processors show that with 5% more timing margin, SlackProbe can reduce the number of monitors by 15-18X as compared to the number of monitors inserted at path ending pins.
With aggressive device scaling, the impact of parameter variation is becoming more prominent, resulting in uncertainty in a chip's performance. Techniques that capture post-silicon variation by deploying on-chip monitors suffer from serious area overhead and low testing reliability, while techniques using non-invasive tests are limited to small-scale circuits. In this paper, a novel layout-aware post-silicon variation extraction method based on non-invasive path-delay testing is proposed. Its key technique is a novel layout-aware heuristic path selection algorithm which takes the spatial correlation and linear dependence between paths into consideration. Experimental results show that the proposed technique can obtain an accurate timing variation distribution with zero area overhead. Moreover, the test cost is much smaller than that of existing non-invasive methods.
Keywords: variation extraction, path selection, path-delay testing, layout-aware
Increasing process variations, coupled with the need for highly adaptable circuits, bring about tough new challenges in circuit testing. Circuit adaptation for process and workload variability requires costly characterization/test cycles for each chip in order to extract the particular Vdd/fmax behavior of the die under test. This paper aims at adaptively reducing the search space for fmax at multiple levels by reusing information previously obtained from the DUT during test time. The proposed adaptive solution reduces the test/characterization time and cost with no area overhead.
Although most previous work in cache analysis for WCET estimation assumes the LRU replacement policy, in practice more processors use simpler non-LRU policies for lower cost, power consumption, and thermal output. This paper focuses on the analysis of FIFO, one of the most widely used cache replacement policies. Previous analysis techniques for FIFO caches are based on the same framework as for LRU caches, using qualitative always-hit/always-miss classifications. While this approach works well for LRU caches, it is not suitable for analyzing FIFO and usually leads to poor WCET estimation quality. In this paper, we propose a quantitative approach for FIFO cache analysis. Roughly speaking, the proposed quantitative analysis derives an upper bound on the "miss ratio" of an instruction (set), which better captures the FIFO cache behavior and supports more accurate WCET estimation. Experiments with benchmarks show that our proposed quantitative FIFO analysis can drastically improve the WCET estimation accuracy over previous techniques (the average overestimation ratio is reduced from around 70% to 10% under typical settings).
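The "miss ratio" the quantitative analysis bounds can be observed directly in a toy simulation. The sketch below is an illustration of FIFO behavior only (a single fully-associative set and a hand-picked trace are assumptions), not the paper's static analysis.

```python
from collections import deque

def fifo_miss_ratio(accesses, ways, target):
    """Simulate one FIFO-replacement cache set and return the observed
    miss ratio of the address `target` over the access sequence."""
    cache = deque()
    hits = misses = 0
    for addr in accesses:
        if addr in cache:
            if addr == target:
                hits += 1
            # FIFO: a hit does NOT reorder lines -- the key difference from LRU
        else:
            if addr == target:
                misses += 1
            if len(cache) == ways:
                cache.popleft()          # evict the oldest-inserted line
            cache.append(addr)
    return misses / (hits + misses)
```

On the trace a, b, a, c, a with a 2-way set, 'a' misses twice out of three accesses (ratio 2/3), whereas under LRU the hit on 'a' would refresh it and only the first access would miss; a qualitative always-hit/always-miss classification can express neither behavior, which motivates bounding the ratio instead.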
Many real-time embedded systems execute multi-mode applications, i.e. applications that can change their functionality over time. With the advent of multi-core embedded architectures, the system design process requires appropriate support for accommodating multi-mode applications on multiple cores which share common resources. Various mode change and resource arbitration protocols, and corresponding timing analysis solutions were proposed for either multi-mode or multi-core real-time applications. However, no attention was given to multi-mode applications that share resources when executing on multi-core systems. In this paper, we address this subject in the context of automotive multi-core processors using AUTOSAR. We present an approach for safely handling shared resources across mode changes and provide a corresponding timing analysis method.
The transition towards multi-processor systems with shared resources is challenging for real-time systems, since resource interference between concurrent applications must be bounded using timing analysis. There are two common approaches to this problem: 1) detailed analysis that models the particular resource and arbiter cycle-accurately to achieve tight bounds; 2) temporal abstractions, such as latency-rate (LR) servers, that enable unified analysis of different resources and arbiters using well-known timing analysis frameworks. However, the use of abstraction typically reduces the tightness of the analysis and may result in over-dimensioned systems, although this pessimism has not been properly investigated. This paper compares the two approaches in terms of the worst-case execution time (WCET) of applications sharing an SDRAM memory under Credit-Controlled Static-Priority (CCSP) arbitration. The three main contributions are: 1) a detailed interference analysis of the SDRAM memory and CCSP arbiter; 2) based on the detailed analysis, two optimizations to the LR analysis that increase the tightness of its interference bounds; 3) an experimental comparison of the two approaches that quantifies their impact on the WCET of applications from the CHStone benchmark.
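The pessimism of the LR abstraction can be made concrete with two textbook-style bounds. This is a sketch of the general LR service guarantee (service latency theta, allocated rate rho), with the request sizes and parameters being assumptions; it is not the paper's CCSP/SDRAM analysis.

```python
import math

def lr_bound_per_request(sizes, theta, rho):
    """Pessimistic bound: every request is assumed to start its own busy
    period, so each pays the full service latency theta before being
    served at rate rho."""
    return sum(theta + math.ceil(s / rho) for s in sizes)

def lr_bound_pipelined(sizes, theta, rho):
    """Tighter bound for back-to-back requests: one busy period, theta is
    paid once and the whole backlog drains at the allocated rate rho."""
    return theta + math.ceil(sum(sizes) / rho)
```

For eight 4-word requests with theta = 20 and rho = 1, the decoupled bound gives 192 time units against 52 for the single-busy-period bound, illustrating how much tightness the choice of abstraction can cost.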
Multiple-patterning optical lithography is inevitable for technology scaling beyond the 22nm technology node. Multiple patterning imposes several counter-intuitive restrictions on layout and carries serious challenges for design methodology. This paper examines the role of design at different stages of the development and adoption of multiple patterning: technology development, design enablement, and process control. We discuss how explicit design involvement can enable timely adoption of multi-patterning with reduced costs both in design and manufacturing.
Existing post-silicon validation techniques are generally ad hoc, and their cost and complexity are rising faster than design cost. Hence, systematic approaches to post-silicon validation are essential. Our research indicates that many of the bottlenecks of existing post-silicon validation approaches are direct consequences of very long error detection latencies. Error detection latency is the time elapsed between the activation of a bug during post-silicon validation and its detection or manifestation as a system failure. In our earlier papers, we created the Quick Error Detection (QED) technique to overcome this significant challenge. QED systematically creates a wide variety of post-silicon validation tests to detect bugs in processor cores and uncore components of multi-core Systems-on-Chip (SoCs) very quickly, i.e., with very short error detection latencies. In this paper, we present an overview of QED and summarize key results: 1. Error detection latencies of "typical" post-silicon validation tests can range up to billions of clock cycles. 2. QED shortens error detection latencies by up to six orders of magnitude. 3. QED enables a 2- to 4-fold improvement in bug coverage. QED does not require any hardware modification; hence, it is readily applicable to existing designs.
Keywords - Debug, Post-Silicon Validation, Quick Error Detection, Testing, Verification
Reliability is one of the major concerns in designing integrated circuits in nanometer CMOS technologies. Problems related to transistor degradation mechanisms such as NBTI/PBTI or soft gate breakdown cause time-dependent circuit performance degradation. Variability and mismatch between transistors only make this more severe, while at the same time transistor aging can increase the variability and mismatch in the circuit over time. Finally, in advanced nanometer CMOS, the aging phenomena themselves become discrete, with both the time and the impact of degradation being fully stochastic. This paper explores these problems by means of a circuit example, indicating the time-dependent stochastic nature of offset in a comparator and its impact in flash A/D converters.
Keywords - analog integrated circuits; aging; reliability modeling and simulation
Asynchronous networks-on-chip (NoCs) are an appealing solution to tackle the synchronization challenge in modern multicore systems through the implementation of a GALS paradigm. However, they have found only limited applicability so far due to two main reasons: the lack of proper design tool flows as well as their significant area footprint over their synchronous counterparts. This paper proposes a largely unexplored design point for asynchronous NoCs, relying on transition-signaling bundled data, which contributes to break the above barriers. Compared to an existing lightweight synchronous switch architecture, xpipesLite, the post-layout asynchronous switch achieved a 71% reduction in area, up to 85% reduction in overall power consumption, and a 44% average reduction in energy-per-flit, while mastering the more stringent timing assumptions of this solution with a semi-automated synthesis flow.
As technology scales, SoCs are increasing in core counts, leading to the need for scalable NoCs to interconnect the multiple cores on the chip. Given aggressive SoC design targets, NoCs have to deliver low latency, high bandwidth, at low power and area overheads. In this paper, we propose Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) NoC, a NoC that reconfigures and tailors a generic mesh topology for SoC applications at runtime. The heart of our SMART NoC is a novel low-swing clockless repeated link circuit embedded within the router crossbars, that allows packets to potentially bypass all the way from source to destination core within a single clock cycle, without being latched at any intermediate router. Our clockless repeater link has been proven in silicon in 45nm SOI. Results show that at 2GHz, we can traverse 8mm within a single cycle, i.e. 8 hops with 1mm cores. We implement the SMART NoC to layout and show that SMART NoC gives 60% latency savings, and 2.2X power savings compared to a baseline mesh NoC.
On-chip interconnection networks simplify the increasingly challenging process of integrating multiple functional modules in modern Systems-on-Chip (SoCs). The routers are the heart and backbone of such networks, and their implementation cost (area/power) determines the cost of the whole network. In this paper, we explore the time-multiplexing of a router's output ports via a folded datapath and control, where only a portion of the router's arbiters and crossbar multiplexers are implemented, as a means to reduce the cost of the router without sacrificing performance. In parallel, we propose the incorporation of the switch-folded routers into a new form of heterogeneous network topologies, comprising both folded (time-multiplexed) and unfolded (conventional) routers, which leads to effectively the same network performance, but at lower area/energy, as compared to topologies composed entirely of full-fledged wormhole or virtual-channel-based router designs.
In this paper, we propose a scheme for reducing the latency of packets transmitted via the on-chip interconnection network in MultiProcessor Systems-on-Chip (MPSoCs). In this scheme, the network architecture separates the packets transmitted to near destinations from those transmitted to distant ones by using two network layers. These two layers are realized by dividing the channel width among the cores. The optimum ratio for the channel width division is a function of the relative significance of the two types of communication. Simulation results indicate that for non-uniform traffic consisting of more than 30 percent local traffic, the proposed network provides, on average, 64% and 70% improvement over the conventional one in terms of average network latency and Energy-Delay Product (EDP), respectively. Also, for uniform and NED traffic patterns, by adjusting the number of hops between local nodes so that approximately 55 percent of total communications are local, the proposed architecture provides a latency reduction of 50%.
In this work, we propose SVR-NoC, a learning-based support vector regression (SVR) model for evaluating Network-on-Chip (NoC) latency performance. Different from state-of-the-art NoC analytical models, which use classical queuing theory to directly compute the average channel waiting time, the proposed SVR-NoC model performs NoC latency analysis by learning from typical training data. More specifically, we develop a systematic machine-learning framework that uses the kernel-based support vector regression method to predict the channel average waiting time and the traffic flow latency. Experimental results show that SVR-NoC can predict the average packet latency accurately while achieving about 120X speed-up over simulation-based evaluation methods.
Index Terms - Network-on-Chip, learning, performance model
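As a rough sketch of the learning step the abstract describes, the snippet below fits a kernel-based regressor (plain RBF kernel ridge regression standing in for support vector regression; the features, synthetic waiting-time model, and all parameter values are invented for illustration, not taken from the paper) to predict channel waiting time from traffic parameters:

```python
import numpy as np

def rbf_kernel(A, B, gamma=50.0):
    """Gram matrix of the RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Synthetic training data: (injection rate, hop count) -> waiting time,
# generated by a made-up M/M/1-style curve plus noise.
rng = np.random.default_rng(0)
rates = rng.uniform(0.05, 0.8, 200)
hops = rng.integers(1, 8, 200).astype(float)
y = hops * rates / (1.0 - rates) + rng.normal(0.0, 0.05, 200)
X = np.column_stack([rates, hops])

# Kernel ridge regression: solve (K + lam*I) alpha = y once offline...
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(X)), y)

def predict(x):
    """...then each query is a single kernel evaluation, not a simulation."""
    return float(rbf_kernel(np.atleast_2d(x), X) @ alpha)

print(f"predicted waiting time at (0.5, 4 hops): {predict([0.5, 4.0]):.2f}")
```

The surrogate illustrates in miniature where the reported 120X speed-up comes from: once trained, a latency query is a matrix-vector product instead of a full NoC simulation.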
In safety-related applications and in products with long lifetimes, reliability is a must. Moreover, in future integrated-circuit technology nodes, device-level reliability may decrease, i.e., counter-measures have to be taken to ensure product-level reliability. But assessing the reliability of a large system is not a trivial task. This paper revisits the state of the art in reliability evaluation, starting from the physical device level, through the software system level, all the way up to the product level. Relevant standards and future trends are discussed.
This paper addresses the dynamic energy consumption in L1 data cache interfaces of out-of-order superscalar processors. The proposed Multiple Access Low Energy Cache (MALEC) is based on the observation that consecutive memory references tend to access the same page. It exhibits a performance level similar to state-of-the-art caches, but consumes approximately 48% less energy. This is achieved by deliberately restricting accesses to only 1 page per cycle, allowing the utilization of single-ported TLBs and cache banks, and simplified lookup structures of Store and Merge Buffers. To mitigate performance penalties, it shares memory address translation results between multiple memory references, and shares data among loads to the same cache line. In addition, it uses a Page-Based Way Determination scheme that holds way information of recently accessed cache lines in small storage structures called way tables that are closely coupled to TLB lookups and are able to simultaneously service all accesses to a particular page. Moreover, it removes the need for redundant tag-array accesses, usually required to confirm way predictions. For the analyzed workloads, MALEC achieves average energy savings of 48% in the L1 data memory subsystem over a high-performance cache interface that supports up to 2 loads and 1 store in parallel. Comparing MALEC and the high-performance interface against a low-power configuration limited to only 1 load or 1 store per cycle reveals 14% and 15% performance gains, requiring 22% less and 48% more energy, respectively. Furthermore, Page-Based Way Determination exhibits a coverage of 94%, which is a 16% improvement over the originally proposed line-based way determination.
NAND flash memory is widely used for secondary storage today. The flash translation layer (FTL) is the embedded software responsible for managing and operating the flash storage system. One important module of the FTL performs RAM management, which is well known to have a significant impact on the flash storage system's performance. This paper proposes an efficient RAM management scheme called TreeFTL. As the name suggests, TreeFTL organizes address translation pages and data pages in RAM in a tree structure, through which it dynamically adapts to workloads by adjusting the partitions for address mapping and data buffering. TreeFTL also employs a lightweight mechanism to implement the least-recently-used (LRU) algorithm for RAM cache evictions. Experiments show that, compared to the two latest schemes for RAM management in flash storage systems, TreeFTL can reduce service time by 46.6% and 49.0% on average, respectively, with a 64MB RAM cache.
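The abstract names a lightweight LRU mechanism for RAM cache eviction; the following is a generic LRU page-cache sketch (not TreeFTL's actual tree-structured layout, and the page numbers and contents are invented examples):

```python
from collections import OrderedDict

class LRUCache:
    """Generic least-recently-used page cache: a stand-in for the
    lightweight LRU eviction TreeFTL employs, not its tree layout."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page number -> page data

    def access(self, page_no, data=None):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)  # mark most recently used
            return self.pages[page_no]
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)   # evict the LRU page
        self.pages[page_no] = data
        return data

cache = LRUCache(2)
cache.access(1, "map")   # cache a mapping page
cache.access(2, "data")  # cache a data page
cache.access(1)          # touch page 1, so page 2 becomes LRU
cache.access(3, "data")  # full: evicts page 2
print(sorted(cache.pages))  # [1, 3]
```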
Program disturb, read disturb, and the retention time limit are three major causes of bit errors in NAND flash memory. The adoption of multi-level cell (MLC) technology and technology scaling further aggravate this reliability issue by narrowing threshold voltage noise margins and introducing larger device variations. Besides implementing error correction codes (ECC) in NAND flash modules, RAID-5 is often deployed at the system level to protect the data integrity of NAND flash storage systems (NFSS), albeit with significant performance degradation. In this work, we propose a technique called "DA-RAID-5" to improve the performance of enterprise NFSS under RAID-5 protection without harming reliability (here DA stands for "disturb aware"). Three schemes, namely unbound-disturb limiting (UDL), PE-aware RAID-5, and Hybrid Caching (HC), are proposed to protect the NFSS at different stages of its lifetime. The experimental results show that, compared to the best prior work, DA-RAID-5 can improve the NFSS response time by 9.7% on average.
Enabling subarrays reduces memory latency by allowing concurrent accesses to different subarrays within the same bank in the DRAM system. However, this technique faces great challenges in the PCM system, since an on-going write cannot overlap with other accesses due to the large electric current drawn by writes. This paper proposes two new mechanisms (PASAK and WAVAK) that leverage subarray-level parallelism to enable a bank to serve a write and multiple reads in parallel without violating power constraints. PASAK exploits the electric current difference between writing a bit 0 and a bit 1, and provides a new power allocation strategy that better utilizes the power budget to mitigate the performance degradation due to bank conflicts. WAVAK adds a simple coding method that inverts all bits to be written if there are more zeros than ones, with the goal of reducing the electric current for writes and creating a larger power surplus to serve more reads if there is no subarray conflict. Experimental results under 4-core SPEC CPU 2006 workloads show that our proposed mechanisms can reduce memory latency by 68.7% and running time by 34.8% on average, compared with the standard PCM system. In addition, our mechanisms outperform Flip-N-Write by 14.6% in latency and 8.5% in running time on average.
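The bit-inversion coding that WAVAK adds can be sketched in a few lines. The eight-bit word below is an invented example, and the premise (taken from the abstract) is that zero bits cost more write current, so a word with more zeros than ones is stored inverted, with a flag recording the flip:

```python
def flip_encode(bits):
    """Invert the word when it holds more zeros than ones, so fewer
    high-current bit writes are needed; a flag records the flip."""
    zeros = bits.count(0)
    ones = len(bits) - zeros
    if zeros > ones:
        return [1 - b for b in bits], 1  # stored inverted, flag set
    return list(bits), 0

def flip_decode(bits, flag):
    """Undo the inversion on read-back using the stored flag."""
    return [1 - b for b in bits] if flag else list(bits)

word = [0, 0, 0, 1, 0, 1, 0, 0]          # six zeros, two ones
coded, flag = flip_encode(word)
print(coded, flag)                        # [1, 1, 1, 0, 1, 0, 1, 1] 1
assert coded.count(0) <= coded.count(1)   # at most half the bits are zero
assert flip_decode(coded, flag) == word   # lossless round trip
```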
As graphics processing units (GPUs) are becoming increasingly popular for general purpose workloads (GPGPU), the question arises how such processors will evolve architecturally in the near future. In this work, we identify and discuss trade-offs for three GPU architecture parameters: active thread count, compute-memory ratio, and cluster and warp sizing. For each parameter, we propose changes to improve GPU design, keeping in mind trends such as dark silicon and the increasing popularity of GPGPU architectures. A key enabler is dynamism and workload-adaptiveness, enabling among others: dynamic register file sizing, latency-aware scheduling, roofline-aware DVFS, run-time cluster fusion, and dynamic warp sizing.
Embedded biosignal analysis involves a considerable amount of parallel computation, which can be exploited by employing low-voltage and ultra-low-power (ULP) parallel computing architectures. By allowing data and instruction broadcasting, the single instruction multiple data (SIMD) processing paradigm enables considerable power savings and application speedup, in turn allowing for a lower voltage supply for a given workload. The state-of-the-art multi-core architectures for biosignal analysis, however, lack a bare yet smart synchronization technique among the cores that allows lockstep execution of algorithm parts that can be performed using SIMD, even in the presence of data-dependent execution flows. In this paper, we propose a lightweight synchronization technique to enhance an ULP multi-core processor, resulting in improved energy efficiency through lockstep SIMD execution. Our results show that the proposed improvements accomplish tangible power savings, up to 64% for an 8-core system operating at a workload of 89 MOps/s while exploiting voltage scaling.
GPUs spend significant time on synchronization stalls. Such stalls provide ample opportunity to save leakage energy in GPU structures left idle during such periods. In this paper we focus on the register file structure of NVIDIA GPUs and introduce sync-aware low leakage solutions to reduce power. Accordingly, we show that applying the power gating technique to the register file during synchronization stalls can improve power efficiency without considerable performance loss. To this end, we equip the register file with two leakage power saving modes with different levels of power saving and wakeup latencies.
Fault tolerant software against fault attacks constitutes an important class of countermeasures for embedded systems. In this work, we implemented and systematically analyzed a comprehensive set of 19 different strategies for software countermeasures with respect to protection effectiveness as well as time and memory efficiency. We evaluated the performance and security of all implementations by fault injections into a microcontroller simulator based on an ARM Cortex-M3. Our results show that some rather simple countermeasures outperform other more sophisticated methods due to their low memory and/or performance overhead. Further, combinations of countermeasures show strong characteristics and can lead to a high fault coverage, while keeping additional resources at a minimum. The results obtained in this study provide developers of secure software for embedded systems with a solid basis to decide on the right type of fault attack countermeasure for their application.
This paper introduces a generic and automated methodology to protect hardware designs from side-channel attacks in a manner that is fully compatible with commercial standard cell design flows. The paper describes a tool that artificially adds jitter to the clocks of the sequential elements of a cryptographic unit, which increases the non-determinism of signal timing, thereby making the physical device more difficult to attack. Timing constraints are then specified to commercial EDA tools, which restore the circuit functionality and efficiency while preserving the introduced randomness. The protection scheme is applied to an AES-128 hardware implementation that is synthesized using both ASIC and FPGA design flows.
Silicon physical unclonable functions (PUFs) utilize the uncontrollable variations of the integrated circuit (IC) fabrication process to facilitate security-related applications such as IC authentication. In this paper, we describe a new framework to generate secure PUF secrets from ring oscillator (RO) PUFs with improved hardware efficiency. Our work builds on the recently proposed group-based RO PUF with the following novel concepts: an entropy distiller to filter out systematic variation; a simplified grouping algorithm to partition the ROs into groups; a new syndrome coding scheme to facilitate error correction; and an entropy packing method to enhance coding efficiency and security. Using an RO PUF dataset available in the public domain, we demonstrate that these concepts can create PUF secrets that pass the NIST randomness and stability tests. Compared to other state-of-the-art RO PUF designs, our approach can generate an average of 72% more PUF secret bits with the same amount of hardware.
Physical Unclonable Functions (PUFs) extract unique chip signatures from process variations. They are used in identification, authentication, integrity verification, and anti-counterfeiting tasks. We introduce new PUF techniques that extract bits from pairwise skews between sinks of a clock network. These techniques inherit the stability of the clock network, but require a return network to deliver clock pulses to a certain region, where they are compared. Our algorithms select equidistant sinks and route the return network, then derive chip-specific random bits from the available data with a moderate overhead. SPICE-based evaluation of clock-PUFs using a 45nm CMOS technology validates their operability, stability, uniqueness, randomness, and low overhead.
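As an illustration of the bit-extraction idea, the sketch below derives one bit per pair of clock sinks from their relative skews. The sink names, picosecond values, and pairing are invented, and the paper's actual sink-selection and routing algorithms are far more involved:

```python
def skew_bits(skews, pairs):
    """Derive PUF bits from pairwise clock-sink skews: a bit is 1 when
    the first sink of a pair has the larger skew. Because the sinks are
    (nominally) equidistant, the sign of each skew difference is set by
    process variation, making the bits chip-specific."""
    return [1 if skews[a] > skews[b] else 0 for a, b in pairs]

# Hypothetical measured skews (ps) at four equidistant clock sinks.
skews = {"s0": 12.3, "s1": 11.8, "s2": 12.9, "s3": 12.1}
pairs = [("s0", "s1"), ("s1", "s2"), ("s2", "s3"), ("s3", "s0")]
print(skew_bits(skews, pairs))  # [1, 0, 1, 0]
```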
Memristors are emerging as a potential candidate for next-generation memory technologies, promising to deliver non-volatility at performance and density targets which were previously the domain of SRAM and DRAM. Silicon Physically Unclonable Functions (PUFs) have been introduced as a relatively new security primitive which exploit manufacturing variation resulting from the IC fabrication process to uniquely fingerprint a device instance or generate device-specific cryptographic key material. While silicon PUFs have been proposed which build on traditional memory structures, in particular SRAM, in this paper we present a memristor-based PUF which utilizes a weak-write mechanism to obtain cell behaviour which is influenced by process variation and hence usable as a PUF response. Using a model-based approach we evaluate memristor PUFs under random process variations and present results on the performance of this new PUF variant.
During the last years, Wireless Sensor Networks (WSN) have been deployed at an accelerated rate. The complexity and low-power requirements of these networks have also been growing. Therefore, WSN developers are beginning to require efficient methodologies for network simulation and embedded SW performance analysis. These tools should also include security analysis, which has to evaluate the vulnerability of a WSN to the wide variety of attacks that these networks could suffer. WSN attacks can also affect the power consumption and performance of a node's software; thus, security analysis has to be integrated into a complete performance analysis framework. This work proposes a methodology to simulate the most common and dangerous attacks that a WSN can suffer nowadays. The impact of these attacks on power consumption and software execution time is also analyzed. This provides developers with important information about the effects that one or multiple attacks could have on the WSN, helping them to develop more secure software.
Index Terms - WSN, Attack Simulation, Power Consumption, Performance Analysis, Security.
Unknown (X) values may emerge during the design process as well as during system operation and test application. Sources of X-values are, for example, black boxes, clock-domain boundaries, analog-to-digital converters, or uncontrolled or uninitialized sequential elements. To compute a detecting pattern for a given stuck-at fault, well-defined logic values are required both for fault activation and for fault effect propagation to observing outputs. In the presence of X-values, classical test generation algorithms, based on topological analysis or on formal Boolean satisfiability (SAT) or BDD-based reasoning, may fail to generate testing patterns or to prove faults untestable. This work proposes the first efficient stuck-at fault ATPG algorithm able to prove the testability or untestability of faults in the presence of X-values. It overcomes the principal inaccuracy and pessimism of classical algorithms when X-values are considered. This accuracy is achieved by mapping the test generation problem to an instance of quantified Boolean formula (QBF) satisfiability. The resulting fault coverage improvement is shown by experimental results on ISCAS benchmarks and larger industrial circuits.
Index Terms - Unknown values, test generation, ATPG, QBF
Low-power SRAMs embed mechanisms for reducing static power consumption. When the SRAM is not accessed for a long period, it switches into an intermediate low-power mode. In this mode, a voltage regulator is used to reduce the voltage supplied to the core-cells as far as possible without data loss. Thus, fault-free behavior of the voltage regulator is crucial for ensuring data retention in core-cells when the SRAM is in low-power mode. This paper investigates the root cause of data retention faults due to voltage regulator malfunctions. This analysis is done under realistic conditions (i.e., industrial core-cells affected by process variations). Based on this analysis, we propose an efficient test flow for detecting data retention faults in low-power SRAMs.
Keywords - SRAM, low-power design, test algorithm, memory test.
Comprehensive coverage of small-delay faults under massive process variations is achieved when multiple paths through the fault locations are sensitized by the test pair set. Using one test pair per path may lead to impractical test set sizes and test application times due to the large number of near-critical paths in state-of-the-art circuits. We present a novel SAT-based dynamic test-pattern compaction and relaxation method for sensitized paths in sequential and combinational circuits. The method identifies necessary assignments for path sensitization and encodes them as a SAT-instance. An efficient implementation of a bitonic sorting network is used to find test patterns maximizing the number of simultaneously sensitized paths. The compaction is combined with an efficient lifting-based relaxation technique. An innovative implication-based path-conflict analysis is used for a fast identification of conflicting paths. Detailed experimental results demonstrate the applicability and quality of the method for academical and industrial benchmark circuits. Compared to fault dropping the number of patterns is significantly reduced by over 85% on average while at the same time leaving more than 70% of the inputs unspecified.
Along with the shrinking CMOS process and rapid design scaling, both the Iddq values of chips and their variation increase. As a result, defect leakages become less significant compared to the full-chip currents, making them harder to distinguish for traditional Iddq diagnosis. Therefore, in this paper, a new approach called σ-Iddq diagnosis is proposed for reinterpreting the original data and diagnosing failing chips intelligently. The overall flow consists of two key components: (1) σ-Iddq transformation and (2) defect-syndrome matching. The σ-Iddq transformation first manifests defect leakages by excluding both the process-variation and design-scaling impacts. Then, defect-syndrome matching applies data mining with a pre-built library to identify the type and locations of defects on the fly. Experimental results show that an average of 93.68% accuracy with a resolution of 1.75 defect suspects can be achieved on ISCAS'89 and IWLS'05 benchmark circuits using a 45nm technology, demonstrating the effectiveness of σ-Iddq diagnosis.
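The σ-Iddq transformation is described as manifesting defect leakages by excluding process-variation impacts; a generic z-score normalization in that spirit (not necessarily the paper's exact transformation, and with invented current values) looks like:

```python
import statistics

def sigma_iddq(measured, good_chip_currents):
    """Re-express a chip's Iddq readings in units of the good-population
    sigma, so a defect leakage buried inside a large full-chip current
    stands out. A plain z-score standing in for the paper's transform."""
    mu = statistics.mean(good_chip_currents)
    sd = statistics.stdev(good_chip_currents)
    return [(i - mu) / sd for i in measured]

good = [10.0, 10.2, 9.8, 10.1, 9.9]  # mA, healthy population
fail = [10.1, 13.5, 10.0]            # one pattern draws excess current
z = sigma_iddq(fail, good)
suspects = [k for k, v in enumerate(z) if v > 3]  # 3-sigma outliers
print(suspects)  # [1]
```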
This paper is an introduction to security challenges for the design of automotive hardware/software architectures. State-of-the-art automotive architectures are highly heterogeneous and complex systems that rely on distributed functions based on electronics and software. As cars are getting more connected with their environment, the vulnerability to attacks is rapidly growing. Examples for such wireless communication are keyless entry systems, WiFi, or Bluetooth. Despite this increasing vulnerability, the design of automotive architectures is still mainly driven by safety and cost issues rather than security. In this paper, we present potential threats and vulnerabilities, and outline upcoming security challenges in automotive architectures. In particular, we discuss the challenges arising in electric vehicles, like the vulnerability to attacks involving tampering with the battery safety. Finally, we discuss future automotive architectures based on Ethernet/IP and how formal verification methods might be used to increase their security.
The performance of High Performance Computing (HPC) systems is already limited by their power consumption. The majority of top HPC systems today are built from commodity server components that were designed for maximizing compute performance. The Mont-Blanc project aims at using low-power parts from the mobile domain for HPC. In this paper, we present our first experiences with the use of mobile processors and accelerators for the HPC domain, based on the research performed in the project. We show an initial evaluation of the NVIDIA Tegra 2 and Tegra 3 mobile SoCs and the NVIDIA Quadro 1000M GPU with a set of HPC microbenchmarks to evaluate their potential for energy-efficient HPC.
The next grail sought by the HPC community is the exascale, 100 times the current scale. This target will not be reached easily, as many challenges lie ahead. The first challenge, energy consumption, has become a strict constraint, with a limit set at 20MW (twice that of the current top supercomputers). Multiplying the number of computing elements will require drastically reducing the power consumption of each of them. The second challenge will be to keep the machine cool: first, because the overall power envelope of 20MW includes the energy for cooling, and second, because those 20MW will be turned into heat by the Joule effect, and the operating temperature of the electronics must be bounded; otherwise, leakage (and thus power consumption) increases and reliability decreases. This brings us to a third challenge, the reliability of the machine: the number of components will be tremendous, so the probability of some of them failing will increase. Failures have to be managed in such a way that applications are not impacted. The last challenge relates to the software stack of these supercomputers: how will we manage billions of threads, and how will we debug them? New paradigms, for instance bags of tasks, are currently being studied to tackle these aspects. These are the challenges we have to solve. In this presentation, brightened up with insight into the Bull roadmap, we present a possible future.
The efficient and flexible management of large datasets is one of the core requirements of modern business applications. Having access to consistent and up-to-date information is the foundation for operational, tactical, and strategic decision making. Within the last few years, the database community sparked a large number of extremely innovative research projects to push the envelope in the context of modern database system architectures. In this paper, we outline requirements and influencing factors to identify some of the hot research topics in database management systems. We argue that, even after 30 years of active database research, the time is right to rethink some of the core architectural principles and come up with novel approaches to meet the requirements of the next decades in data management. The sheer number of diverse and novel (e.g., scientific) application areas, the existence of modern hardware capabilities, and the need of large data centers to become more energy-efficient will be the drivers for database research in the years to come.
This paper presents a performance evaluation and analysis of well-known HPC applications and benchmarks running on low-power embedded platforms. The performance to power consumption ratios are compared to classical x86 systems. Scalability studies have been conducted on the Mont-Blanc Tibidabo cluster. We have also investigated optimization opportunities and pitfalls induced by the use of these new platforms, and proposed optimization strategies based on auto-tuning.
Replacing batteries in wireless sensor nodes by energy harvesting enables maintenance-free operation and an increasing degree of miniaturization, at the cost of higher power management efforts. The limited power capability of environmental sources requires a careful investigation of the different harvesting opportunities to find the optimal source in a specific application scenario. Promising resources in the automotive area are kinetic and thermoelectric harvesters. In this talk, the physical properties of energy converters are analyzed to show their restrictions and allow power estimation. In addition, examples of already established self-sufficient sensors are presented.
Energy harvesting has become a very popular research topic over the last 12 years, but has only made an industrial impact in a few areas, notably in process plant monitoring, including the water and petrochemical processing industries. As with most technologies, greater adoption needs to be realized if performance is to increase and cost to decrease. Batteries cost only tens of pence per Wh, and whilst harvesters can in theory generate very large amounts of energy over a long enough period of operation, a typical harvester can require a capital expenditure of tens to hundreds of pounds, making them unattractive in many applications. The automotive sector is a potential area in which harvesters could provide useful functionality and gain from economies of scale, if they can be made reliable enough, with a high enough power density, and work well in a wide enough variety of scenarios. Recent work on increasing the power density of energy harvesters has focused on improving the power electronic interface, tuning the resonant frequency of motion-driven harvesters, and reducing the power consumption of the load electronics.
Keywords - energy harvesting; adaptive systems; power density
Visions such as the internet of things require vast numbers of sensors distributed in our environment that strongly rely on energy-autonomous circuits. However, the design of such circuits is a challenge currently mastered only by experts: one has to cope with circuit-level design and even technology choices while designing an application. Unfortunately, tools and methods that support cross-layer and cross-domain optimizations are missing.
Keywords - ultra-low power, cross-layer optimization
An energy-harvester-powered wireless sensor node is a complicated system with many design parameters. To investigate the various trade-offs among these parameters, it is desirable to explore the multi-dimensional design space quickly. However, due to the large number of parameters and costly simulation CPU times, it is often difficult or even impossible to explore the design space via simulation. A design of experiment (DoE) approach using the response surface model (RSM) technique can enable fast design space exploration of a complete wireless sensor node powered by a tunable energy harvester. As a proof of concept, a software toolkit has been developed which implements the DoE-based design flow and incorporates the energy harvester, tuning controller and wireless sensor node. Several test scenarios are considered, which illustrate how the proposed approach permits the designer to adjust a wide range of system parameters and evaluate the effect almost instantly but still with high accuracy.
Keywords - energy harvesters, design of experiment, wireless sensor nodes
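The RSM step at the core of a DoE flow can be sketched as an ordinary least-squares fit of a second-order surface to a handful of simulated design points; the two design parameters and the stand-in "simulator" below are invented for illustration:

```python
import numpy as np

def fit_quadratic_surface(X, y):
    """Fit y ~ b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 + b5*x1*x2,
    the classic second-order response surface model."""
    x1, x2 = X[:, 0], X[:, 1]
    A = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

def surface(coeffs, x1, x2):
    """Evaluate the fitted surface at an arbitrary design point."""
    return coeffs @ np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

# A tiny "design of experiments": sample the (expensive) simulator on a
# 3x3 grid; here the simulator is replaced by a known quadratic.
grid = np.array([[a, b] for a in (0.0, 0.5, 1.0) for b in (0.0, 0.5, 1.0)])
y = 2 + 3 * grid[:, 0] - grid[:, 1] ** 2 + grid[:, 0] * grid[:, 1]
c = fit_quadratic_surface(grid, y)

# The surrogate now answers "what if" queries without re-simulation.
print(round(float(surface(c, 0.25, 0.75)), 3))  # 2.375
```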
The main challenge in post-silicon debug is the lack of observability of the internal signals of a chip. Trace buffer technology provides one avenue to address this challenge by online tracing of a few selected state elements. Due to the limited bandwidth of the trace buffer, only a few state elements can be selected for tracing. Recent research has focused on the automated trace signal selection problem, aiming to maximize restoration of the untraced state elements using the few traced signals. Existing techniques can be categorized into high-quality but slow "simulation-based" and lower-quality but much faster "metric-based" techniques. This work presents a new trace signal selection technique whose quality is comparable to or better than that of simulation-based techniques, while its runtime is fast, comparable to the metric-based techniques.
The exponentially growing complexity of modern processors intensifies verification challenges. Traditional pre-silicon verification covers less and less of the design space, resulting in increasing post-silicon validation effort. A critical challenge is the manual debugging of intermittent failures on prototype chips, where multiple executions of a same test do not yield a consistent outcome. We leverage the power of machine learning to support automatic diagnosis of these difficult, inconsistent bugs. During post-silicon validation, lightweight hardware logs a compact measurement of observed signal activity over multiple executions of a same test: some may pass, some may fail. Our novel algorithm applies anomaly detection techniques similar to those used to detect credit card fraud to identify the approximate cycle of a bug's occurrence and a set of candidate root-cause signals. Compared against other state-of-the-art solutions in this space, our new approach can locate the time of a bug's occurrence with nearly 4x better accuracy when applied to the complex OpenSPARC T2 design.
The internal state of complex modern processors often needs to be dumped out frequently during post-silicon validation. Since the last-level cache (considered L2 in this paper) holds most of the state, the volume of data dumped and the transfer time are dominated by the L2 cache. The limited bandwidth to transfer data off-chip, coupled with the large size of the L2 cache, results in stalling the processor for long durations when dumping the cache contents off-chip. To alleviate this, we propose to transfer only those cache lines that were updated since the previous dump. Since maintaining a bit-vector with a separate bit to track the status of individual cache lines is expensive, we propose two methods: (i) a bit-vector in which one bit tracks multiple cache lines, and (ii) an Interval Table which stores only the starting and ending addresses of continuous runs of updated cache lines. Both methods require significantly less space compared to a full bit-vector, and allow the designer to choose the amount of space to allocate for this design-for-debug (DFD) feature. The impact of reducing storage space is that some non-updated cache lines are dumped too; we attempt to minimize such overheads. Further, the Interval Table is independent of the cache size, which makes it ideal for large caches. Through experimentation, we also determine the break-even point below which a t-lines/bit bit-vector is beneficial compared to an Interval Table.
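The Interval Table itself is easy to sketch: compress the sorted indices of updated cache lines into (start, end) runs, so the dump transfers only those runs. The merging policy used when the table overflows its allocated space is omitted here, and the line indices are an invented example:

```python
def build_interval_table(updated_lines):
    """Compress a sorted list of updated cache-line indices into
    (start, end) runs of consecutive lines."""
    if not updated_lines:
        return []
    table = []
    start = prev = updated_lines[0]
    for line in updated_lines[1:]:
        if line == prev + 1:           # extends the current run
            prev = line
        else:                          # gap: close run, open a new one
            table.append((start, prev))
            start = prev = line
    table.append((start, prev))
    return table

# Lines 3-5 and 9-10 were written since the last dump.
print(build_interval_table([3, 4, 5, 9, 10]))  # [(3, 5), (9, 10)]
```

Note the table size depends only on the number of runs, not on the cache size, which matches the abstract's observation that the structure suits large caches.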
This paper introduces a novel approach to the statistical analysis of modern high-speed I/O and similar communication links that is capable of reliably determining extremely low (~10^-12 or lower) bit error rates (BER) using techniques from extreme value theory (EVT). The new method requires only a small number of voltage samples at the received eye center, which can be generated by running circuit/system-level simulations or by measuring fabricated I/O circuits, to predict link BERs. Unlike conventional techniques, no simplifying assumptions on link noise and interference sources are required, making this approach extremely portable to any communication system operating with very low BER. Our experimental results show that the BER estimates from the proposed methodology are on the same order of magnitude as traditional time-domain, transient eye diagram simulations for links with BERs of 10^-6 and 10^-5 operating at 9.6 and 10.1 Gbps, respectively.
Index Terms - BER, EVT, I/O Links
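A minimal sketch of the EVT workflow the abstract describes, under invented conditions: synthetic Gaussian eye-center noise, a method-of-moments Gumbel fit to per-block maxima, and extrapolation of the probability that noise exceeds a hypothetical 0.08 V decision margin. The paper's actual estimator, measurement data, and margins are not reproduced here.

```python
# Hedged illustration of EVT-based tail extrapolation (not the paper's
# exact method). Block size, noise model, and margin are assumptions.
import math
import random

def gumbel_fit(maxima):
    """Method-of-moments Gumbel fit: returns (location mu, scale beta)."""
    n = len(maxima)
    mean = sum(maxima) / n
    var = sum((x - mean) ** 2 for x in maxima) / n
    beta = math.sqrt(6 * var) / math.pi
    mu = mean - 0.5772156649 * beta   # Euler-Mascheroni constant
    return mu, beta

def exceedance_prob(mu, beta, margin):
    """P(block maximum of noise > margin) under the fitted Gumbel."""
    return 1.0 - math.exp(-math.exp(-(margin - mu) / beta))

random.seed(0)
block = 256
# Synthetic eye-center noise samples (volts), stand-ins for simulation output.
samples = [random.gauss(0.0, 0.01) for _ in range(block * 200)]
maxima = [max(samples[i:i + block]) for i in range(0, len(samples), block)]
mu, beta = gumbel_fit(maxima)
# Probability per block that noise eats the whole 0.08 V margin;
# dividing by the block size gives a crude per-bit error rate estimate.
p_block = exceedance_prob(mu, beta, 0.08)
print(mu, beta, p_block / block)
```

The point of the EVT step is visible here: the fitted tail lets us quote error probabilities far below anything directly observable in the 51,200 samples, which is why only a small number of eye-center samples is needed.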
During various stages of hardware design, different types of control signals are introduced: clock and reset are specified and connected at the RTL stage, whereas signals such as scan enable, isolation enable, and power switch enable are added to implemented devices later in the flow.
The quality of Top-Level Control Signals (TLCS) has a direct impact on the quality of static verification, which is used to verify the intended connectivity and functionality of the fan-out networks corresponding to TLCS. Typically, users need to specify these TLCS (along with their intended types) for such static verification. But when the TLCS are not known to the verification engineer, reverse-engineering the clock, reset, and scan networks implemented in a design becomes a non-trivial task.
This paper proposes a framework to automatically generate a list of TLCS pertaining to the implemented design. The framework describes a heuristic-based analysis of fan-in cones, traversing backwards from the leaf cell instance pins. It is independent of design style(s), as its core strength lies in its capability to dynamically adapt to new discoveries of design elements made during the traversal.
Keywords - Inference of Top Level Control Signals, Static Verification, Low Power
Caches provide significant performance improvements, yet their use in the real-time industry is low because current WCET analysis tools require detailed knowledge of a program's cache accesses to provide tight WCET estimates. Probabilistic Timing Analysis (PTA) has emerged as a solution to reduce the amount of information needed to provide tight WCET estimates, although it imposes new requirements on hardware design. At the cache level, so far only fully-associative random-replacement caches have been proven to fulfill the needs of PTA, but they are expensive in size and energy. In this paper we propose a cache design that allows set-associative and direct-mapped caches to be analysed with PTA techniques. In particular, we propose a novel parametric random placement suitable for PTA that is proven to have low hardware complexity and energy consumption while providing performance comparable to that of conventional modulo placement.
This paper presents a novel Multicore Architecture for Real-Time Hybrid Applications (MARTHA) with time-predictable execution, low computational latency, and high performance that meets the requirements for control, emulation and estimation of next-generation power electronics and smart grid systems. Generic general-purpose architectures running real-time operating systems (RTOS) or quality of service (QoS) schedulers have not been able to meet the hard real-time constraints required by these applications. We present a framework based on switched hybrid automata for modeling power electronics applications. Our approach allows a large class of power electronics circuits to be expressed as switched hybrid models which can be executed on a single hardware platform.
Complex Systems-on-Chips (SoC) are mixed time-criticality systems that have to support firm real-time (FRT) and soft real-time (SRT) applications running in parallel. This is challenging for critical SoC components, such as memory controllers. Existing memory controllers focus on either firm real-time or soft real-time applications. FRT controllers use a close-page policy that maximizes worst-case performance and ignore opportunities to exploit locality, since it cannot be guaranteed. Conversely, SRT controllers try to reduce latency and consequently processor stalling by speculating on locality. They often use an open-page policy that sacrifices guaranteed performance, but is beneficial in the average case. This paper proposes a conservative open-page policy that improves average-case performance of a FRT controller in terms of bandwidth and latency without sacrificing real-time guarantees. As a result, the memory controller efficiently handles both FRT and SRT applications. The policy keeps pages open as long as possible without sacrificing guarantees and captures locality in this window. Experimental results show that on average 70% of the locality is captured for applications in the CHStone benchmark, reducing the execution time by 17% compared to a close-page policy. The effectiveness of the policy is also evaluated in a multi-application use-case, and we show that the overall average-case performance improves if there is at least one FRT or SRT application that exploits locality.
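A toy model of the conservative open-page idea: a row stays open only inside a bounded window, so hits inside the window capture locality while accesses outside it fall back to close-page behaviour and its worst-case guarantees. The window length, the trace, and the `ConservativeOpenPage` class are all hypothetical, chosen only to illustrate the policy; the paper's controller derives the window from actual DRAM timing guarantees.

```python
# Illustrative sketch (not the paper's memory controller): a page may
# stay open only for a fixed window after activation, so the worst-case
# latency promised to firm real-time clients is never exceeded.
class ConservativeOpenPage:
    def __init__(self, window):
        self.window = window      # cycles a row may stay open past activation
        self.open_row = None
        self.opened_at = None

    def access(self, row, now):
        """Return 'hit' when locality is captured inside the window,
        otherwise 'miss' (precharge + activate, as close-page would)."""
        if self.open_row == row and now - self.opened_at <= self.window:
            return "hit"
        self.open_row, self.opened_at = row, now
        return "miss"

mc = ConservativeOpenPage(window=10)
trace = [(0, 7), (3, 7), (8, 7), (20, 7), (22, 9)]   # (cycle, row) pairs
results = [mc.access(row, t) for t, row in trace]
print(results)
```

The accesses at cycles 3 and 8 hit because row 7 is still inside the window; the access at cycle 20 misses even though the row matches, which is the "conservative" part: locality is only exploited where the guarantee still holds.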
The current trend in embedded computing consists of increasing the number of processing resources on a chip. Following this paradigm, the STMicroelectronics/CEA Platform 2012 (P2012) project designed an area- and power-efficient many-core accelerator to answer the computing-power needs of next-generation data-intensive embedded applications. Synchronization handling on this architecture is critical, since the speed-ups of parallel implementations of embedded applications strongly depend on the ability to exploit the largest possible number of cores while limiting task management overhead. This paper presents the HardWare Synchronizer (HWS), a flexible hardware accelerator for synchronization operations in the P2012 architecture. Experiments on a multi-core test chip showed that the HWS incurs less than 1% area overhead while reducing synchronization latencies (by up to 2.8 times) and contentions.
Due to the latest advances in semiconductor integration, systems are becoming more susceptible to faults leading to temporary or permanent failures. We propose a new architecture extension, suitable for arrays of functional units (FUs), that provides testing and replacement of faulty units without interrupting normal system operation. The extension relies on datapath switching realized by the proposed hot-swapping algorithm and structures, by means of which functional units are tested and replaced by spares at lower overheads than traditional modular redundancy. For a case-study architecture, hot-swapping support could be added with only 29% area overhead. In this paper we focus on the experimental evaluation of the hot-swapping system on a chip fabricated in a 65nm CMOS process. Autonomous testing of the hot-swapping system is enhanced with back-bias circuitry to attain an early fault detection and restoration system. Experimental measurements prove that the proposed concept works well, predicting fault occurrence with a configurable prediction interval, while power measurements reveal that with only 20% power overhead the proposed system can attain reliability levels similar to triple modular redundancy. Additionally, measurements reveal that manufacturing randomness across the die can significantly influence the reliability of identical sub-circuits located in different parts of the die, even though identical layouts are employed.
We present a variation-tolerant tasking technique for tightly-coupled shared memory processor clusters that relies upon modeling advances across the hardware/software interface. This is implemented as an extension to the OpenMP 3.0 tasking programming model. Using the notion of Task-Level Vulnerability (TLV) proposed here, we capture dynamic variations caused by circuit-level variability as high-level software knowledge. This is accomplished through a variation-aware hardware/software codesign where: (i) hardware features variability monitors in conjunction with online per-core characterization of TLV metadata; (ii) software supports a Task-level Errant Instruction Management (TEIM) technique to utilize TLV metadata in the runtime OpenMP task scheduler. This method greatly reduces the number of recovery cycles compared to the baseline OpenMP scheduler; consequently, the instructions per cycle (IPC) of a 16-core processor cluster is increased by up to 1.51x (1.17x on average). We evaluate the effectiveness of our approach with various numbers of cores (4, 8, 12, 16) and across a wide temperature range (ΔT=90°C).
The downscaling of technology features has brought an important design criterion, reliability, into prime consideration for system developers. Due to external radiation effects and temperature gradients, CMOS devices are no longer guaranteed to function flawlessly. On the other hand, admitting occasional errors allows the power budget to be extended. The power-performance-reliability trade-off compounds the system design challenge, for which an efficient design exploration framework is needed. In this work, we present a high-level processor design framework extended with two reliability estimation techniques. The first is a simulation-based technique, which allows a generic instruction-set simulator to estimate reliability via a high-level fault injection capability. The second is a novel analytical technique based on a reliability model for the coarse arithmetic-logic operator blocks within a processor instruction. The techniques are tested with a RISC processor and several embedded application kernels. Our results show the efficiency and accuracy of these techniques against an HDL-level reliability estimation framework.
Keywords - Reliability Estimation; High-level Processor Design; Fault Simulation
In an effort to reduce the cost of specification testing in analog/RF circuits, spatial correlation modeling of wafer-level measurements has recently attracted increased attention. Existing approaches for capturing and leveraging such correlation, however, rely on the assumption that spatial variation is smooth and continuous. This, in turn, limits the effectiveness of these methods on actual production data, which often exhibits localized, discontinuous spatial effects. In this work, we propose a novel approach which enables spatial correlation modeling of wafer-level analog/RF tests to handle such effects and, thereby, to drastically reduce prediction error for measurements exhibiting discontinuous spatial patterns. The core of the proposed approach is a k-means algorithm which partitions a wafer into k clusters, as caused by discontinuous effects. Individual correlation models are then constructed within each cluster, revoking the assumption that spatial patterns should be smooth and continuous across the entire wafer. The effectiveness of the proposed approach is evaluated on industrial probe test data from more than 3,400 wafers, revealing significant error reduction over existing approaches.
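The partitioning step can be illustrated with a plain Lloyd's k-means over (x, y, measurement) triples. Everything here is a synthetic stand-in: the wafer grid, the "shifted" region emulating a discontinuous effect, and the deterministic initialisation; the paper's feature scaling and per-cluster correlation models are not shown.

```python
# Illustrative k-means partitioning of die sites (hypothetical data):
# a separate spatial correlation model would then be fit per cluster.
def kmeans(points, k, iters=20):
    # Deterministic initialisation: k sites spread across the list,
    # chosen only so the example is reproducible.
    centers = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[j].append(p)
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

# 10x10 wafer grid with a discontinuous shift on half the wafer:
# the measurement jumps from 0.0 to 10.0 at x = 5.
pts = [(x, y, 0.0 if x < 5 else 10.0) for x in range(10) for y in range(10)]
centers, groups = kmeans(pts, k=2)
print(sorted(len(g) for g in groups))   # site counts per cluster
```

Because the measurement feature dominates the distance, the two clusters recover exactly the two spatial regimes, which is the property the per-cluster correlation models rely on.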
Advances in digital microfluidics and integrated sensing hold promise for a new generation of droplet-based biochips that can perform multiplexed assays to determine the identity of target molecules. Despite these benefits, defects and erroneous fluidic operations remain a major barrier to the adoption and deployment of these devices. We describe the first integrated demonstration of cyberphysical coupling in digital microfluidics, whereby errors in droplet transportation on the digital microfluidic platform are detected using capacitive sensors, the test outcome is interpreted by control hardware, and software-based error recovery is accomplished using dynamic reconfiguration. The hardware/software interface is realized through seamless interaction between control software, an off-the-shelf microcontroller, and a frequency divider implemented on an FPGA. Experimental results are reported for a fabricated silicon device, and links to videos are provided for the first-ever experimental demonstration of cyberphysical coupling and dynamic error recovery in digital microfluidic biochips.
High test quality can be achieved through defect-oriented testing using an analog fault modeling approach. However, this approach is computationally demanding and typically hard to apply to large-scale circuits. In this work, we use an improved inductive fault analysis approach to locate potential faults at the layout level and to calculate the relative probability of each fault. Our proposed method yields actionable results such as the fault coverage of each test, potential faults, and the probability of each fault. We show that the computational requirement can be significantly reduced by incorporating fault probabilities. These results can be used to improve fault coverage or to improve the defect resilience of the circuit.
Testing and calibration of MEMS devices require physical stimulus, which results in the need for specialized test equipment and thus high test cost. It has been shown for various types of sensors that electrical stimulation can be used to facilitate lower cost calibration. In this paper, we present an electrical stimulus based test and calibration technique for overdamped spring-mass capacitive accelerometers which require the characterization of stationary and dynamic calibration coefficients. We show that these two coefficients can be electrically obtained.
Some data- and compute-intensive applications can be accelerated by offloading portions of their code to platforms such as GPGPUs or FPGAs. However, to get high performance for these kernels, it is mandatory to restructure the application, to generate adequate communication mechanisms for the transfer of remote data, and to make good use of the memory bandwidth. In the context of high-level synthesis (HLS) of hardware accelerators on FPGA from a C program, we show how to automatically generate optimized remote accesses for an accelerator communicating with an external DDR memory. Loop tiling is used to enable block communications, suitable for DDR memories. Pipelined communication processes are generated to overlap communications and computations, thereby hiding some latencies, in a way similar to double buffering. Finally, not only intra-tile but also inter-tile data reuse is exploited to avoid remote accesses when data are already available in the local memory. Our first contribution is to show how to generate the sets of data to be read from (resp. written to) the external memory just before (resp. after) each tile so as to reduce communications and reuse data as much as possible in the accelerator. The main difficulty arises when some data may be (re)defined in the accelerator and should be kept locally. Our second contribution is an optimized code generation scheme, entirely at source level, i.e., in C, that allows us to compile all the necessary glue (the communication processes) with the same HLS tool as for the computation kernel. Both contributions use advanced polyhedral techniques for program analysis and transformation. Experiments with Altera HLS tools demonstrate how to use our techniques to efficiently map C kernels to FPGA.
Synchronous languages ensure deterministic concurrency, but at the price of heavy restrictions on what programs are considered valid, or constructive. Meanwhile, sequential languages such as C and Java offer an intuitive, familiar programming paradigm but provide no guarantees with regard to deterministic concurrency. The sequentially constructive model of computation (SC MoC) presented here harnesses the synchronous execution model to achieve deterministic concurrency while addressing concerns that synchronous languages are unnecessarily restrictive and difficult to adopt. In essence, the SC MoC extends the classical synchronous MoC by allowing variables to be read and written in any order as long as sequentiality expressed in the program provides sufficient scheduling information to rule out race conditions. The SC MoC is a conservative extension in that programs considered constructive in the common synchronous MoC are also SC and retain the same semantics. In this paper, we identify classes of variable accesses, define sequential constructiveness based on the concept of SC-admissible scheduling, and present a priority-based scheduling algorithm for analyzing and compiling SC programs.
Recently, source-level software models are increasingly used for software simulation in TLM (Transaction Level Modeling)-based virtual prototypes of multicore systems. A source-level model is generated by annotating timing information into application source code and allows for very fast software simulation. Accurate cache simulation is a key issue in multicore systems design because the memory subsystem accounts for a large portion of system performance. However, cache simulation at source level faces two major problems: (1) as target data addresses cannot be statically resolved during source code instrumentation, accurate data cache simulation is very difficult at source level, and (2) cache simulation brings large overhead in simulation performance and therefore cancels the gain of source level simulation. In this paper, we present a novel approach for accurate data cache simulation at source level. In addition, we also propose a cache modeling method to accelerate both instruction and data cache simulation. Our experiments show that simulation with the fast cache model achieves 450.7 MIPS (million simulated instructions per second) on a standard x86 laptop, 2.3x speedup compared with a standard cache model. The source-level models with cache simulation achieve accuracy comparable to an Instruction Set Simulator (ISS). We also use a complex multimedia application to demonstrate the efficiency of the proposed approach for multicore systems design.
Limited Local Memory (LLM) multi-core architectures substitute scratch pad memories (SPM) for caches, and therefore have much lower power consumption. As they lack automatic memory management, programming such architectures becomes challenging, in the sense that it requires the programmer/compiler to efficiently manage the limited local memory. Managing the heap data of tasks executing on the cores of an LLM multi-core is an important problem. This paper presents a fully automated and efficient scheme for heap data management. Specifically, we propose i) code transformation for automation of heap management, with seamless support for multi-level pointers, and ii) improved data structures to more efficiently manage unlimited heap data. Experimental results on several benchmarks from MiBench demonstrate an average 43% performance improvement over the previous approach.
Phase Change Memory (PCM) is a promising DRAM replacement in embedded systems due to its attractive characteristics. However, its relatively low endurance has limited its practical applications. In this paper, in addition to existing hardware-level optimizations, we propose software-enabled wear-leveling techniques to further extend PCM's lifetime when it is adopted in embedded systems. A polynomial-time algorithm, the Software Wear-Leveling (SWL) algorithm, is proposed to achieve wear-leveling without hardware overhead. According to the experimental results, the proposed technique reduces the number of writes on the most-written bits by more than 80% compared with a greedy algorithm, and by around 60% compared with the existing Optimal Data Allocation (ODA) algorithm, with under 6% memory access overhead.
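The effect of software wear-leveling can be shown with a toy model. This is not the paper's SWL algorithm: it is a deliberately simple round-robin remapping of a hot variable across a few PCM slots, invented here only to show why spreading writes lowers the wear on the most-written location.

```python
# Toy wear-leveling model (hypothetical, not the SWL algorithm):
# rotate a frequently written variable across several PCM slots
# instead of always writing the same location.
def simulate(writes, slots, rotate):
    wear = [0] * slots      # write counts per PCM slot
    pos = 0
    for _ in range(writes):
        wear[pos] += 1
        if rotate:
            pos = (pos + 1) % slots   # round-robin remapping
    return max(wear)        # wear on the most-written slot

hot_writes = 10_000
print(simulate(hot_writes, slots=8, rotate=False))  # all wear on one slot
print(simulate(hot_writes, slots=8, rotate=True))   # wear spread evenly
```

Without rotation the hottest slot absorbs all 10,000 writes; with rotation across 8 slots it absorbs 1,250, an 8x reduction in peak wear — the same kind of reduction on most-written bits that the abstract reports for SWL, obtained there by a smarter data allocation.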
Probabilistic timing analysis (PTA), a promising alternative to traditional worst-case execution time (WCET) analyses, enables pairing time bounds (named probabilistic WCET, or pWCET) with an exceedance probability (e.g., 10^-16), resulting in far tighter bounds than conventional analyses. However, the applicability of PTA has been limited by its dependence on relatively exotic hardware: fully-associative caches using random replacement. This paper extends the applicability of PTA to conventional cache designs via a software-only approach. We show that, by using a combination of compiler techniques and runtime system support to randomise the memory layout of both code and data, conventional caches behave as fully-associative ones with random replacement.
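The core observation can be demonstrated with a small model. In a direct-mapped cache, two objects either always conflict or never do, run after run; once their placement is randomised at load time, the conflict becomes a random event with a well-defined probability, which is the property PTA needs. The cache geometry, addresses, and randomisation scheme below are hypothetical, not the paper's implementation.

```python
# Sketch: layout randomisation turns deterministic cache conflicts
# into probabilistic ones (hypothetical cache geometry and addresses).
import random

SETS, LINE = 64, 32          # 64-set direct-mapped cache, 32-byte lines

def cache_set(addr):
    return (addr // LINE) % SETS

def conflicts(base_a, base_b):
    """True if the first line of object A and of object B share a set."""
    return cache_set(base_a) == cache_set(base_b)

# Fixed layout: A and B land in the same set on every single run.
fixed = [conflicts(0x1000, 0x1000 + SETS * LINE) for _ in range(1000)]

# Randomised layout: each "run" picks fresh line-aligned base addresses,
# so a conflict occurs with probability ~1/SETS per run.
rng = random.Random(42)
randomised = [conflicts(rng.randrange(0, 1 << 20, LINE),
                        rng.randrange(0, 1 << 20, LINE))
              for _ in range(1000)]
print(all(fixed), sum(randomised) / 1000)
```

The fixed layout conflicts in 100% of runs, while the randomised layout conflicts in roughly 1/64 of them — the deterministic pathological case has been converted into a low, quantifiable probability that a pWCET curve can bound.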
Recent trends in embedded system architectures brought a rapid shift towards multicore, heterogeneous and reconfigurable platforms. This imposes a large effort for programmers to develop their applications to efficiently exploit the underlying architecture. In addition, process variability issues lead to performance and power uncertainties, impacting expected quality of service and energy efficiency of the running software. In particular, variability may lead to sub-optimal runtime task allocation. In this paper we present a holistic approach to tackle these issues exploiting high level HW/SW modeling to customize the runtime library. The customization introduces variability awareness in task allocation decisions, with the final purpose of optimizing a given objective: Execution time, power consumption, or overall energy consumption. We present a complete walkthrough, from top-level modeling down to variability-aware execution using a parallelized computational kernel running on a next generation, NoC based, heterogeneous multicore simulation platform.
Near-Threshold Voltage (NTV) operation of a CMOS design is defined as the voltage-frequency operating point where the energy consumed per compute operation (pJ/op) reaches a minimum, or the energy efficiency (Mops/Watt) peaks. Typically, this operating voltage is above the nominal threshold voltage of the transistor. The peak efficiency is achieved by a balance of switching energy and idle or leakage energy.
Today's MPSoC applications require a convergence between very high speed and ultra-low power. Ultra-Wide Voltage Range (UWVR) capability appears as a solution for high energy efficiency, with the objective of improving speed at very low voltage and decreasing power at high speed. Using Fully Depleted Silicon-On-Insulator (FDSOI) devices significantly improves the trade-off between leakage, variability, and speed, even at low voltage. A full design framework is presented for UWVR operation using FDSOI Ultra-Thin Body and Box technology, considering power management, multi-VT enablement, standard cell design, and SRAM bitcells. Technology performance is demonstrated on an ARM A9 critical path, showing a speed increase from 40% to 200% without added energy cost. Conversely, when performance is not required, FDSOI enables leakage power reduction of up to 10X using Reverse Body Biasing.
Keywords - energy efficiency, low voltage, adaptive architectures, FDSOI, Ultra Thin Body and Box
Carbon Nanotube Field-Effect Transistors (CNFETs) are excellent candidates for building highly energy-efficient digital systems. However, imperfections inherent in carbon nanotubes (CNTs) pose significant hurdles to realizing practical CNFET circuits. In order to achieve CNFET VLSI systems in the presence of these inherent imperfections, careful orchestration of design and processing is required: from device processing and circuit integration, all the way to large-scale system design and optimization. In this paper, we summarize the key ideas that enabled the first experimental demonstration of CNFET arithmetic and storage elements. We also present an overview of a probabilistic framework to analyze the impact of various CNFET circuit design techniques and CNT processing options on system-level energy and delay metrics. We demonstrate how this framework can be used to improve the energy-delay-product (EDP) of CNFET-based digital systems.
Keywords - Carbon Nanotube; CNT; CNFET; Nanotechnology; Modeling; Imperfection; Variation; Three-Dimensional Circuits
Vertically stacked nanowire FETs (NWFETs) with a gate-all-around structure are the natural and most advanced extension of FinFETs. At advanced technology nodes, many devices exhibit ambipolar behavior, i.e., the device shows n- and p-type characteristics simultaneously. In this paper, we show that, by engineering the contacts and by constructing independent double-gate structures, the device polarity can be electrostatically programmed to be either n- or p-type. Such a device enables a compact realization of XOR-based logic functions at the cost of a denser interconnect. To mitigate the added area/routing overhead caused by the additional gate, an approach for designing an efficient regular layout, called Sea-of-Tiles, is presented. Then, specific logic synthesis techniques, supporting the higher expressive power provided by this technology, are introduced and used to showcase the performance of controllable-polarity NWFET circuits in comparison with traditional CMOS circuits.
Keywords - Nanowire transistors; controllable polarity; regular fabrics; XOR logic synthesis
Parallel programming requires the definition of shared-memory semantics by means of a consistency model, which affects how the parallel hardware is designed. Therefore, verifying hardware compliance with a consistency model is a relevant problem, whose complexity depends on the observability of memory events. Post-silicon checkers analyze a single sequence of events per core, and so do most pre-silicon checkers, although one reported method samples two sequences per core. Moreover, most are post-mortem checkers requiring the whole sequence of events to be available prior to verification. In contrast, this paper describes a novel on-the-fly technique for verifying memory consistency from an executable representation of a multicore system. To increase efficiency without hampering verification guarantees, three points are monitored per core. The sampling points are selected to be largely independent of the core's microarchitecture. The technique relies on concurrent relaxed scoreboards to check for consistency violations in each core. To check for global violations, it employs a linear order of events induced by a given test case. We prove that the technique indicates neither false negatives nor false positives when the test case exposes an error that affects the sampled sequences, making it the first on-the-fly checker with full guarantees. We compare our technique with two post-mortem checkers under 2400 scenarios for platforms with 2 to 8 cores. The results show that our technique is at least 100 times faster than a checker sampling a single sequence per processor, and that it needs approximately 1/4 to 3/4 of the overall verification effort required by a post-mortem checker sampling two sequences per processor.
Host-compiled simulation has been proposed for software performance estimation because of its high simulation speed. However, the simulation speed may be significantly lowered by the cache simulation overhead. In this paper, we propose an approach that removes much of the cache simulation overhead while still calculating cache misses precisely. For the instruction cache, we statically analyze possible cache conflicts and perform cache-conflict-aware annotation for host-compiled simulation. Within loops, conflicts are dynamically captured by tagging the basic blocks instead of performing the expensive cache simulation. In this way, the vast majority of cache accesses can be exempted from simulation. For the data cache, aggregated cache simulation is used for large data blocks. Further, data locality can be bounded by considering the data allocation principle of a program. Experiments show that our approach improves the speed of host-compiled simulation by one order of magnitude while providing cache miss numbers with high accuracy.
This paper proposes a critical-section-level timing synchronization approach for deterministic Multi-Core Instruction-Set Simulation (MCISS). By synchronizing at each lock access instead of at every shared-variable access, and by using a simple lock usage status managing scheme, our approach significantly improves simulation performance while executing all critical sections in a deterministic order. Experiments show that our approach performs 295% faster than the shared-variable synchronization approach on average and can effectively facilitate system-level software/hardware co-simulation.
Keywords - Deterministic, Multi-core instruction-set simulation, Timing Synchronization
Extremely long simulation times have been a major impediment to the wide applicability of architectural simulators. To accelerate architectural simulation, prior researchers have proposed representative sampling simulation, trading a small loss of accuracy for notable speed improvement. Generally, they use fine-grained phase analysis to select only a small representative portion of program execution intervals for detailed cycle-accurate simulation, while functionally simulating the remaining portion. However, though phase granularity is one of the most important factors affecting simulation speed, it has not been well investigated, and most prior research explores fine-grained schemes. This limits their effectiveness in further improving simulation speed given increasingly complex architectural designs and new, lengthy benchmarks. In this paper, by analyzing the impact of phase granularity on simulation speed, we observe that coarse-grained phases can better capture the overall program characteristics with fewer phases, and the last representative phase can be classified at a very early program position, leading to fewer execution intervals being functionally simulated. By contrast, fine-grained phases usually have much shorter execution intervals, and thus the overall detailed simulation time can be reduced. Based on these observations, we design a multi-level sampling simulation technique that combines both fine-grained and coarse-grained phase analysis. Such a scheme uses fine-grained simulation points to represent only the selected coarse-grained simulation points instead of the entire program execution, so it can further reduce both the functional and detailed simulation time. Experimental results using SPEC2000 show that such a framework is effective: using the SimPoint method as a baseline, it reduces functional simulation time by about 90% and detailed simulation time by about 50%, finally achieving a geometric mean speedup of 14.04x over SimPoint with comparable accuracy.
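The two-level selection can be sketched on synthetic data. The `representatives` function below is a deliberately crude stand-in for phase classification (grouping intervals by their rounded mean rather than by basic-block vectors), invented only to show the structure: coarse representatives are picked first, then fine representatives are picked only inside them.

```python
# Illustrative two-level sampling (hypothetical data and phase metric,
# not SimPoint or the paper's classifier).
def representatives(signature, granularity):
    """Group consecutive intervals of `granularity` by a crude feature
    (their rounded mean) and keep the first interval of each group."""
    reps = []
    seen = set()
    for start in range(0, len(signature), granularity):
        chunk = signature[start:start + granularity]
        key = round(sum(chunk) / len(chunk))
        if key not in seen:
            seen.add(key)
            reps.append((start, start + len(chunk)))
    return reps

# A synthetic per-interval IPC-like trace with two program phases.
trace = [1] * 400 + [5] * 400 + [1] * 200

# Level 1: coarse-grained representative phases of the whole execution.
coarse = representatives(trace, granularity=200)
# Level 2: fine-grained points only inside the coarse representatives.
fine = [(s + a, s + b) for s, e in coarse
        for a, b in representatives(trace[s:e], 20)]
print(coarse, fine)
```

Of the 1,000 intervals, only two coarse representatives (400 intervals) need functional simulation, and only two fine points (40 intervals) need detailed simulation — the same double reduction in functional and detailed simulation time the abstract claims, in miniature.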
The need for detailed simulation of integrated circuits has received significant attention since the early stages of design automation. Given the increasing device integration, these simulations have extreme memory footprints, especially within unified memory hierarchies. This paper overcomes the infeasible memory demands of modern circuit simulators. Structural partitioning of the netlist and temporal partitioning of the input signals allow distributed execution with minimal memory requirements. The proposed framework is validated with simulations of a circuit with more than 10^6 MOSFET devices. In comparison to a commercial tool, we observe minimal error and even a 2.35x speedup for moderate netlist sizes. The proposed framework is shown to be highly reusable across a variety of execution platforms.
Hardware coprocessors are extensively used in modern heterogeneous systems-on-chip (SoC) designs to provide efficient implementation of application-specific functions. Customized coprocessor synthesis exploits design space exploration to derive Pareto optimal design configurations for a set of targeted metrics. Existing exploration strategies for coprocessor synthesis have been focused on either time consuming iterative scheduling approaches or ad-hoc sampling of the solution space guided by the designer's experience. In this paper, we introduce a meta-model assisted exploration framework that eliminates the aforementioned drawbacks by using response surface models (RSMs) for generating customized coprocessor architectures. The methodology is based on the construction of analytical delay and area models for predicting the quality of the design points without resorting to costly architectural synthesis procedures. Various RSM techniques are evaluated with respect to their accuracy and convergence. We show that the targeted solution space can be accurately modeled through RSMs, thus enabling a speedup of the overall exploration runtime without compromising the quality of results. Comparative experimental results, over a set of real-life benchmarks, prove the effectiveness of the proposed approach in terms of quality improvements of the design solutions and exploration runtime reductions. An MPEG-2 decoder case study describes how the proposed approach can be exploited for customizing the architecture of two hardware accelerated kernels.
This work presents an energy-efficient memory hierarchy for Motion and Disparity Estimation in Multiview Video Coding, employing a Reference Frames-Centered Data Reuse (RCDR) scheme. In RCDR the reference search window becomes the center of the motion/disparity estimation processing flow and calls for processing all blocks that request its data. By doing so, RCDR avoids multiple search window retransmissions, reducing the number of external memory accesses and thus the memory energy. To deal with out-of-order processing and further reduce external memory traffic, a statistics-based partial-results compressor is developed. The on-chip video memory energy is reduced by employing a statistical power-gating scheme and candidate block reordering. Experimental results show that our reference-centered memory hierarchy outperforms the state of the art, providing reductions of up to 71% in external memory energy, 88% in on-chip memory static energy, and 65% in on-chip memory dynamic energy.
Index Terms - Multiview Video Coding, MVC, 3D-Video, Low-Power Design, On-Chip Video Memory, Application-Aware DPM, Memory Hierarchy, Energy Efficiency, Motion Estimation, Disparity Estimation.
In this paper, we introduce a novel modeling technique to reduce the time associated with cycle-accurate simulation of parallel applications deployed on many-core embedded platforms. We introduce an ensemble model based on artificial neural networks that exploits (in the training phase) multiple levels of simulation abstraction, from cycle-accurate to cycle-approximate, to predict the cycle-accurate results for unknown application configurations. We show that high-level modeling can be used to significantly reduce the number of low-level model evaluations provided that a suitable artificial neural network is used to aggregate the results. We propose a methodology for the design and optimization of such an ensemble model and we assess the proposed approach for an industrial simulation framework based on STMicroelectronics STHORM (P2012) many-core computing fabric.
Many application-specific processor design approaches are being proposed and investigated nowadays. All of them aim to cope with the emerging flexibility requirement combined with the best performance efficiency. The Application-Specific Instruction-set Processor (ASIP) design approach is among the most explored, and this across many application domains. However, this concept implies dynamic scheduling of a set of instructions, which generally leads to an overhead related to instruction decoding. To reduce this overhead, other approaches have been proposed using static scheduling of datapath control signals. In this paper, we explore this latter approach and illustrate its benefits through a design case study on MMSE MIMO equalization. The proposed design shares its main architectural choices with a state-of-the-art ASIP for comparison purposes. The obtained results show a significant improvement in execution time while using identical computational resources and supporting the same flexibility parameters.
The dramatic increase in the number of processors, memories and other components on the same chip calls for resource-aware mechanisms to improve performance. This paper proposes four different resource mapping policies for NoC-based MPSoCs that leverage distinct aspects of the parallel nature of the applications and architecture constraints, such as off-chip memory latency. Results show that the use of these policies can improve performance by up to 22.5% on average and, in some cases, depending on the parallel programming model of each application, by up to 32%.
We use FusionSim to characterize the performance of the Rodinia benchmarks on fused and discrete systems. We demonstrate that the speedup due to fusion is highly correlated with the input data size, and that for the benchmarks that benefit most from fusion, a 9.72x speedup is possible for small problem sizes. This speedup reduces to 1.84x with medium or large problem sizes. We study a simple, software-managed coherence solution for the fused system and find that it imposes a minor performance overhead of 2% for most benchmarks and as high as 5% for some. Finally, we develop an analytical model of the performance benefit to be expected from fusion for applications with a simple communication and computation pattern, and show that FusionSim follows the predicted performance trend.
Keywords - CPU and GPU Fusion;
Shrinking transistor geometries, aggressive voltage scaling and higher operating frequencies have negatively impacted the lifetime reliability of embedded multi-core systems. In this paper, a convex optimization-based task-mapping technique is proposed to extend the lifetime of multiprocessor systems-on-chip (MPSoCs). The proposed technique generates mappings for every application enabled on the platform with a variable number of cores. Based on these results, a novel 3D-optimization technique is developed to distribute the cores of an MPSoC among multiple applications enabled simultaneously. Additionally, the reliability of the underlying network-on-chip links is addressed by incorporating the aging of links in the objective function. Our formulations are developed for directed acyclic graphs (DAGs) and synchronous dataflow graphs (SDFGs), making our approach applicable to streaming as well as non-streaming applications. Experiments conducted with synthetic and real-life application graphs demonstrate that the proposed approach extends the lifetime of an MPSoC by more than 30% when applications are enabled individually as well as in tandem.
Fault tolerance and load balancing are critical for executing long-running parallel applications on multicore clusters. This paper addresses both fault tolerance and load balancing on multicore clusters by presenting a novel work-stealing task scheduling framework that supports hardware fault tolerance. In this framework, both transient and permanent faults are detected and recovered at task granularity. We incorporate task-based fault detection and recovery mechanisms into a hierarchical work-stealing scheme to establish the framework, which provides low-overhead fault tolerance and optimal load balancing by fully exploiting task parallelism.
Keywords - fault tolerance; work-stealing; multicore; cluster
This paper proposes a method to determine a priority for applying selective triple modular redundancy (selective TMR) against single-event upsets (SEUs) to achieve cost-effective, reliable implementation of an application circuit on a coarse-grained reconfigurable architecture (CGRA). The priority is determined by estimating the vulnerability of each node in the data flow graph (DFG) of the application circuit. The estimate is a weighted sum of features and parameters of each node in the DFG that characterize the impact of an SEU in that node on the output data. The method requires neither time-consuming placement-and-routing nor extensive fault simulations for the various triplication patterns, which allows us to identify the set of nodes to triplicate to minimize vulnerability under a given area constraint at an early stage of the design flow. The proposed method therefore enables efficient design space exploration of reliability-oriented CGRAs and their applications.
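As an illustration of the scoring-and-selection idea, the sketch below ranks DFG nodes by a weighted feature sum and greedily triplicates the highest-ranked ones under an area budget; the feature names, weights, and cost model are hypothetical stand-ins, not the paper's actual parameters.

```python
def vulnerability(node):
    """Weighted sum of node features approximating SEU impact on the output.
    Feature names and weights are hypothetical illustrations."""
    w_fanout, w_depth, w_active = 0.5, 0.3, 0.2
    return (w_fanout * node["fanout"]
            + w_depth * node["levels_to_output"]
            + w_active * node["activity"])

def select_tmr_nodes(dfg_nodes, area_budget):
    """Greedily triplicate the most vulnerable nodes under an area constraint."""
    ranked = sorted(dfg_nodes, key=vulnerability, reverse=True)
    chosen, used = [], 0
    for node in ranked:
        cost = 3 * node["area"]  # TMR replaces one instance with three (voter cost ignored)
        if used + cost <= area_budget:
            chosen.append(node["name"])
            used += cost
    return chosen
```

Because the score is a closed-form function of per-node features, no placement-and-routing or fault simulation is needed to rank candidates.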
Soft errors have been identified as one of the major challenges for CMOS technology-based computing systems. To mitigate this problem, error recovery is a key component, but it usually incurs a substantial cost, since it must introduce redundancy in either time or space. Consequently, using state-of-the-art recovery techniques can heavily strain the design constraints, which are fairly stringent in embedded system design. In this paper, we propose a HW/SW methodology that generates a processor performing finely configured error recovery targeted at the given design constraints (e.g., performance, area and power). Our methodology employs three application-specific optimization heuristics, which generate an optimized composition and configuration based on two primitive error recovery techniques. The resulting processor combines the selected primitive techniques at the corresponding instruction executions, and is configured to perform error recovery at run time according to the scheme determined at design time. The experimental results show that our methodology can achieve up to nine times higher reliability while maintaining the given constraints, in comparison to the state of the art.
NBTI and HCI are present not only in digital circuits but also in analog circuitry. Integrated circuit amplifiers as used in neural measurement systems (NMS) need to be resilient to degradation, since these systems cannot be replaced easily. A topology-driven design methodology to increase the reliability of amplifiers used for intracortical neural recording is proposed in this work. This approach decreases the degradation of some system performance metrics by a factor of three. It is shown that the degradation of a circuit is highly dependent on the selected current mirror and biasing circuit.
Index Terms - Analog circuits, negative bias temperature instability (NBTI), neural measurement system (NMS), circuit reliability.
Partially reconfigurable systems are increasingly employed in many application fields, including aerospace. SRAM-based FPGAs represent an extremely interesting hardware platform for this kind of system, because they offer flexibility as well as processing power. In this paper we report on the ongoing development of a software flow for the generation of hard macros for on-line testing and diagnosis of permanent faults due to radiation in SRAM-based FPGAs used in space missions. Once faults have been detected and diagnosed, the flow generates fine-grained patch hard macros that mask out the discovered faulty resources, allowing partially faulty regions of the FPGA to remain available for further use.
Keywords - Automatic Test Pattern Generation, Fault Diagnosis; On-Line Testing; Permanent Radiation Effects; SRAM-FPGA
Safety-critical in-vehicle electronic control units (ECUs) demand high levels of determinism and isolation, since they directly influence vehicle behaviour and passenger safety. As modern vehicles incorporate more complex computational systems, ensuring the safety of critical systems becomes paramount. One-to-one redundant units have previously been proposed as measures for evolving critical functions like x-by-wire. However, these may not be viable solutions for power-constrained systems like next-generation electric vehicles. Reconfigurable architectures offer alternative approaches to implementing reliable safety-critical systems using more efficient hardware. In this paper, we present an approach for implementing redundancy in safety-critical in-car systems that uses FPGA partial reconfiguration and a customised bus controller to offer fast recovery from faults. Results show that such an integrated design is better than alternatives that use discrete bus interface modules.
Traditional multi-core designs based on the Network-on-Chip (NoC) paradigm suffer from high latency and power dissipation as the system size scales up, due to the inherent multi-hop nature of communication. Introducing long-range, low-power, high-bandwidth, single-hop links between far-apart cores can significantly enhance the performance of NoC fabrics. In this paper, we propose the design of a small-world network-based NoC architecture with on-chip millimeter (mm)-wave wireless links. The millimeter-wave small-world NoC (mSWNoC) improves the overall latency and energy dissipation characteristics compared to its conventional mesh-based counterpart. The mSWNoC improves the energy dissipation, and hence the thermal profile, even further in the presence of network-level dynamic voltage and frequency scaling (DVFS), without incurring any additional latency penalty.
Keywords - NoC, wireless, mm-wave, small world, DVFS
Balancing cache energy efficiency and reliability is a major challenge for future multicore system design. Supply voltage reduction is an effective tool to minimize cache energy consumption, usually at the expense of an increased number of errors. To achieve substantial energy reduction without degrading reliability, we propose an adaptive fault-tolerant cache architecture, which provides appropriate error control for each cache line based on the number of faulty cells detected at reduced supply voltages. Our experiments show that the proposed approach can improve energy efficiency by more than 25% and the energy-execution time product by over 10%, while improving reliability by up to 4X on the Mean-Error-To-Failure (METF) metric, compared to the next-best solution, at the cost of 0.08%
Keywords - Energy efficiency, fault tolerance, cache, VLSI, multicore.
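The per-line adaptation described above can be sketched as a simple decision rule that maps the observed fault count of a line to an error-control scheme; the thresholds and scheme names below are illustrative assumptions, not the paper's actual configuration.

```python
def protection_for_line(num_faulty_cells):
    """Pick an error-control scheme per cache line from the number of faulty
    cells detected at the reduced supply voltage.
    Thresholds and scheme names are hypothetical illustrations."""
    if num_faulty_cells == 0:
        return "parity"         # detection only: cheapest option
    if num_faulty_cells == 1:
        return "SECDED"         # single-error correction suffices
    if num_faulty_cells <= 3:
        return "multi-bit ECC"  # stronger correction for a few faulty cells
    return "disable"            # too faulty: do not use the line at low voltage
```

Matching protection strength to each line's actual fault count avoids paying the strongest (most expensive) code on every line.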
Many multicore chips today employ advanced power management techniques. Multi-threshold CMOS (MTCMOS) is very effective for reducing standby leakage power. Dynamic voltage scaling and voltage islands, which operate at multiple power-supply voltage levels, minimize dynamic power consumption. Effective defect screening for such chips requires advanced test techniques that target defects in the embedded cores and the power management structures. We describe recent advances in test generation and test scheduling techniques for SoCs that support power switches, voltage islands, and dynamic voltage scaling.
Index Terms - Dynamic power, dynamic voltage scaling, power switches, SoC test scheduling, static power.
This paper discusses how adaptive test techniques can be applied to multi-core RF SoCs, together with the associated design implementation and test challenges. Various techniques specific to RF circuits, covering calibration trims, power management modules, co-existence issues, concurrent testing, and test measurements, are explained, and results on different designs are presented. Together, they highlight the need for and scope of adaptive test for RF circuits, and open a new dimension in the test of multi-core circuits under different constraints of design, test and test equipment.
Keywords: Adaptive test, RF test, multi-core chips, test time optimization.
We study the problem of assigning speeds to resources serving distributed applications with delay, buffer and energy constraints. We argue that the considered problem does not have any straightforward solution due to the intricately related constraints. As a solution, we propose using Real-Time Calculus (RTC) to analyse the constraints and a SATisfiability solver to efficiently explore the design space. To this end, we develop an SMT solver by using the OpenSMT framework and the Modular Performance Analysis (MPA) toolbox. Two key enablers for this implementation are the analysis of incomplete models and generation of conflict clauses in RTC. The results on problem instances with very large decision spaces indicate that the proposed SMT solver performs very well in practice.
Due to a growing need for flexibility, massively parallel Multiprocessor SoC (MPSoC) architectures are currently being developed. This creates a need for parallel software, but poses the problem of deploying that software efficiently on these architectures. The usual practice is to execute the parallel program on the platform with software tracing enabled and to visualize the traces to detect irregular timing behavior. This is error prone, as it relies on software logs and human analysis, and it requires an existing platform. To overcome these issues and automate the process, we propose the joint use of a virtual platform that logs memory accesses at the hardware level and a data-mining approach that automatically reports unexpected instruction timings and the context in which these instructions occur. We demonstrate the approach on a multiprocessor platform running a video decoding application.
Reducing the energy consumption of controllers in vehicles requires sophisticated regulation mechanisms. Better power management can be enabled by allowing the controller to shut down sensors, actuators or embedded control units in a way that keeps the car safe and comfortable for the user, with the goal of optimizing the (average or maximal) energy consumption. This paper proposes an approach to systematically explore the design space of SW/HW mappings to determine energy-optimal deployments. It employs constraint-solving techniques for generating deployment candidates and probabilistic analyses for computing the expected energy consumption of the respective deployment. The feasibility and scalability of the method is demonstrated by several case studies.
In this paper we propose a new method for the analysis of response times in uni-processor real-time systems where task activation patterns may contain sporadic bursts. We use a burst model to calculate how often response times may exceed the worst-case response-time bound obtained while ignoring bursts. This work is of particular interest for dealing with dual-cyclic frames in the analysis of CAN buses. Our approach can handle arbitrary activation patterns and both the static-priority preemptive and non-preemptive scheduling policies. Experiments show the applicability and the benefits of the proposed method.
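For context, the burst-oblivious worst-case bound that this analysis refines is the classic static-priority preemptive response-time fixed point, R_i = C_i + sum over higher-priority tasks j of ceil(R_i / T_j) * C_j. A minimal sketch with an illustrative task set (implicit deadlines assumed):

```python
import math

def response_time(tasks, i):
    """Classic fixed-point response-time iteration for static-priority
    preemptive scheduling.  tasks: list of (C, T) pairs sorted by
    descending priority; returns None if task i misses its deadline (= T)."""
    C, T = tasks[i]
    r = C
    while True:
        # Interference from every higher-priority task j: ceil(r/T_j) releases.
        nxt = C + sum(math.ceil(r / Tj) * Cj for Cj, Tj in tasks[:i])
        if nxt == r:
            return r          # fixed point reached: worst-case response time
        if nxt > T:
            return None       # bound exceeds the deadline: unschedulable
        r = nxt
```

The paper's contribution is on top of this: quantifying how often sporadic bursts make actual response times exceed such a burst-free bound.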
New media processing applications such as image recognition and AR (Augmented Reality) have become practical on embedded systems for automotive, digital-consumer and mobile products. Many-core processors have been proposed to achieve much higher performance than multi-core processors. We have developed a low-power many-core SoC for multimedia applications in 40nm CMOS technology. Within a 210mm2 die, two 32-core clusters are integrated with dynamically reconfigurable processors, hardware accelerators, 2-channel DDR3 I/Fs, and other peripherals. Processor cores in a cluster share a 2MB L2 cache connected through a tree-based Network-on-Chip (NoC). The total peak performance exceeds 1.5TOPS (Tera Operations Per Second). High scalability and low power consumption are accomplished by parallelized firmware for multimedia applications. The SoC performs 1080p 30fps H.264 decoding at about 400mW and 4K2K 15fps super resolution under 800mW.
Keywords - Many-core; Network-on-Chip; VLIW; Low power; Power gating; H.264; Super resolution
This paper describes current practices regarding low-power SoCs aimed at wireless applications.
Keywords - microprocessor, wireless application, gate-level models, DVFS, AVS, voltage stack, power management, low-power
3D stacking is currently seen as a breakthrough technology for improving bandwidth and energy efficiency in multi-core architectures. The expectation is to solve major issues such as external memory pressure and latency while maintaining reasonable power consumption. In this paper, we present some advances in this field of research, starting with memory interface solutions such as a WIDEIO experience on a real chip for solving the DRAM access issue. We explain the integration of a 512-bit memory interface in a Network-on-Chip multi-core framework and show the achievable performance, based on a 65nm prototype integrating 10μm diameter Through Silicon Vias. We then present the potential of new fine-grain 3D stacking technology for power-efficient memory hierarchies. We expose an innovative 3D-stacked multi-cache strategy aimed at lowering memory latency and external memory bandwidth requirements, demonstrating the efficiency of 3D stacking for rethinking architectures to obtain unequalled power efficiency.
We consider the verification of safety (strict serializability and abort consistency) and liveness (obstruction and livelock freedom) for the hybrid transactional memory framework FLEXTM. This framework allows for flexible implementations of transactional memories based on an adaptation of the MESI coherence protocol, and supports both eager and lazy conflict resolution strategies. As in the case of Software Transactional Memories, the verification problem is not trivial, as the number of concurrent transactions, their size, and the number of accessed shared variables cannot be bounded a priori. This complexity is exacerbated by aspects that are specific to hardware and hybrid transactional memories. Our work takes into account intricate behaviours such as cache-line-based conflict detection, false sharing, invisible reads and non-transactional instructions. We carry out the first automatic verification of a hybrid transactional memory and establish, by adopting a small model approach, challenging properties such as strict serializability, abort consistency, and obstruction freedom for both eager and lazy conflict resolution strategies. We also detect an example that refutes livelock freedom. To achieve this, our prototype tool makes use of the latest antichain-based techniques to handle systems with tens of thousands of states.
In 2011, property directed reachability (PDR) was proposed as an efficient algorithm for solving hardware model checking problems. Recent experimentation suggests that it outperforms interpolation-based verification, which had been considered the best known algorithm for this purpose for almost a decade. In this work, we present a generalization of PDR to the theory of quantifier-free formulae over bitvectors (QF_BV), illustrate the new algorithm with representative examples, and provide experimental results obtained with a prototype implementation.
In numerous EDA flows, time-consuming computations are repeatedly applied to sequential circuits. This motivates developing methods to determine what circuits have been processed already by a tool. This paper proposes an algorithm for semi-canonical labeling of nodes in a sequential AIG, allowing problems or sub-problems solved by an EDA tool to be cached with their computed results. This can speed up the tool when applied to designs with isomorphic components or design suites exhibiting substantial structural similarity.
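One way to realise such labeling is iterated structural hashing, where each node's label is a hash of its sorted fanin labels, so structurally identical sub-circuits receive identical labels regardless of node names or fanin order. The toy sketch below is our illustration of the idea, not the paper's exact algorithm (it ignores AIG-specific details such as complemented edges and latches).

```python
def semi_canonical_labels(fanins, outputs):
    """Label each node of a DAG by a hash of the structure beneath it.
    fanins: dict node -> tuple of fanin nodes (empty tuple for inputs);
    outputs: list of output nodes.  Returns a sorted signature tuple that
    is invariant under node renaming and fanin reordering."""
    labels = {}
    def label(n):
        if n not in labels:
            # Sorting child labels makes the label order-invariant;
            # all primary inputs collapse to one label ("semi"-canonical).
            labels[n] = hash(("node", tuple(sorted(label(c) for c in fanins[n]))))
        return labels[n]
    return tuple(sorted(label(o) for o in outputs))
```

Two isomorphic circuits produce the same signature, so the signature can serve as a cache key for previously solved (sub-)problems.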
This paper introduces a new technique for fast computation of the Cone-Of-Influence (COI) of multiple properties. It specifically addresses frameworks where multiple properties belong to the same model and partially or fully share their COIs. To avoid repeated visits of the same circuit sub-graph, it proposes a new algorithm that performs a single topological visit of the variable dependency graph. It also studies mutual relationships among different properties, based on the overlap of their COIs. It finally considers state variable scoring, based on the variables' own COIs and/or their appearance in multiple COIs, as a new statistic for variable sorting and grouping/clustering in various Model Checking algorithms. Preliminary results show the advantages and potential applications of these ideas.
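The single-visit idea can be sketched as a memoised traversal of the variable dependency graph, so sub-cones shared by several properties are computed only once. The version below is an illustrative simplification (an acyclic dependency graph is assumed; the paper's algorithm handles the general sequential case):

```python
def shared_coi(deps, properties):
    """Compute the cone of influence (COI) of each property, memoising
    per-variable cones so shared sub-graphs are visited once.
    deps: var -> set of vars it depends on (assumed acyclic here);
    properties: name -> set of variables the property reads."""
    cone = {}  # var -> frozenset of vars in its cone (including itself)
    def visit(v):
        if v not in cone:
            acc = {v}
            for u in deps.get(v, ()):
                acc |= visit(u)       # sub-cone computed at most once
            cone[v] = frozenset(acc)
        return cone[v]
    return {p: frozenset().union(*[visit(v) for v in vs])
            for p, vs in properties.items()}
```

Overlap statistics between properties then come for free by intersecting the returned sets.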
A new SAT-Based algorithm for symbolic model checking has been gaining popularity. This algorithm, referred to as "Incremental Construction of Inductive Clauses for Indubitable Correctness" (IC3) or "Property Directed Reachability" (PDR), uses information learned from SAT instances of isolated time frames to either prove that an invariant exists, or provide a counter example. The information learned between each time frame is recorded in the form of cubes of the state variables. In this work, we study the effect of extending PDR to use cubes of intermediate variables representing the logic gates in the transition relation. We demonstrate that we can improve the runtime for satisfiable benchmarks by up to 3.2X, with an average speedup of 1.23X. Our approach also provides a speedup of up to 3.84X for unsatisfiable benchmarks.
Conjunctive Normal Form (CNF) representation as used by most modern Quantified Boolean Formula (QBF) solvers is simple and powerful when reasoning about conflicts, but is not efficient at dealing with solutions. To overcome this inefficiency a number of specialized non-CNF solvers were created. These solvers were shown to have great advantages. Unfortunately, non-CNF solvers cannot benefit from sophisticated CNF-based techniques developed over the years. This paper demonstrates how the power of non-CNF structure can be harvested without the need for specialized solvers; in fact, it is easily incorporated into most existing CNF-based QBF solvers using a pre-existing mechanism of cube learning. We demonstrate this using a state-of-the-art QBF solver DepQBF, and experimentally show the effectiveness of our approach.
Modern systems demand high performance as well as high degrees of flexibility and adaptability. Many current applications exhibit dynamic and nonstationary behavior, having certain characteristics in one phase of their execution that change as the applications enter new phases, in a manner unpredictable at design time. To meet the performance requirements of such systems, it is important to have on-line optimization algorithms, coupled with adaptive hardware platforms, that together can adjust to run-time conditions. We propose an optimization technique that minimizes the expected execution time of an application by dynamically scheduling hardware prefetches. We use a piecewise linear predictor to capture correlations and predict which hardware modules will be reached. Experiments show that the proposed algorithm outperforms the previous state of the art, reducing the expected execution time by up to 27% on average.
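Piecewise linear predictors generalise perceptron-style predictors, which learn a weighted vote over a binary history. The minimal perceptron sketch below illustrates the family; the history length, threshold, and update rule are illustrative choices, not the paper's design.

```python
class LinearPredictor:
    """Minimal perceptron-style predictor over a fixed-length binary history
    (an illustrative member of the piecewise linear predictor family)."""
    def __init__(self, history_len, threshold=8):
        self.w = [0] * (history_len + 1)   # weight 0 is the bias
        self.history = [0] * history_len   # most recent outcome last
        self.threshold = threshold         # stop training once confident

    def predict(self):
        # Dot product of weights with history bits mapped to +/-1.
        s = self.w[0] + sum(w * (1 if h else -1)
                            for w, h in zip(self.w[1:], self.history))
        return s >= 0, s

    def update(self, outcome):
        # Perceptron rule: train on mispredictions and low-confidence hits.
        pred, s = self.predict()
        if pred != outcome or abs(s) < self.threshold:
            t = 1 if outcome else -1
            self.w[0] += t
            for i, h in enumerate(self.history):
                self.w[i + 1] += t * (1 if h else -1)
        self.history = self.history[1:] + [1 if outcome else 0]
```

After a short warm-up, such a predictor captures simple correlations (e.g. strictly alternating outcomes) with near-perfect accuracy.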
A multi-mode circuit implements the functionality of a limited number of circuits, called modes, of which at any given time only one needs to be realised. Using run-time reconfiguration of an FPGA, all the modes can be implemented on the same reconfigurable region, requiring only an area that can contain the biggest mode. Typically, conventional run-time reconfiguration techniques generate a configuration for every mode separately. To switch between modes, the complete reconfigurable region is rewritten, which often leads to very long reconfiguration times. In this paper we present a novel, fully automated tool flow that exploits similarities between the modes and uses Dynamic Circuit Specialization to drastically reduce reconfiguration time. Experimental results show that the number of bits rewritten in the configuration memory is reduced by a factor of 4.6x to 5.1x without significant performance penalties.
Different applications exhibit different behavior that cannot be optimally captured by a fixed organization of a VLIW processor. However, by exploiting reconfigurable hardware we can optimize the organization when running different applications. In this paper, we propose a novel way to execute the same binary on different issue-width processors without major hardware modifications. We propose to change the compiler and assembler to ensure correct results. Our experiments show an average slowdown of around 1.3x when compared to binaries compiled for specific issue-widths. This can be further improved to less than 1.09x on average with additional compiler optimizations. Even though the flexibility comes at a price, it can be exploited for many other purposes, such as dynamic performance/energy trade-offs and energy-saving mechanisms, dynamic hardware sharing, and dynamic code insertion for hardware fault detection mechanisms.
Run-time reconfigurable (RTR) FPGAs combine the flexibility of software with the high efficiency of hardware. Still, their potential cannot be fully exploited due to the increased complexity of the design process. Consequently, to enable an efficient design flow, we devise a set of prerequisites to increase the flexibility and reusability of current FPGA-based RTR architectures. We apply these principles to design and implement the RecoBlock SoC platform, whose main characteristics are (1) an RTR plug-and-play IP core whose functionality is configured at run time; (2) flexible inter-block communication configured via software; and (3) built-in buffers to support data-driven streams and inter-process communication. We illustrate the potential of our platform through a tutorial case study that uses an adaptive streaming application to investigate different combinations of reconfigurable arrays and schedules. The experiments underline the benefits of the platform and report resource utilization.
Keywords - reconfigurable architectures; partial and run-time reconfiguration; system-on-chip; adaptivity; embedded systems
Wireless sensor networks (WSNs) are often tailored to single applications to achieve one specific mission. Considering that the same physical phenomenon can be used by multiple applications, the benefit of sharing the WSN infrastructure is obvious in terms of development and deployment cost. However, allocating tasks to the WSN to meet the requirements of all applications while maintaining energy efficiency is very challenging. Introducing reconfigurable nodes in shared sensor networks can improve performance, energy efficiency and flexibility, but it increases system complexity. In this paper, we propose a biologically inspired node configuration scheme for shared reconfigurable sensor networks, named DANCE, which can adapt to a changing environment and efficiently utilize WSN resources. Our experiments show that our scheme reduces energy consumption by up to 76%.
The communication infrastructure is one of the most important components of a multicore system, along with the computing cores and memories. A good interconnect design plays a key role in improving the performance of such systems. In this paper, we introduce a hybrid communication infrastructure that combines a standard bus with our area-efficient, delay-optimized network-on-chip for heterogeneous multicore systems, especially hardware accelerator systems. An adaptive data-communication-based mapping for reconfigurable hardware accelerators is proposed to obtain a low-overhead, low-latency interconnect. Experimental results show that the proposed communication infrastructure and the adaptive mapping achieve a speedup of 2.4x with respect to a similar system using only a bus as interconnect. Moreover, our proposed system reduces energy consumption by 56% compared to the original system.
Emerging memory technologies are explored as potential alternatives to traditional SRAM/DRAM-based memory architecture in future microprocessor designs. Among various emerging memory technologies, Spin-Torque Transfer RAM (STT-RAM) has the benefits of fast read latency, low leakage power, and high density, and therefore has been investigated as a promising candidate for last-level cache (LLC). One of the major disadvantages of STT-RAM is the latency and energy overhead associated with write operations. In particular, a long-latency write operation to an STT-RAM cache may obstruct other cache accesses and result in severe performance degradation. Consequently, mitigation techniques to minimize the write overhead are required in order to successfully adopt this new technology for cache design. In this paper, we propose an obstruction-aware cache management policy called OAP. OAP monitors the cache to periodically detect LLC-obstructing processes and manages the cache accesses from different processes. The experimental results on a 4-core architecture with an 8MB STT-RAM L3 cache show that performance can be improved by 14% on average and up to 42%, with a 64% reduction in energy consumption.
The spin-transfer torque random access memory (STT-RAM) has been widely investigated as a promising candidate to replace static random access memory (SRAM) for on-chip cache memories. However, existing STT-RAM cell designs support only single-port accesses, which limits the memory access bandwidth and constrains system performance. In this work, we propose design solutions that provide dual-port accesses for STT-RAM. The area increase from the additional port is reduced by leveraging a shared source-line structure. We perform a detailed analysis of the performance and reliability degradation caused by dual-port accesses and the corresponding design optimization. We propose two dual-port STT-RAM cell structures, with either two read/write ports (2RW) or one read and one write port (1R/1W). Comparison shows that a 2RW STT-RAM cell consumes only 42% of the area of a dual-port SRAM cell, and the 1R/1W design further reduces the cell area by 7.7% under the same performance target.
In the latest PRAM/DRAM hybrid MLC NAND flash storage systems (NFSS), DRAM is used to temporarily store file system data in order to reduce system response time. To ensure data integrity, super-capacitors are deployed to supply backup power for moving the data from DRAM to NAND flash during power failures. However, the capacitance degradation of super-capacitors severely impairs system robustness. In this work, we propose a low-cost power failure protection scheme that reduces the energy consumption of power failure protection and increases the robustness of NFSS with a PRAM/DRAM hybrid buffer. Our scheme enables the adoption of more reliable regular capacitors to replace super-capacitors as the backup power source. The experimental results show that our scheme can substantially reduce the capacitance budget of the power failure protection circuitry, by 75.1%, with marginal performance and energy overheads.
The nonvolatile processor (NVP) has become an emerging topic in recent years. A conventional NV processor equips each flip-flop with nonvolatile storage for data backup, which yields very fast backup at the cost of significant area overhead. A compression-based architecture (PRLC) solves the area problem, but with a nontrivial increase in backup time. This paper presents a segment-based parallel compression (SPaC) architecture to trade off area against backup speed. Furthermore, we use a hybrid offline/online method to balance the workloads of the different compression modules in SPaC. Experimental results show that SPaC achieves a 76% speedup over PRLC while reducing area by 16% compared with conventional NV processors.
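The segment-based parallel idea can be sketched in software (run-length coding per segment is our illustrative stand-in for the actual compression scheme, and in hardware the per-segment modules would operate concurrently):

```python
def rle_compress(bits):
    """Simple run-length code: a list of [bit, run_length] pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs

def spac_backup(state_bits, n_segments):
    """Split the architectural state into segments and compress each one
    independently, mimicking parallel compression modules."""
    seg_len = -(-len(state_bits) // n_segments)   # ceiling division
    segments = [state_bits[i:i + seg_len]
                for i in range(0, len(state_bits), seg_len)]
    return [rle_compress(seg) for seg in segments]
```

Because each segment is compressed independently, backup latency is set by the slowest segment rather than the whole state, which is the speed/area trade-off the abstract describes.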
The sustainability of a wireless sensor network (WSN) is crucial to its economy and efficiency. While previous works have focused on overcoming energy source limitations through solar energy harvesting, we reveal in this paper that a sensor node's lifespan can also be limited by memory wear-out and battery cycle life. We propose a sustainable sensor node design that takes all three limiting factors into consideration. Our design uses Phase Change Memory (PCM) to solve Flash memory's endurance issue. By leveraging PCM's adjustable write width, we propose a low-cost, fine-grained load tuning technique that allows the sensor node to track the maximum power point (MPP) of the solar panel and reduce the number of discharge/charge cycles on the battery. Our modeling and experiments show that our sustainable sensor node design achieves on average 5.1 years of node lifetime, more than 2x over the baseline.
The computation capacity of a conventional FPGA is directly proportional to the size and expressive power of its Look-Up Table (LUT) resources. Individual LUT performance is limited by transistor switching time and power dissipation, defined by the CMOS fabrication process. In this paper we propose OLUT, an optical core implementation of the LUT, which has the potential for low-latency and low-power computation. In addition, the use of Wavelength Division Multiplexing (WDM) allows parallel computation, which can further increase computation capacity. Preliminary experimental results demonstrate the potential for optically assisted on-chip computation.
Index Terms - silicon photonic architectures, WDM, LUT
Single-layer sheets of graphene show special electrical properties that can enable the next generation of smart ICs. Recent works have demonstrated an electrostatically controlled p-n junction upon which it is possible to design multifunction reconfigurable logic devices that naturally behave as multiplexers. In this work we introduce a stable large-signal Verilog-A model that mimics the behavior of the aforementioned devices. The proposed model, validated through the SPICE characterization of a MUX-based standard cell library we designed as a benchmark, represents a first step towards the integration of Electronic Design Automation tools that can support the design of all-graphene ICs.
Integrating residential photovoltaic (PV) power generation and electrical energy storage (EES) systems into the Smart Grid is an effective way of utilizing renewable power and reducing the consumption of fossil fuels. This has become a particularly interesting problem with the introduction of dynamic electricity pricing models, since electricity consumers can use their PV-based energy generation and EES systems for peak shaving of their power demand profile from the grid and thereby minimize their electricity bill. Due to the characteristics of a realistic electricity price function and the energy storage capacity limitation, the control algorithm for a residential EES system should accurately account for various energy loss components during operation. Hybrid electrical energy storage (HEES) systems are proposed to exploit the strengths of each type of EES element and hide its weaknesses, so as to achieve a combination of performance metrics superior to that of any individual EES component. This paper introduces the problem of how best to utilize a HEES system for a residential Smart Grid user equipped with PV power generation facilities. The optimal control algorithm for the HEES system is developed, which aims at minimizing the total electricity cost over a billing period under a general electricity price function. The proposed algorithm is based on dynamic programming and has polynomial time complexity. Experimental results demonstrate that the proposed HEES system and optimal control algorithm achieve a 73.9% average profit enhancement over baseline homogeneous EES systems.
Keywords - hybrid electrical energy storage system; smart grid; optimal control
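The dynamic-programming structure of such a controller can be sketched for a single idealized storage element (integer state-of-charge levels, lossless storage, no grid feed-in; these are simplifying assumptions for illustration, not the paper's HEES model):

```python
def min_cost(price, net_load, cap, rate):
    """Minimum electricity cost over a billing period via DP.
    price[t]: price in slot t; net_load[t] = demand - PV generation;
    states are integer battery levels 0..cap; action d adds (d > 0)
    or draws (d < 0) energy, with |d| <= rate per slot."""
    INF = float('inf')
    best = [0.0] + [INF] * cap                 # start with an empty battery
    for t, p in enumerate(price):
        nxt = [INF] * (cap + 1)
        for soc in range(cap + 1):
            if best[soc] == INF:
                continue
            for d in range(-min(rate, soc), min(rate, cap - soc) + 1):
                grid = net_load[t] + d          # energy bought from the grid
                cost = best[soc] + p * max(grid, 0)
                if cost < nxt[soc + d]:
                    nxt[soc + d] = cost
        best = nxt
    return min(best)
```

With prices [1, 5] and a net load of 1 per slot, the controller charges one unit in the cheap slot and discharges it in the expensive one, paying 2 instead of the 6 a battery-less user would pay.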
This paper presents a new Driver Assistant System (DAS) using radar signatures. The new system is able, on the one hand, to track multiple obstacles and, on the other, to identify obstacles during vehicle movements. Combining these two functions in the same DAS helps avoid false alarms and makes it possible to generate alarms that take the identification of the obstacles into account. The obstacle tracking process is simplified thanks to the identification stage. Hence, our low-cost FPGA-based System-on-Chip is able to detect, recognize and track a large number of obstacles in a relatively short time. Our experimental results show that a speedup of 32% can be obtained compared to the standard system.
Index Terms - FPGA, Driver Assistance System, Radar signature, MTT, System-on-Chip
This paper presents a fully implantable neural recording system for the simultaneous recording of 128 channels. The electrocorticography (ECoG) signals are sensed with 128 gold electrodes embedded in a 10 μm thick polyimide foil. The signals are picked up by eight amplifier array ICs and digitized with a resolution of 16 bits at 10 kHz. The digitized measurement data is processed in a reconfigurable digital ASIC, which is fabricated in a 0.35 μm CMOS technology and occupies an area of 2.8 × 2.8 mm². After data reduction, the measurement data is fed into a transceiver IC, which transmits the data at up to 495 kbit/s to a base station, using an RF loop antenna on a flexible PCB. The power consumption of 84 mW is delivered via inductive coupling from the base station.
Smart Wireless Body Sensor Nodes (WBSNs) are a novel class of unobtrusive, battery-powered devices allowing the continuous monitoring and real-time interpretation of a subject's bio-signals. One of their most relevant applications is the acquisition and analysis of Electrocardiograms (ECGs). These low-power WBSN designs, while able to perform advanced signal processing to extract information on a subject's heart condition, are usually constrained in terms of computational power and transmission bandwidth. It is therefore beneficial to identify in the early stages of analysis which parts of an ECG acquisition are critical, and to activate detailed (and computationally intensive) diagnosis algorithms only in those cases. In this paper, we introduce and study the performance of a real-time optimized neuro-fuzzy classifier based on random projections, which is able to discern normal from pathological heartbeats on an embedded WBSN. Moreover, it offers high confidence with low computational and memory requirements. Indeed, by focusing on abnormal heartbeat morphologies, we show that a WBSN system can effectively enhance its efficiency, obtaining energy savings of as much as 63% in the signal processing stage and 68% in the subsequent wireless transmission when the proposed classifier is employed.
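The random-projection front-end such a classifier builds on can be sketched as follows (a generic ±1 projection; the neuro-fuzzy classifier itself and the paper's ECG feature set are not reproduced here):

```python
import random

def random_projection(x, k, seed=0):
    """Project a feature vector x onto k dimensions using a random
    +/-1 matrix. Pairwise distances are approximately preserved
    (Johnson-Lindenstrauss), which is why such projections are an
    attractive low-cost dimensionality reduction for classification."""
    rng = random.Random(seed)                 # fixed seed: same matrix each call
    R = [[rng.choice((-1, 1)) for _ in x] for _ in range(k)]
    return [sum(r_ij * x_j for r_ij, x_j in zip(row, x)) for row in R]
```

A classifier on an embedded node would apply the same seeded projection to every heartbeat window, so only the k projected features need to be processed or transmitted.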
Energy harvesting allows low-power embedded devices to be powered from naturally occurring or unwanted environmental energy (e.g. light, vibration, or temperature difference). While a number of systems incorporating energy harvesters are now available commercially, they are specific to certain types of energy source. Energy availability, moreover, can vary over time as well as space. To address this issue, "hybrid" energy harvesting systems combine multiple harvesters on the same platform, but the design of these systems is not straightforward. This paper surveys their design, including the trade-offs affecting their efficiency, applicability, and ease of deployment. This survey, and the taxonomy of multi-source energy harvesting systems that it presents, will be of benefit to designers of future systems. Furthermore, we identify and comment upon current and future research directions in this field.
Photovoltaic (PV) systems are often subject to partial shading that significantly degrades the output power of the whole system. Reconfiguration methods have been proposed to adaptively change the PV panel configuration according to the current partial shading pattern. The reconfigurable PV panel architecture integrates every PV cell with three programmable switches to facilitate PV panel reconfiguration. The additional switches, however, increase the capital cost of the PV system. In this paper, we group a number of PV cells into a PV macro-cell, and the PV panel reconfiguration only changes the connections between adjacent PV macro-cells. The size and internal structure (i.e., the series-parallel connection of PV cells) of all PV macro-cells are the same and are not changed after PV system installation in the field. Determining the optimal size of the PV macro-cell is the result of a trade-off between decreased PV system capital cost and enhanced PV system performance. A larger PV macro-cell reduces the cost overhead, whereas a smaller PV macro-cell achieves better performance. In this paper, we set out to calculate the optimal size of the PV macro-cells such that maximum system performance is achieved subject to an overall system cost limitation. This "design" problem is solved using an efficient search algorithm. In addition, we provide in-field reconfigurability of the PV panel by enabling the formation of series-connected groups of parallel-connected macro-cells. We ensure maximum output power for the PV system in response to any partial shading pattern. This "architecture optimization" problem is solved using dynamic programming.
Movement monitoring using wearable computers has been widely used in healthcare and wellness applications. To reduce the form factor of wearable nodes, which is dominated by battery size, ultra-low-power signal processing is crucial. In this paper, we propose an architecture that can be viewed as a hardware accelerator and employs dynamic time warping (DTW) in a hierarchical fashion. The proposed architecture removes events that are not of interest from the signal processing chain as early as possible, deactivating all remaining modules. We consider tunable parameters such as the sampling frequency and bit resolution of the incoming sensor readings for DTW to balance the trade-off between power consumption and classification precision. We formulate a methodology for determining the optimal set of tunable parameters and provide a solution using an active-set algorithm. We synthesized the architecture in 45 nm CMOS and show that a three-tiered module achieves 98% accuracy within a power budget of 1.23 μW, while a single-level DTW consumes 6.3 μW at the same accuracy. We furthermore propose a fast approximation methodology for determining the total power consumption that runs 3200 times faster while introducing less than 3% error over the original optimization.
Keywords - Hardware Accelerator; Activity Recognition; Granular Decision Making Module; Dynamic Time Warping
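The DTW recurrence used at every tier of such a hierarchy is the classical dynamic program (a textbook formulation for illustration, not the paper's fixed-point hardware datapath):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.
    D[i][j] = local cost + min over the three allowed predecessor
    alignments (insertion, deletion, match)."""
    n, m = len(a), len(b)
    INF = float('inf')
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

A hierarchical deployment would first run this on coarse (low sampling rate, low bit-resolution) signals to reject uninteresting events cheaply, invoking finer-grained matching only on survivors.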
Semiconductor technology evolution enables the design of sensor-based, battery-powered, ultra-low-cost chips (e.g., below 1 C) required for new market segments such as body, urban-life and environment monitoring. Caches have been shown to be the largest energy and area consumer in those chips. This paper proposes a novel hybrid-operation (high Vcc, ultra-low Vcc), single-Vcc-domain cache architecture based on replacing energy-hungry bitcells (e.g., 10T) with more energy-efficient and smaller cells (e.g., 8T) enhanced with Error Detection and Correction (EDC) features for high reliability and performance predictability. Our architecture is shown to largely outperform existing solutions in terms of energy and area.
Index Terms - Caches, Low Energy, Reliability, Real-Time
This paper describes a low-overhead software-based fault tolerance approach for shared memory multicore systems. The scheme is implemented at user-space level and requires almost no changes to the original application. Redundant multithreaded processes are used to detect soft errors and recover from them. Our scheme ensures that the execution of the redundant processes is identical even in the presence of non-determinism due to shared memory accesses, and it provides a very low overhead mechanism to achieve this. Moreover, it implements a fast error detection and recovery mechanism. The overhead incurred by our approach ranges from 0% to 18% for the selected benchmarks, which is lower than comparable systems published in the literature.
Soft errors and errors caused by intermittent faults are a major concern for modern processors. In this paper we provide a drastically different approach for fault tolerant scheduling (FTS) of tasks in such processors. Traditionally in FTS, error detection is performed implicitly and concurrently with task execution, and associated overheads are incurred as increases in software run-time or hardware area. However, such embedded error detection (EED) techniques, e.g., watchdog processor assisted control flow checking, only provide approximately 70% error coverage [1, 2]. We propose the idea of utilizing straightforward explicit output comparison (EOC) which provides nearly 100% error coverage. We construct a framework for utilizing EOC in FTS, identify new challenges and tradeoffs, and develop a new off-line scheduling algorithm for EOC. We show that our EOC based approach provides higher error coverage and an average performance improvement of nearly 10% over EED-based FTS approaches, without increasing resource requirements. In our ongoing research we are identifying a richer set of ways of applying EOC, by itself and in conjunction with EED, to obtain further improvements.
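The EOC principle itself is simple to sketch (the function name, duplicated execution on one host, and third-run arbitration are illustrative assumptions; in the paper the copies run on separate processing resources under an off-line schedule):

```python
def run_with_eoc(task, inp, recovery_task=None):
    """Explicit output comparison: execute the task twice, compare the
    outputs, and re-execute to arbitrate on a mismatch. A transient
    fault in one copy is detected because the outputs disagree."""
    out1 = task(inp)
    out2 = task(inp)
    if out1 == out2:
        return out1                       # no error detected
    # mismatch: a third execution arbitrates (simple majority policy)
    out3 = (recovery_task or task)(inp)
    return out3 if out3 in (out1, out2) else None
```

Comparing complete outputs is what gives EOC near-complete error coverage; the scheduling question the paper studies is when to place the redundant executions so that deadlines still hold.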
In this paper, we propose a new, low-hardware-overhead solution for permanent fault detection at the micro-architecture/instruction level. The proposed technique is based on an ultra-reduced instruction set co-processor (URISC) that, in its simplest form, executes only one Turing-complete instruction: the subleq instruction. Thus, any instruction on the main core can be redundantly executed on the URISC using a sequence of subleq instructions, and the results can be compared, also on the URISC, to detect faults. A number of novel software and hardware techniques are proposed to decrease the performance overhead of online fault detection while keeping the error detection latency bounded, including: (i) URISC routines and hardware support to check both control and data flow instructions; (ii) checking only a subset of instructions in the code based on a novel check window criterion; and (iii) URISC instruction set extensions. Our experimental results, based on FPGA synthesis and RTL simulations, illustrate the benefits of the proposed techniques.
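The subleq semantics the URISC executes can be sketched in a few lines (the interpreter and the three-instruction addition routine below illustrate subleq's Turing completeness; they are not the paper's co-processor implementation):

```python
def run_subleq(mem, pc=0, max_steps=10_000):
    """Interpreter for the single-instruction subleq machine:
    subleq a, b, c  means  mem[b] -= mem[a]; if mem[b] <= 0 goto c.
    A negative branch target halts the machine."""
    steps = 0
    while pc >= 0 and steps < max_steps:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
        steps += 1
    return mem

# y += x synthesized from three subleq instructions
# (addresses: x at 9, y at 10, scratch Z at 11; target -1 halts)
x, y = 5, 7
mem = [9, 11, 3,    # Z -= x  (Z becomes -x); fall through to 3
       11, 10, 6,   # y -= Z  (i.e. y = y + x)
       11, 11, -1,  # Z -= Z = 0, branch taken -> halt
       x, y, 0]
run_subleq(mem)     # mem[10] now holds 12
```

Redundant checking then amounts to translating a main-core instruction into such a subleq sequence and comparing the results.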
Embedded systems consist of hardware and software and are ubiquitous in safety-critical and mission-critical fields. The increasing integration density of modern digital circuits makes embedded systems increasingly vulnerable to transient faults. Techniques to improve fault tolerance are often implemented either in hardware or in software. In this paper, we focus on synthesis techniques that improve the fault tolerance of embedded systems considering both hardware and software. A greedy algorithm is presented which iteratively assesses the fault tolerance of a processor-based system and decides which components of the system have to be hardened, choosing from a set of existing techniques. We evaluate the algorithm in a simple case study using a Traffic Collision Avoidance System (TCAS).
Index Terms - Fault tolerance, Formal methods, Synthesis, Optimization
Aggressive technology scaling has made transistor aging a new challenge to circuit reliability. Transistor aging causes gradual degradation of circuit performance and eventually leads to timing errors. In this paper, a dynamic self-adaptive method is proposed to protect circuits from the influence of transistor aging. It makes use of aging detection sensors and a self-adaptive clock-scaling cell. The aging sensors automatically wake up the clock-scaling cell to shift the clock phase of the circuit when an error occurs; the timing error is then masked by a second sampling with the shifted clock. The method is simulated in HSPICE using a 65 nm technology. The evaluation results show that this method provides effective error resilience with no impact on the normal function of the circuit, and it improves the MTTF by 1.16x with 22.73% circuit overhead on average when the phase difference is 20% of the clock cycle.
Information technology is now an indispensable pillar of modern society. CMOS technologies, which lay the foundation for all digital platforms, are however experiencing a major inflection point due to a slowdown in voltage scaling. The net result is that power is emerging as the key design constraint for all platforms, from embedded systems to datacenters. This tutorial presents emerging design paradigms, from embedded multicore SoCs to server processors for scale-out datacenters, based on mobile
Keywords - MPSoC, ARMv8 architecture, hardware virtualization, IOMMU, scale-out processors, total cost of ownership
UTBB FD-SOI technology has become mainstream within STMicroelectronics, with the objective of serving a wide spectrum of mobile multimedia products. This breakthrough technology brings a significant improvement in performance and power saving, complemented by an excellent responsiveness to power-management design techniques for energy-efficiency optimization. The symbiosis between process and design is key to this achievement, enabling, already at the 28 nm node, real differentiation in terms of flexibility, cost and energy efficiency with respect to any process available on the
Keywords: UTBB FD-SOI, CMOS, high-performance, low-power, mobile application, SoC, energy efficiency, Back-Bias
Electronic systems of the future require a very high bandwidth communications infrastructure within the system, so that the massive amount of available compute power can be interconnected to realize powerful advanced electronic systems. Electronic inter-connects between 3D chip-stacks, as well as intra-connects within 3D chip-stacks, will soon approach data rates of 100 Gbit/s. Hence, the question to be answered is how to efficiently design the communications infrastructure within electronic systems. This paper addresses approaches and results for building this infrastructure for future electronics.
Improvement of computer system performance is constrained by the well-known memory wall and power wall: it has been recognized that the memory architecture and the interconnect architecture are becoming the overwhelming bottlenecks in computer performance. Disruptive technologies, such as emerging non-volatile memory (NVM) technologies, 3D integration, and optical interconnects, are envisioned as promising future memory and interconnect technologies that can fundamentally change the landscape of future computer architecture design, with profound impact. This invited survey paper gives a brief introduction to these future memory and interconnect technologies and discusses the opportunities and challenges they present for future computer system designs.
Integrated Modular Avionics (IMA) architecture has been widely adopted by the avionics industry due to its strong temporal and spatial isolation capability for safety-critical real-time systems. The fundamental challenge to integrating an existing set of single-core IMA partitions into a multi-core system is to ensure that the isolation of the partitions will be maintained without incurring huge redevelopment and recertification costs. To address this challenge, we developed an optimized partition scheduling algorithm which considers exclusive regions to achieve the synchronization between partitions across cores. We show that the problem of finding the optimal partition schedule is NP-complete and present a Constraint Programming formulation. In addition, we relax this problem to find the minimum number of cores needed to schedule a given set of partitions and propose an approximation algorithm which is guaranteed to find a feasible schedule of partitions if there exists a feasible schedule of exclusive regions.
Applications that stream multiple video/audio or video+audio clips are being implemented in embedded devices. A Picture-in-Picture (PiP) application is one such scenario, in which two videos are played simultaneously. Although the PiP application is handled very efficiently in televisions and personal computers by providing maximum quality of service to the multiple streams, it is a difficult task in devices with resource constraints. In order to efficiently utilize the resources, it is essential to derive the processor cycles needed by multiple video streams so that they are displayed with a prespecified quality constraint. Therefore, we propose a network-calculus-based formal framework to help schedule multiple media streams in the presence of buffer constraints. Further, our framework also provides a schedulability analysis condition to check whether the multimedia streams can be scheduled such that a prespecified quality constraint is satisfied with the available service. We present this framework in the context of a PiP application, but it is applicable in general to multiple media streams. The results obtained using the formal framework were further verified using experiments involving system simulation.
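The kind of bound such a framework computes can be illustrated with the classical network-calculus backlog bound (a generic textbook result used here for illustration, not the paper's PiP-specific analysis; names and parameter values are ours):

```python
def backlog_bound(arrival, service, horizon):
    """Worst-case buffer backlog: sup over t of alpha(t) - beta(t) for an
    arrival curve alpha and a service curve beta, evaluated here over a
    finite discrete horizon for illustration."""
    return max(arrival(t) - service(t) for t in range(horizon + 1))

# Token-bucket arrivals (burst b, rate r) served by a rate-latency server
# (rate R, latency T): the classical bound is b + r*T when r <= R.
b, r, R, T = 5.0, 1.0, 2.0, 3
alpha = lambda t: b + r * t
beta = lambda t: R * max(0, t - T)
backlog_bound(alpha, beta, horizon=100)   # -> 8.0, i.e. b + r*T
```

In a PiP-style setting, such a bound is what certifies that a stream's playout buffer never overflows under the service the processor can guarantee.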
This paper presents a methodology based on mathematical programming for the priority assignment of processes and messages in event-triggered systems with tight end-to-end real-time deadlines. For this purpose, the problem is converted into a Quadratically Constrained Quadratic Program (QCQP) and addressed with a state-of-the-art solver. The formulation includes preemptive as well as non-preemptive schedulers and avoids cyclic dependencies that may lead to intractable real-time analysis problems. For problems with stringent real-time requirements, the proposed mathematical programming method is capable of finding a feasible solution efficiently where other approaches suffer from a poor scalability. In case there exists no feasible solution, an algorithm is presented that uses the proposed method to find a minimal reason for the infeasibility which may be used as a feedback to the designer. To give evidence of the scalability of the proposed method and in order to show the clear benefit over existing approaches, a set of synthetic test cases is evaluated. Finally, a large realistic case study is introduced and solved, showing the applicability of the proposed method in the automotive domain.
In this work we present an experimental environment for electronic system-level design based on the OpenMP programming paradigm. Fully compliant with the OpenMP standard, the environment allows the generation of heterogeneous hardware/software systems exhibiting good scalability with respect to the number of threads and limited performance overheads. Based on well-established OpenMP benchmarks, the paper also presents comparisons with high-performance software implementations as well as with previous proposals oriented to pure hardware translation. The results confirm that the proposed approach achieves improvements in terms of both efficiency and scalability.
Automatic C-to-RTL (C2RTL) synthesis can greatly benefit hardware design for streaming applications. However, stringent throughput/area constraints, and especially the demand for power optimization at the system level, are challenging for existing C2RTL synthesis tools. This paper presents a power-aware C2RTL framework using voltage-frequency islands (VFIs) to address these challenges. Given the throughput, area, and power constraints, an MILP-based approach is introduced to synthesize C code into an RTL design by simultaneously considering three design knobs, i.e., partitioning, parallelization, and VFI assignment, to obtain the globally optimal solution. A heuristic solution is also discussed to deal with the scalability challenge facing the MILP formulation. Experimental results based on four well-known multimedia applications demonstrate the effectiveness of both solutions.
In this paper, we investigate how to use the complete flexibility of P-circuits, which realize a Boolean function by projecting it onto overlapping subsets given by a generalized Shannon decomposition. It is known how to compute the complete flexibility of P-circuits, but the algorithms proposed so far for its exploitation do not guarantee to find the best implementation, because they cast the problem as the minimization of an incompletely specified function. Instead, here we show that to explore all solutions we must set up the problem as the minimization of a Boolean relation, because there are don't care conditions that cannot be expressed by single cubes. In the experiments we report major improvements with respect to the previously published results.
In the IC industry, chip design cycles are becoming more compressed while designs themselves are growing in complexity. These trends necessitate efficient methods for handling late-stage engineering change orders (ECOs) to the functional specification, often in response to errors discovered after much of the implementation is finished. Past ECO synthesis algorithms have typically treated ECOs as functional errors and applied error diagnosis techniques to solve them. However, error diagnosis methods are primarily geared towards finding a single change and, moreover, tend to be computationally complex. In this paper, we propose a unique methodology that can systematically incorporate human intuition into the ECO process. Our methodology finds a set of directly substitutable points, known as functional correspondences, between the original implementation and the new specification by using name-preserving synthesis and user hints, to diminish the size of the ECO problem. On average, our approach can reduce the size of logic changes by 94% from those reported in the current literature. We then incorporate our logic ECO changes into an incremental physical synthesis flow to demonstrate its usability in an industrial setting. Our ECO synthesis methodology is evaluated on high-performance industrial designs. Results indicate that post-ECO worst negative slack (WNS) improved by 14% and total negative slack (TNS) by 46% over pre-ECO.
Keywords - Engineering Change Order, Logic Synthesis, Physical Synthesis
Soft errors have become a critical reliability issue in nanoscale integrated circuits, especially in sequential circuits, where a latched error can propagate for many cycles and affect many outputs at different times. Retiming is a structural operation that relocates registers in a circuit without changing its functionality. In this paper, the effect of retiming on the soft error rate (SER) of a sequential circuit is investigated, considering both logic masking and timing masking. A minimum-observability retiming problem under error-latching window constraints is formulated to reduce the SER of the circuit, and an efficient algorithm is proposed to solve it optimally. Experimental results show on average a 32.7% reduction in SER over the original circuits and a 15% improvement over the existing method.
We present a novel class of decision diagrams, called Biconditional Binary Decision Diagrams (BBDDs), that enable efficient logic synthesis for XOR-rich circuits. BBDDs are binary decision diagrams in which Shannon's expansion is replaced by the biconditional expansion. Since the biconditional expansion is based on the XOR/XNOR operations, XOR-rich logic circuits are efficiently represented and manipulated with canonical Reduced and Ordered BBDDs (ROBBDDs). Experimental results show that ROBBDDs have 37% fewer nodes on average compared to traditional ROBDDs. To exploit this opportunity in logic synthesis for XOR-rich circuits, we developed a BBDD-based One-Pass Synthesis (OPS) methodology. The BBDD-based OPS is capable of harnessing the potential of novel XOR-efficient devices, such as ambipolar transistors. Experimental results show that our logic synthesis methodology reduces the number of ambipolar transistors by 49.7% on average with respect to a state-of-the-art commercial logic synthesis tool. Considering CMOS technology, the BBDD-based OPS reduces the device count by 31.5% on average compared to the commercial synthesis tool.
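The biconditional expansion underlying BBDDs can be checked directly in a small self-contained sketch (function names and the 3-input XOR example are ours):

```python
from itertools import product

def biconditional_eval(f, bits, v, w):
    """Evaluate f(bits) via the biconditional expansion on variables v, w:
    f = XNOR(xv, xw) * f|xv=xw  +  XOR(xv, xw) * f|xv=!xw."""
    eq = list(bits); eq[v] = bits[w]          # cofactor where xv == xw
    neq = list(bits); neq[v] = 1 - bits[w]    # cofactor where xv != xw
    x = bits[v] ^ bits[w]                     # XOR(xv, xw) selects the branch
    return (1 - x) * f(*eq) + x * f(*neq)

# the expansion is exact: compare against direct evaluation of a 3-input XOR
xor3 = lambda a, b, c: a ^ b ^ c
assert all(biconditional_eval(xor3, bits, 0, 1) == xor3(*bits)
           for bits in product((0, 1), repeat=3))
```

Because each BBDD node branches on the equality of a variable pair rather than on a single variable, XOR-dominated functions collapse into far fewer nodes than under Shannon's expansion.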
In this work we advocate the adoption of Binary Decision Diagrams (BDDs) for storing and manipulating time-series datasets. We first propose a generic BDD transformation which identifies and removes 50% of all BDD edges without any loss of information. Next, we optimize the core operation for adding samples to a dataset and characterize its complexity. We identify time-range queries as one of the core operations executed on time-series datasets, and describe explicit Boolean function constructions that aid in efficiently executing them directly on BDDs. We exhibit significant space and performance gains when applying our algorithms to synthetic and real-life biosensor time-series datasets collected from field trials.
In the nanometer era, runtime variations due to workload-dependent voltage and temperature variations, as well as transistor aging, introduce remarkable uncertainty and unpredictability into nanoscale VLSI designs. Considering short-term and long-term workload-dependent runtime variations at design time, together with the interdependence of the various parameters, remains a major challenge. Here, we propose a static timing analysis framework that accurately captures the combined effects of various workload-dependent runtime variations occurring at different time scales, making the link between system-level runtime effects and circuit-level design. The proposed framework is fully integrated with an existing commercial EDA toolset, making it scalable to very large designs. We observe that, for benchmark circuits, treating each aspect independently and ignoring their intrinsic interactions is optimistic and results in considerable underestimation of timing margins.
The mesh interconnection network has been preferred by the Network-on-Chip (NoC) community due to its simple implementation, high bandwidth and overall scalability. Most existing mesh-based NoC designs operate the mesh at the same or a lower clock speed than the processing elements (PEs). Recently, a new source-synchronous ring-based NoC architecture has been proposed, which runs significantly faster than the PEs and offers significantly higher bandwidth and lower communication latency; its authors implement the NoC topology as a mesh of rings, which occupies the same area as a mesh. In this work, we evaluate two alternative source-synchronous ring-based NoC topologies, called the ring of stars (ROS) and the spine with rings (SWR), which occupy a much smaller area and provide better performance in terms of communication latency compared to a state-of-the-art mesh. In our proposed topologies, the clock and the data NoC are routed in parallel, yielding a fast, synchronous, robust design. Our design allows the PEs to extract a low-jitter clock from the high-speed ring clock by division. The area and performance of these ring-based NoC topologies are quantified. Experimental results on synthetic traffic show that the new ring-based NoC designs can provide significantly lower latency (up to 4.6x) compared to a state-of-the-art mesh. The proposed floorplan-friendly topologies use fewer buffers (up to 50% fewer) and less wire length (up to 64.3% less) compared to the mesh. Depending on the performance and area desired, a NoC designer can select among the topologies presented.
The emergence of power-efficient heterogeneous NoCs presents an intriguing challenge in NoC reliability, particularly due to aging degradation. To effectively tackle this challenge, this work presents a dynamic routing algorithm that exploits the architecture-level criticality of network packets while routing. Our proposed framework uses a Wearout Monitoring System (to track the NBTI effect) and architecture-level criticality information to create a routing policy that restricts aging degradation with minimal impact on system-level performance. Compared to the state-of-the-art BRAR (Buffered-Router Aware Routing), our best scheme achieves 38%, 53% and 29% improvements in network latency, system performance and Energy Delay Product per Flit (EDPPF) overheads, respectively.
Networks-on-Chip (NoCs) are a key component of new many-core architectures, from both the performance and reliability standpoints. Unfortunately, continuous scaling of CMOS technology poses severe concerns regarding failure mechanisms such as NBTI and stress migration. Process variation makes the scenario even harder, decreasing device lifetime and performance predictability during chip fabrication. This paper presents a novel cooperative sensor-wise methodology to reduce NBTI degradation in the network-on-chip (NoC) virtual channel (VC) buffers, considering process variation effects as well. The changes introduced to the reference NoC model exhibit an area overhead below 4%. Experimental validation is obtained using a cycle-accurate simulator considering both real and synthetic traffic patterns. We compare our methodology to the best sensor-less round-robin approach, used as the reference model. The proposed sensor-wise strategy achieves up to 26.6% and 18.9% activity factor improvement over the reference policy on synthetic and real traffic patterns, respectively. Moreover, a net NBTI Vth saving of up to 54.2% is shown against the baseline NoC that does not account for NBTI.
Network interfaces (NIs) are used in multi-core systems, where they connect processors, memories, and other IP cores to a packet-switched Network-on-Chip (NoC). The functionality of an NI is to bridge between the read/write transaction interfaces used by the cores and the packet-streaming interface used by the routers and links in the NoC. The paper addresses the design of an NI for a NoC that uses time division multiplexing (TDM). By keeping the essence of TDM in mind, we have developed a new area-efficient NI micro-architecture. The new design completely eliminates the need for FIFO buffers and credit-based flow control, resources which are reported to account for 50-85% of the area in existing NI designs. The paper discusses the design considerations, presents the new NI micro-architecture, and reports area figures for a range of implementations.
Index Terms - Multiprocessor interconnection networks; Realtime systems; Time division multiplexing;
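The TDM principle this NI relies on can be sketched in a few lines. The sketch below is illustrative only, not the paper's micro-architecture; the slot table and core names are invented:

```python
# Illustrative sketch (not the paper's RTL): a TDM slot table decides,
# per clock cycle, which core may inject a word into the network.
# Because each core injects only in its reserved slot, words never
# contend, so no FIFO buffering or credit-based flow control is needed.

SLOT_TABLE = ["core0", "core1", "core0", "core2"]  # hypothetical schedule

def owner_of_cycle(cycle: int) -> str:
    """Return which core owns the injection slot in a given cycle."""
    return SLOT_TABLE[cycle % len(SLOT_TABLE)]

def simulate(pending, cycles):
    """Drain per-core pending words according to the TDM schedule."""
    delivered = []
    for c in range(cycles):
        core = owner_of_cycle(c)
        if pending.get(core):
            delivered.append((c, core, pending[core].pop(0)))
    return delivered

log = simulate({"core0": ["a", "b"], "core1": ["x"]}, cycles=4)
print(log)  # core0 injects in cycles 0 and 2, core1 in cycle 1
```

Because the schedule statically guarantees contention-free injection, buffering and credit exchange become unnecessary, which is the property the NI design above exploits.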
Network congestion is a critical issue for memory parallelism in network-based manycore systems, where multiple memories can be accessed simultaneously, so a congestion-aware method is needed to deal with it. In this paper, we present a streamlined method to reduce network congestion. The idea is to use global congestion information as a metric in network interfaces to reduce the congestion level of highly congested areas. Network interfaces connected to memory modules are equipped with an adaptive scheduler that uses the global congestion information to reduce additional traffic to congested areas. Experimental results with synthetic test cases demonstrate that an on-chip network utilizing the proposed adaptive scheduler delivers up to 23% improvement in average latency.
Keywords: Network-on-Chip, Congestion-Aware Scheduler, Network Interfaces
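The scheduling idea can be sketched as follows. This is a hypothetical simplification: the region names, the threshold, and the two-bucket policy are assumptions, not the paper's algorithm:

```python
# Hypothetical sketch: a memory-side network-interface scheduler that
# defers replies heading toward congested regions, using globally
# collected per-region congestion levels.

def schedule(replies, congestion, threshold=0.7):
    """Order pending replies: destinations in lightly loaded regions first.

    replies    : list of (dest_region, payload)
    congestion : dict region -> load in [0, 1] (global congestion info)
    """
    cool = [r for r in replies if congestion.get(r[0], 0.0) < threshold]
    hot = [r for r in replies if congestion.get(r[0], 0.0) >= threshold]
    return cool + hot  # hot-region traffic is deferred, easing congestion

order = schedule([("north", "p1"), ("south", "p2")],
                 {"north": 0.9, "south": 0.2})
print(order)  # the reply toward the congested "north" region goes last
```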
Designed by STMicroelectronics for the embedded market, the STxP70 processor is a small but extensible processor: designers have the possibility to define tightly-coupled extensions that can be reused in different designs. We explain why this modularity has a strong impact on the toolchain, detail the hardware/software flows and give results for two extensions.
We present a power optimization methodology that provides a fast and accurate power model for programmable architectures. The approach is based on a new tool that estimates power consumption from a register transfer level (RTL) module description, activity files and technology library. It efficiently provides an instruction-level accurate power model and allows design space exploration for the register file. We demonstrate a 19% improvement for a standard RISC processor.
A fast and accurate statistical method that estimates the leakage power consumption of CMOS digital circuits at gate level is demonstrated. Means, variances and correlations of logic gate leakages are extracted at the library characterization step and used for subsequent statistical computation over the circuit. In this paper, the methodology is applied to an eleven-thousand-cell ST test IP. The circuit leakage analysis is 400 times faster than a single fast-SPICE corner analysis, while providing coherent results.
Index Terms- Static Power, 32nm, leakage variability, correlation coefficients, covariance method, statistical leakage estimation
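The covariance method named in the index terms can be illustrated with a small worked example; the gate statistics below are invented:

```python
import math

# Illustrative covariance-method sketch (values are made up): total
# circuit leakage statistics from per-gate means, standard deviations,
# and pairwise correlation coefficients characterized at library level.

def total_leakage(means, sigmas, corr):
    """Mean and sigma of the summed leakage of correlated gates.

    corr[i][j] is the correlation coefficient between gates i and j.
    """
    mu = sum(means)
    var = 0.0
    for i in range(len(means)):
        for j in range(len(means)):
            var += corr[i][j] * sigmas[i] * sigmas[j]
    return mu, math.sqrt(var)

# Two identical gates, fully correlated: the sigmas add linearly.
mu, sigma = total_leakage([1.0, 1.0], [0.1, 0.1], [[1, 1], [1, 1]])
print(mu, sigma)  # 2.0 0.2
```

With zero correlation the variances (not the sigmas) would add, which is why ignoring correlations can badly misestimate the spread.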
We developed a method that predicts the required number of cores for executing threads in the near future on a many-core processor. It is designed for low power consumption without performance degradation. The evaluation results confirm that the proposed method is effective on a 32-core processor.
Powertrain controllers are automotive applications that impose real-time constraints on software treatments based on the angular position of the teeth in the engine. These constraints depend on the engine speed and can be as short as 100 μs. A time-triggered approach provides a predictable and reproducible execution of real-time systems, but it cannot cope with such tight constraints and does not allow directly specifying the temporal behavior of the system in terms of angles. The contribution of this work is to show how the ability of the PharOS technology to combine several time domains (time-triggered and angle-triggered) allows designing and executing powertrain controllers in a deterministic way on multi-core architectures. To this end, we present a prototype of a subset of a powertrain controller from Delphi based on PharOS.
Index Terms - time-triggered, angle-triggered, automotive powertrain controller.
JPEG encoding is a commonly performed application that is also very processing- and memory-intensive, and thus ill-suited for low-power embedded systems with narrow data buses and small amounts of memory. An embedded system may also need to adapt its application in order to meet varying system constraints such as power, energy, time or bandwidth. We present here an extremely compact JPEG encoder that uses very few system resources and is capable of dynamically changing its Quality of Service (QoS) on the fly. The application was tested on a NIOS II core and on AVR and PIC24 microcontrollers with excellent results.
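One common way to realize on-the-fly quality adaptation in a compact JPEG encoder is to rescale the quantization tables from a single quality knob. The sketch below follows the widely used IJG libjpeg scaling convention; the paper's encoder may use a different scheme:

```python
# IJG-style quality scaling of a JPEG quantization table (an assumption
# about how QoS adaptation could be done, not the paper's method).

BASE_LUMA_Q = [16, 11, 10, 16, 24, 40, 51, 61]  # first row of the standard table

def scale_quant_table(base, quality):
    """Scale a quantization table for a given quality setting (1..100)."""
    q = max(1, min(100, quality))
    scale = 5000 // q if q < 50 else 200 - 2 * q
    return [max(1, min(255, (v * scale + 50) // 100)) for v in base]

print(scale_quant_table(BASE_LUMA_Q, 50))  # identity at quality 50
print(scale_quant_table(BASE_LUMA_Q, 90))  # smaller steps -> higher quality
```

Larger quantization steps discard more coefficients, so the same encoder trades image quality for time, memory and bandwidth by swapping one small table.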
Defects in TSVs due to fabrication steps decrease the yield and reliability of 3D stacked ICs, hence these defects need to be screened early in the manufacturing flow. Before wafer thinning, TSVs are buried in silicon and cannot be mechanically contacted, which severely limits test access. Although TSVs become exposed after wafer thinning, probing on them is difficult because of TSV dimensions and the risk of probe-induced damage. To circumvent these problems, we propose a non-invasive method for pre-bond TSV test that does not require TSV probing. We use open TSVs as capacitive loads of their driving gates and measure the propagation delay by means of ring oscillators. Defects in TSVs cause variations in their RC parameters and therefore lead to variations in the propagation delay. By measuring these variations, we can detect resistive open and leakage faults. We exploit different voltage levels to increase the sensitivity of the test and its robustness against random process variations. Results on fault detection effectiveness are presented through HSPICE simulations using realistic models for 45nm CMOS technology. The estimated DfT area cost of our method is negligible for realistic dies.
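The measurement principle can be illustrated with a first-order RC model; all component values below are invented, and a real test would calibrate the detection tolerance against the expected process spread:

```python
# Back-of-the-envelope sketch of the pre-bond TSV test principle: a
# defect changes the open TSV's RC load, which shifts the gate delay
# seen by the ring oscillator; comparing periods flags resistive opens.

def rc_delay(r_driver, c_tsv, r_open=0.0):
    """First-order propagation delay of a gate driving an open TSV (0.69*RC)."""
    return 0.69 * (r_driver + r_open) * c_tsv

def tsv_is_faulty(period_good, period_meas, tolerance=0.05):
    """Flag a TSV whose oscillation period deviates beyond process spread."""
    return abs(period_meas - period_good) / period_good > tolerance

good = rc_delay(r_driver=1e3, c_tsv=50e-15)                      # healthy TSV
open_defect = rc_delay(r_driver=1e3, c_tsv=50e-15, r_open=5e3)   # resistive open
print(tsv_is_faulty(good, open_defect))  # True: the delay grew ~6x
```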
We propose a new method to derive a small number of LFSR seeds for Logic BIST to cover all detectable faults as a first-order satisfiability problem involving extended theories. We use an SMT (Satisfiability Modulo Theories) formulation to efficiently combine the tasks of test generation and seed computation. We make use of this formulation in an iterative seed-reduction flow which enables the "chaining" of hard-to-test faults using very few seeds. Experimental results demonstrate that up to 79% reduction in the number of seeds can be achieved.
Index Terms - LFSR Reseeding, Logic BIST, Test generation, Satisfiability Modulo Theories.
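For readers unfamiliar with reseeding, the sketch below shows what a single seed buys: one stored value expands on chip into a stream of pseudo-random test patterns. The polynomial and width here are arbitrary examples; the paper computes the actual seeds with an SMT solver:

```python
# Minimal Fibonacci LFSR sketch (example polynomial, not the paper's):
# one stored seed expands into many pseudo-random test patterns.

def lfsr_patterns(seed, taps, width, count):
    """Expand one seed into `count` patterns from a Fibonacci LFSR."""
    state = seed
    out = []
    for _ in range(count):
        out.append(state)
        fb = 0
        for t in taps:                 # XOR the tapped state bits
            fb ^= (state >> t) & 1
        state = ((state << 1) | fb) & ((1 << width) - 1)
    return out

pats = lfsr_patterns(seed=0b1001, taps=(3, 0), width=4, count=5)
print([f"{p:04b}" for p in pats])
```

Seed computation then amounts to choosing seeds whose expanded pattern streams jointly detect all targeted faults, which is what the SMT formulation above encodes.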
This paper presents new scan solutions with low latency overhead and on-line monitoring support. Shadow flip-flops with scan design are associated with system flip-flops in order to (a) provide concurrent delay fault detection and (b) avoid the scan chain insertion of system flip-flops. A mixed scan architecture is proposed which places flip-flops with shadow scan design at the end of timing-critical paths and flip-flops with standard scan at non-critical locations. In order to preserve system controllability during test, system flip-flops with shadow scan can be set in scan mode and selectively reset before switching to capture mode. It is shown that shadow scan design with asynchronous set and reset may have a lower latency overhead than standard scan design. A shadow scan solution is proposed which, in addition to concurrent delay fault detection, provides simultaneous scan and capture capability.
Keywords - shadow scan; dynamic variations; delay faults; monitoring; concurrent fault detection
The goal of fault diagnosis is to identify a set of candidate faults, or fault locations, that explain an observed faulty output response of a chip. In fault diagnosis procedures that are based on specific fault models, a scoring algorithm can be used for defining sets of candidate faults that include the faults with the highest scores. This paper shows that it is possible to capture the underlying concepts that make fault scoring effective through a graph, which is referred to as the dominance graph. With a test set T used for fault diagnosis, the graph represents the dominance relations between the equivalence classes obtained with respect to T. The observed response Robs of a chip-under-diagnosis is associated with an equivalence class Cobs, and Cobs is added to the dominance graph. A candidate fault set is defined based on the dominance relations that are added to the graph due to the addition of Cobs. Certain properties of these dominance relations point to the type of the defect present in the chip, and the most appropriate algorithm for defining a set of candidate faults based on it.
Power switches are increasingly becoming the dominant leakage power reduction technique for sub-100nm CMOS technologies. Hence, a fast and effective DFT solution for test and diagnosis of power switches is much needed to facilitate faster identification of potential faults and their locations. In this paper, we present a novel, coarse-grain DFT solution enabling divide-and-conquer based test and diagnosis of power switches. The proposed solution benefits from exponential time savings compared to previously reported solutions: it requires only (2⌈log₂ m⌉ + 3) clock cycles in the worst case for test and diagnosis of m-segment power switches. These time savings are further substantiated by an effective discharge circuit design, which eliminates the possibility of false tests and significantly reduces the charge and discharge times. We validated the effectiveness of our proposed solution through SPICE simulations on a number of ISCAS benchmark circuits, synthesized using 90nm gate libraries.
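The worst-case cycle count can be checked quickly; logarithmic growth in the number of segments m is what makes the divide-and-conquer scheme attractive compared to a linear per-segment test:

```python
import math

# Quick check of the worst-case test time for m power-switch segments:
# a divide-and-conquer (binary-search style) pass needs
# 2*ceil(log2(m)) + 3 clock cycles, i.e. logarithmic rather than linear.

def worst_case_cycles(m_segments: int) -> int:
    return 2 * math.ceil(math.log2(m_segments)) + 3

for m in (8, 64, 1024):
    print(m, worst_case_cycles(m))  # 8 -> 9, 64 -> 15, 1024 -> 23
```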
Many cyber-physical systems comprise several control applications sharing communication and computation resources. The design of such systems requires special attention due to the complex timing behavior that can lead to poor control quality or even instability. The two main requirements of control applications are: (1) robustness and, in particular, stability and (2) high control quality. Although it is essential to guarantee stability and provide a certain degree of robustness even in the worst-case scenario, a design procedure which merely takes the worst-case scenario into consideration can lead to a poor expected (average-case) control quality, since the design is solely tuned to a scenario that occurs very rarely. On the other hand, considering only the expected quality of control does not necessarily provide robustness and stability in the worst-case. Therefore, both the robustness and the expected control quality should be taken into account in the design process. This paper presents an efficient and integrated approach for designing high-quality cyber-physical systems with robustness guarantees.
In this paper we study distributed automotive control applications whose tasks are mapped onto different ECUs communicating via a switched Ethernet network. As traditional automotive communication buses like CAN, FlexRay, LIN and MOST are gradually reaching their performance limits because of the increasing complexity of automotive architectures and applications, Ethernet-based in-vehicle communication systems have attracted a lot of attention in recent times. However, currently there is very little work on systematic timing analysis for Ethernet which is important for its deployment in safety-critical scenarios like in an automotive architecture. In this work, we propose a compositional timing analysis technique that takes various features of switched Ethernet into account like network topology, frame priorities, communication delay, memory requirement on switches, performance, etc. Such an analysis technique is particularly suitable during early design phases of automotive architectures and control software deployment. We demonstrate its use in analyzing mixed-criticality traffic patterns consisting of messages from performance-oriented control loops and timing-sensitive real-time tasks. We further evaluate the tightness of the obtained analytical bounds with an OMNeT++ based network simulation environment, which involves long simulation time and does not provide formal guarantees.
During the life cycle of a cyber-physical system, it is sometimes necessary to upgrade a working controller with a new, but unverified, one which provides better performance or additional functionality. To make sure that system invariants are not broken because of bugs in the new controller, an architecture is used in which both controllers are implemented on the platform, and a supervisor process checks that the actions of the new controller keep the system within its safe states. If an invariant may be violated, the supervisor switches over to the old controller that ensures correct behavior, but possibly degraded performance. A key question in the design of such supervisors is the switching strategy: when should the supervisor reinstate the new controller after it has switched to the old one? In general, one would prefer to use the new controller as much as possible, provided it does not violate safety. However, arbitrarily switching back to the new controller can cause the system to become unstable, even when each controller in isolation ensures stability. We provide a supervisor synthesis procedure that uses a simple counting strategy for the supervisor. Our synthesized supervisor ensures that switching between the controllers ensures stability of the system, while maintaining its safety properties and providing a lower bound on the use of the new controller. We prove the correctness of the strategy and show on an example that it can provide close to optimal use of the new controller against many disturbance scenarios.
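A minimal version of such a counting strategy might look as follows. This is a hypothetical sketch: in the paper the count is derived by the synthesis procedure so that stability is provably preserved, whereas here it is just a parameter:

```python
# Hypothetical sketch of a counting switch-back strategy: after falling
# back to the old, trusted controller, the supervisor waits a fixed
# number of steps before reinstating the new controller, which bounds
# how often switching can occur.

class Supervisor:
    def __init__(self, dwell_steps: int):
        self.dwell = dwell_steps   # steps to stay on the old controller
        self.counter = 0
        self.active = "new"

    def step(self, invariant_ok: bool) -> str:
        if self.active == "new":
            if not invariant_ok:       # new controller about to break safety
                self.active = "old"
                self.counter = self.dwell
        else:
            self.counter -= 1
            if self.counter <= 0:      # dwell time elapsed: try new again
                self.active = "new"
        return self.active

sup = Supervisor(dwell_steps=2)
trace = [sup.step(ok) for ok in (True, False, True, True, True)]
print(trace)  # ['new', 'old', 'old', 'new', 'new']
```

The dwell count is what prevents the rapid back-and-forth switching that could destabilize the plant even when each controller is individually stabilizing.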
In event-triggered control systems, events occur aperiodically. For the real-time analysis of such systems, an appropriate approximation of the event stimulation is necessary. Upper bounds have already been found for event-triggered systems; until now, lower bounds have been assumed to be zero in the real-time analysis of event-triggered systems. This work derives an approximated lower bound representing the maximum inter-sampling time. The bounds depend on the control system and the event-generating mechanism. The beneficial effect is shown by analyzing an event-triggered control system in a real-time analysis framework.
Networked control systems are a well-known sub-set of cyber-physical systems in which the plant is controlled by sending commands through a digital packet-based network. Current control networks provide advanced channel access mechanisms to guarantee low delay on a limited fraction of packets (low-delay class) while the other packets (un-protected class) experience a higher delay which increases with channel utilization. We investigate the extension of model predictive control to choose both the command value and its assignment to one of the two classes according to the predicted state of the plant and the knowledge of network condition. Experimental results show that more commands are assigned to the low-delay class when either the tracking error is high or the network condition is bad.
Automotive software mostly consists of a set of applications controlling the vehicle dynamics, engine and many other processes or plants. Since automotive systems design is highly cost driven, an important goal is to maximize the number of control applications to be packed onto a single processor or electronic control unit (ECU). Current design methods start with a controller design step, where the sampling period and controller gain values are decided based on given control performance objectives. However, operating systems (OS) on the ECU (e.g., ERCOSek) are usually pre-configured and offer only a limited set of sampling periods. Hence, a controller is implemented using an available sampling period, which is the shorter period closest to the one determined in the controller design step. However, this increases the load on the ECU (i.e., the processor runs the controller more often than what is actually required by design). This reduces the number of applications that can be mapped, and increases costs of the system. To overcome this predicament, we propose a multirate controller, which switches between multiple available sampling periods offered by the OS on the ECU. Apart from meeting all control objectives, this avoids the unnecessary ECU overload resulting from always sampling at a constant, higher rate.
One of the popular structural health monitoring (SHM) applications in both the automotive and aeronautic fields is the non-destructive localization of impacts in plate-like structures. The aim of this paper is to develop a miniaturized, self-contained and low-power device for automated impact detection that can be used in a distributed fashion without central coordination. The proposed device connects an array of four piezoelectric transducers, bonded to the plate and capable of detecting the guided waves generated by an impact, to an STM32F4 board equipped with an ARM Cortex-M4 microcontroller and an IEEE 802.15.4 wireless transceiver. The wave processing and the localization algorithm are implemented on-board and optimized for speed and power consumption. In particular, the localization of the impact point is obtained by cross-correlating the signals related to the same event acquired by the different sensors in the warped frequency domain. Finally, the performance of the whole system is analysed in terms of localization accuracy and power consumption, showing the effectiveness of the proposed implementation.
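Stripped of the warped-frequency-domain machinery, the localization principle reduces to estimating arrival-time differences between sensor pairs. A plain time-domain cross-correlation sketch, with invented signals:

```python
# Simplified sketch of the localization principle: the lag maximizing
# the cross-correlation between two sensor signals gives the difference
# in wave arrival times (the paper correlates in the warped frequency
# domain to handle dispersive guided waves).

def best_lag(a, b, max_lag):
    """Lag of b relative to a that maximizes their cross-correlation."""
    def corr(lag):
        return sum(a[i] * b[i - lag]
                   for i in range(len(a)) if 0 <= i - lag < len(b))
    return max(range(-max_lag, max_lag + 1), key=corr)

pulse = [0, 0, 1, 2, 1, 0, 0, 0]
delayed = [0, 0, 0, 0, 1, 2, 1, 0]   # same wavefront, 2 samples later
print(best_lag(delayed, pulse, max_lag=4))  # 2: second sensor heard it later
```

With the time differences from several sensor pairs and the known wave speed, the impact point follows from hyperbolic triangulation.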
Counterfeiting is no longer limited to just fashion or luxury goods; the phenomenon has now reached electronic components, whose failure represents a high risk to the safety and security of human communities. One way for the semiconductor (SC) industry to fight against counterfeiting of electronic parts is to add technological innovation at the component level itself. The target is to enable product authentication in a fast and reliable way. Because semiconductor manufacturing is a complex and delicate operation producing highly complex products which are sensitive to many environmental factors, any introduction of changes in its production, including the implementation of anti-counterfeiting (A/C) technologies, must undergo thorough testing and qualification steps. This is mandatory to ensure compliance with the strict delivery requirements and the quality and reliability levels the industry has established, in line with the product performance specifications. This paper aims to explain the comprehensive requirements specification developed by members of the semiconductor and related industries in Europe for adding authentication technology solutions into IC packages. It also describes the qualification processes and testing plans to implement the most adequate and effective A/C technology. One of the main challenges in this A/C task is to make sure that the added A/C feature in electronic components does not create any additional reliability or failure issue, nor introduce additional risks.
Keywords - Anti-counterfeiting technologies, authentication, remarking, re-packaging, component counterfeiting, failure analysis, failure prevention, reliability testing
Counterfeiting of goods and electronic devices is a growing problem that has a huge economic impact on the electronics industry. Sometimes the consequences are even more dramatic, when critical systems start failing due to the use of counterfeit lower-quality components. Hardware Intrinsic Security (HIS), i.e. security built on the unique electronic fingerprint of a device, offers the potential to reduce the counterfeiting problem drastically. In this paper we show how HIS can be used to prevent various forms of counterfeiting and over-production. HIS technology can also be used to bind software or user data to specific hardware devices, which provides additional security to both software and hardware vendors as well as to consumers using HIS-enabled products. Besides showing the benefits of HIS, we also provide an extensive overview of the results (both scientific and industrial) that Intrinsic-ID has achieved studying and implementing HIS.
Designing sustainable energy policies heavily impacts economic development, environmental resource management and social acceptance. There are four main steps in the policy-making process: planning, environmental assessment, implementation and monitoring. We focus here on the first three steps, which are performed ex-ante, and describe them as tailored to the energy policy process. We also propose enabling technologies for implementing a decision support system for energy policy making.
The world is facing several challenges that must be dealt with in the coming years, such as efficient energy management, the need for economic growth, and the security and quality of life of its inhabitants. The increasing concentration of the world population in urban areas puts cities at the center of these preoccupations and makes them important actors in the world's sustainable development strategy. ICT has a substantial potential to help cities respond to the growing demands for more efficiency, sustainability, and quality of life, thus making them "smarter". Smartness is directly proportional to "awareness". Cyber-physical systems can extract awareness information from the physical world and process this information in the cyber-world. Thus, a holistic integrated approach, from the physical to the cyber-world, is necessary for a successful and sustainable smart city outcome. This paper introduces important research challenges that we believe will be important in the coming years and provides guidelines and recommendations to achieve self-aware smart cities.
Keywords - Cyber-physical systems, Autonomic computing, Self-aware systems, Smart city
The recent research efforts in smart grids and residential power management are oriented toward pervasively monitoring the power consumption of appliances in domestic and non-domestic buildings. Knowing the status of a residential grid is fundamental to keeping reliability levels high, while real-time monitoring of electric appliances is important to minimize power waste in buildings and to lower the overall energy cost. Wireless Sensor Networks (WSNs) are a key enabling technology for this application field because they consist of low-power, non-invasive and cost-effective intelligent sensor devices. We present a wireless current sensor node (WCSN) for measuring the current drawn by single appliances. The node can self-sustain its operations by harvesting energy from the monitored current. Two AAA batteries are used only as a secondary power supply to guarantee a fast start-up of the system. An active ORing subsystem automatically selects the suitable power source, minimizing the power losses typical of the classic diode configuration. The node harvests energy when the power consumed by the device under measurement is in the range 10 W to 10 kW, which corresponds to a current range of 50 mA to 50 A drawn directly from the mains. Finally, the node features a low-power, 32-bit microcontroller for data processing and a wireless transceiver to send data via the IEEE 802.15.4 standard protocol.
Index Terms - Wireless sensor networks, smart metering, energy harvesting, active ORing, energy measuring.
Transaction level models (TLMs) can use temporal decoupling to increase the simulation speed. However, there is a lack of modeling support to time the temporally decoupled TLMs. In this paper, we propose a timing estimation mechanism for TLMs with temporal decoupling. This mechanism features an analytical model and novel delay formulas. Concepts such as resource usage and availability are used to derive the delay formulas. Based on them, a fast scheduling algorithm resolves resource conflicts and dynamically determines the timing of concurrent transaction sequences. Experiments show that the delay estimation formulas are capable of capturing the timing effects of resource conflicts. At the same time, the overhead of the scheduling algorithm is very low, hence the simulation speed remains high.
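The notion of resource availability used by such delay formulas can be sketched as follows; this is an illustrative simplification, not the paper's formulas:

```python
# Illustrative sketch: estimate the delay of a transaction sequence on a
# shared resource by tracking when the resource next becomes available,
# so overlapping requests are serialized and their waiting time is added
# to the annotated delay.

def schedule(requests):
    """requests: list of (arrival_time, duration) on one shared resource.

    Returns per-request completion times including conflict delays.
    """
    available = 0.0
    done = []
    for arrival, duration in requests:
        start = max(arrival, available)   # wait if the resource is busy
        available = start + duration
        done.append(available)
    return done

# Two transactions collide: the second one waits 5 time units.
print(schedule([(0, 10), (5, 10)]))  # [10.0, 20.0]
```

Because the bookkeeping is a single max-and-add per transaction, such a scheduler adds almost no overhead to the temporally decoupled simulation.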
The timing predictability of embedded systems with hard real-time requirements is fundamental for guaranteeing their safe usage. With the emergence of multicore platforms this task became very challenging. In this paper, a model-checking based approach will be described which allows us to guarantee timing bounds of multiple Synchronous Data Flow Graphs (SDFG) running on shared-bus multicore architectures. Our approach utilizes Timed Automata (TA) as a common semantic model to represent software components (SDF actors) and hardware components of the multicore platform. These TA are explored using the UPPAAL model-checker for providing the timing guarantees. Our approach shows a significant precision improvement compared with the worst-case bounds estimated based on maximal delay for every bus access. Furthermore, scalability is examined to demonstrate analysis feasibility for small parallel systems.
High-level architecture modeling languages, such as the Architecture Analysis & Design Language (AADL), are gradually being adopted in the design of embedded systems so that design choice verification, architecture exploration, and system property checking can be carried out as early as possible. This paper presents our recent contributions to clock-based timing analysis and validation of software architectures specified in AADL. In order to avoid semantic ambiguities of AADL, we mainly consider the AADL features related to real-time and logical time properties. We endow them with a semantics in the polychronous model of computation; this semantics is briefly reviewed. The semantics enables timing analysis, formal verification and simulation. In addition, thread-level scheduling based on affine clock relations is also briefly presented. A tutorial avionic case study, provided by C-S, has been adopted to illustrate our overall approach.
Keywords - AADL; MDE; Polychrony; timing analysis
Modern chip designs are getting more and more complex. To fulfill tight time-to-market constraints, third-party blocks and parts from previous designs are reused. However, these are often poorly documented, making it hard for a designer to understand the code. Therefore, automatic approaches are required which extract information about the design and support developers in understanding the design. In this paper we introduce a new dynamic data flow analysis tuned to automate design understanding. We present the use of the approach for feature localization and for understanding the design's data flow. In the evaluation, our analysis improves feature localization by reducing the uncertainty by 41% to 98% compared to a previous approach using coverage metrics.
A known approach to improve the timing accuracy of an untimed or loosely timed TLM model is to add timing annotations into the code and to reduce the number of costly context switches using temporal decoupling, meaning that a process can go ahead of the simulation time before synchronizing again. Our current goal is to apply temporal decoupling to the TLM platform of a heterogeneous many-core SoC dedicated to high performance computing. Part of this SoC communicates using classic memory-mapped buses, but it can be extended with hardware accelerators communicating using FIFOs. Whereas temporal decoupling for memory-based transactions has been widely studied, FIFO-based communications raise issues that have not been addressed before. In this paper, we provide an efficient solution to combine temporal decoupling and FIFO-based communications.
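The core difficulty can be illustrated with a toy model: under temporal decoupling a producer may run ahead in local time, so a FIFO pop must not return a token that, in simulated time, has not been produced yet. The class below is an assumption-laden sketch, not the paper's solution:

```python
# Toy model of a FIFO under temporal decoupling: tokens carry the local
# time at which they were produced, and a consumer only sees tokens
# whose production timestamp is at or before its own local time.

class TimedFifo:
    def __init__(self):
        self.tokens = []                 # (produced_at, value)

    def push(self, local_time, value):
        self.tokens.append((local_time, value))

    def pop(self, local_time):
        """Return the oldest token already produced at `local_time`."""
        if self.tokens and self.tokens[0][0] <= local_time:
            return self.tokens.pop(0)[1]
        return None                      # consumer must wait / synchronize

f = TimedFifo()
f.push(local_time=30, value="tok")      # producer ran ahead to t=30
print(f.pop(local_time=10))             # None: not visible yet at t=10
print(f.pop(local_time=30))             # 'tok'
```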
Modeling languages such as UML and SysML have received significant attention over the last years. They allow for an abstract description of systems even in the absence of a precise implementation or a hardware/software partitioning. Additionally considering textual constraints, for example provided by means of OCL, makes it possible to automatically check the specified systems, e.g. for consistency of the structure or reachability of certain system states. However, for the majority of verification tasks, not the entire model has to be considered. In this work, we propose an approach that automatically determines reduced system models, i.e. system descriptions that only include model elements which are relevant for the considered verification task. Considering reduced models eases comprehension for the designer and supports incremental design and verification schemes. Most importantly, it improves the efficiency of the applied formal verification engine. Experiments demonstrate that even small reductions in the model lead to significant accelerations in the run time of the verification engine.
The use of modeling languages such as UML or SysML makes it possible to formally specify and verify the behavior of digital systems even in the absence of a specific implementation. However, for each modeling method and verification task, a separate verification solution usually has to be applied today. In this paper, a methodology is envisioned that aims at stopping this "inflation" of different verification approaches and instead employs a generic one. For this purpose, a given specification as well as the verification task shall be transformed into a basic model which itself is specified by means of a generic modeling language. Then, a range of automatic reasoning engines shall uniformly be applied to perform the actual verification. A feasibility study demonstrates the applicability of the envisioned approach.
We explore a miniature sensor node that could be placed in an environment and would interrogate, take decisions, and transmit autonomously and seamlessly without the need for a battery. With the system completely powered by an energy harvester for autonomous operation, power management becomes crucial. In this paper, we propose an ultra-low-power management circuit implemented in 0.18 µm CMOS technology. Given the stringent power requirements and the very limited power offered by energy harvesters, the proposed circuit provides a nanowatt power management scheme. Using post-layout simulation, we have evaluated the power consumption of the proposed power management unit (PMU) and report results that compare favorably to the state of the art.
In this paper, a bio-inspired technique for finding the regions of highest visual importance within an image is proposed for reducing power consumption in modern liquid crystal displays (LCDs) that utilize a 2D light-emitting diode (LED) backlighting system. The conspicuity map generated from this neuromorphic saliency model, along with an adaptive dimming method, is applied to the backlighting array to reduce the luminance of the regions of least interest as perceived by a human viewer. Corresponding image compensation is applied to the saliency-modulated image to minimize distortion and retain the original image quality. Experimental results show that on average 65% of display power can be saved when the original display system is integrated with a low-overhead real-time hardware implementation of the saliency model.
Keywords - FPGA and ASIC design; LED; LCD; system level power management
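The dimming-plus-compensation loop described above can be sketched as follows (the dimming floor and per-zone saliency values are illustrative assumptions, not the paper's neuromorphic model): each backlight zone is dimmed in proportion to its saliency, and the corresponding pixels are scaled up to compensate, clipped at the displayable maximum.

```python
MIN_DIM = 0.3  # assumed floor so low-saliency zones stay legible

def dim_factor(saliency):
    """Map a zone's saliency in [0, 1] to a backlight duty factor in [MIN_DIM, 1]."""
    return MIN_DIM + (1.0 - MIN_DIM) * saliency

def compensate(pixel, factor):
    """Scale a normalized pixel value up so perceived brightness is preserved."""
    return min(1.0, pixel / factor)

def zone_power(saliency_map):
    """Relative backlight power: mean duty factor over all zones."""
    factors = [dim_factor(s) for s in saliency_map]
    return sum(factors) / len(factors)

# A mostly non-salient frame: only one of four zones draws full power,
# so the backlight runs at about half of its nominal power.
power = zone_power([1.0, 0.1, 0.0, 0.1])
```

The clipping in `compensate` is where the distortion the abstract mentions comes from: pixels that are already bright in a heavily dimmed zone cannot be fully compensated.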
Power gating is one of the most effective solutions available to reduce leakage power. However, power gating is not practically usable in an active mode due to the overheads of inrush current and data retention. In this work, we propose a data-retained power gating (DRPG) technique which enables power gating of flip-flops during active mode. More precisely, we combine clock gating and power gating techniques, with the flip-flops being power-gated during clock masked periods. We introduce a retention switch which retains data during the power gating. With the retention switch, correct logic states and functionalities are guaranteed without additional control circuitry. The proposed technique can achieve significant active-mode leakage reduction over conventional designs with small area and performance overheads. In studies with a 65nm foundry library and open-source benchmarks, DRPG achieves up to 25.7% active-mode leakage savings (11.8% savings on average) over conventional designs.
On-chip physical thermal sensors play a vital role in accurately estimating the full-chip thermal profile. Placing physical sensors such that both the number of thermal sensors and the temperature estimation errors are minimized is therefore important for the on-chip dynamic thermal management of today's high-performance microprocessors. In this paper, we present a new systematic thermal sensor placement algorithm. Unlike traditional thermal sensor placement algorithms, which explore only temperature information, the new placement method takes advantage of functional unit power information by exploiting the correlation of power estimation errors among functional blocks. The new power-driven placement algorithm applies correlation clustering to determine both the locations and the number of sensors automatically such that the temperature estimation errors are minimized. Experimental results on a dual-core architecture show that the new thermal sensor placements yield more accurate full-chip temperature estimation than the uniform and k-means based placement approaches.
By integrating multiple processing units and memories on a single chip, multiprocessor systems-on-chip (MPSoCs) can provide higher performance per unit energy and lower cost per function to applications with growing complexity. To stay within the power budget, power gating is widely used to reduce leakage power. However, it introduces significant power/ground (P/G) noise, threatening the reliability of MPSoCs. Traditional methods rely on reinforced circuits or fixed protection strategies to reduce the P/G noise caused by power gating, at the cost of significant area, power, and performance overheads. In this paper, we propose a systematic approach to actively alleviating P/G noise using the parasitic capacitance of on-chip memories through a sensor network on-chip (SENoC). We utilize the parasitic capacitance of on-chip memories as dynamic decoupling capacitance to suppress P/G noise and develop a detailed HSPICE model for the study. SENoC not only monitors and reports P/G noise but also coordinates processing units and memories to alleviate such transient threats at run time. Extensive evaluations show that, compared with traditional methods, our approach saves 11.7% to 62.2% in energy consumption and achieves 13.3% to 69.3% performance improvement for different applications and MPSoCs of different scales. We implement the circuit details of our approach and show its low area and energy overheads.
Cycle life of a battery largely varies according to the battery operating conditions, especially the battery temperature. In particular, batteries age much faster at high temperature. Extensive experiments have shown that the battery temperature varies dramatically during continuous charge or discharge process. This paper introduces a forced convection cooling technique for the batteries that power a portable system. Since the cooling fan is also powered by the same battery, it is critical to develop a highly effective, low power-consuming solution. In addition, there is a fundamental tradeoff between the service time of a battery equipped with fans and the cycle life of the same battery. In particular, as the fan speed is increased, the power dissipated by the fan goes up and hence the full charge capacity of the battery is lost at a faster rate, but at the same time, the battery temperature remains lower and hence the battery longevity increases. This is the first work that formulates the adaptive thermal management problem for batteries (ATMB) in portable systems and provides a systematic solution for it. A hierarchical algorithm combining reinforcement learning at the lower level and dynamic programming at the upper level is proposed to derive the ATMB policy.
Keywords - battery system; adaptive thermal management; forced convection cooling;
This paper presents a unique rotary oscillator array (ROA) topology - the sparse ROA (SROA). The SROA eliminates the need for the redundant rings of a typical, mesh-like rotary topology, optimizing the global distribution network of this resonant clocking technology. To this end, a design methodology is proposed for SROA construction based on the distribution of the synchronous components. The methodology eliminates the redundant rings of the ROA and reduces the tapping wirelength, leading to a power saving of 32.1%. Furthermore, a skew control function is implemented in the SROA design methodology as part of the optimization of the connections among tapping points and subtree roots. This control function leads to a clock skew reduction of 47.1% compared to a square-shaped ROA network design, which is verified through HSPICE.
Improving the circuit realization of known quantum algorithms with CAD techniques has benefits for quantum experimentalists. In this paper, we address the problem of synthesizing a given k-input, m-output lookup table (LUT) as a reversible circuit. This problem has interesting applications in the famous Shor's number-factoring algorithm and in quantum walks on sparse graphs. For LUT synthesis, our approach targets the number of control lines in multiple-control Toffoli gates to reduce synthesis cost. To achieve this, we propose a multi-level optimization technique for reversible circuits that benefits from shared cofactors. To reuse output qubits and/or zero-initialized ancillae, we un-compute intermediate cofactors. Our experiments reveal that the proposed LUT synthesis has a significant impact on reducing the size of modular exponentiation circuits for Shor's quantum factoring algorithm, oracle circuits in quantum walks on sparse graphs, and the well-known MCNC benchmarks.
Keywords - Lookup tables; Logic synthesis; Reversible circuits; Shor's quantum number-factoring algorithm; Binary welded tree.
This paper demonstrates a fully functional hardware and software design for a 3D stacked multi-core system for the first time. Our 3D system is a low-power 3D Modular Multi-Core (3D-MMC) architecture built by vertically stacking identical layers. Each layer consists of cores, private and shared memory units, and communication infrastructures. The system uses shared memory communication and Through-Silicon-Vias (TSVs) to transfer data across layers. A serialization scheme is employed for inter-layer communication to minimize the overall number of TSVs. The proposed architecture has been implemented in HDL and verified on a test chip targeting an operating frequency of 400MHz with a vertical bandwidth of 3.2Gbps. The paper first evaluates the performance, power and temperature characteristics of the architecture using a set of software applications we have designed. We demonstrate quantitatively that the proposed modular 3D design improves upon the cost and performance bottlenecks of traditional 2D multi-core design. In addition, a novel resource pooling approach is introduced to efficiently manage the shared memory of the 3D stacked system. Our approach reduces the application execution time significantly compared to 2D and 3D systems with conventional memory sharing.
Spin-Transfer Torque RAM (STT-RAM) has been extensively studied in recent years. Recent work proposed to improve the write performance of STT-RAM by relaxing the retention time of the STT-RAM cell's magnetic tunnel junction (MTJ). Unfortunately, the frequent refresh operations of such volatile STT-RAM can dissipate significant extra energy. In addition, refresh operations can severely conflict with normal read/write operations and result in degraded cache performance. This paper proposes Cache Coherence Enabled Adaptive Refresh (CCear) to minimize refresh operations for volatile STT-RAM. Through novel modifications to the cache coherence protocol, CCear effectively minimizes the number of refresh operations on volatile STT-RAM. Full-system simulation results show that CCear approaches the performance of an ideal refresh policy with negligible overhead.
In this paper we address lower-level issues related to 3D inter-die memory repair in an attempt to evaluate the actual potential of this approach for current and foreseeable technology developments. We propose several implementation schemes for both inter-die row and column repair and evaluate their impact in terms of area and delay. Our analysis suggests that current state-of-the-art TSV dimensions allow inter-die column repair schemes at the expense of a reasonable area overhead. For row repair, however, most memory configurations require TSV dimensions to scale down by at least one order of magnitude in order to make this approach a viable candidate for 3D memory repair. We also performed a theoretical analysis of the implications of the proposed 3D repair schemes on the memory access time, which indicates that no substantial delay overhead is expected and that many delay-versus-energy-consumption tradeoffs are possible.
Thermomechanical stress is considered one of the most challenging problems in three-dimensional integrated circuits (3D ICs), due to the mismatch in thermal expansion coefficients between the through-silicon vias (TSVs) and the silicon substrate, and the presence of elevated thermal gradients. To address the stress issue, we propose a comprehensive solution that combines design-time and run-time techniques for the relief of thermomechanical stress and the associated reliability issues. A sophisticated TSV stress-aware floorplanning policy is proposed to minimize the possibility of wafer cracking and interfacial delamination. In addition, a run-time thermal management scheme effectively eliminates large thermal gradients between layers. The experimental results show that the reliability of 3D designs can be significantly improved due to the reduced TSV thermal load and the elimination of mechanically damaging thermal cycling patterns.
Split manufacturing of integrated circuits (IC) is being investigated as a way to simultaneously alleviate the cost of owning a trusted foundry and eliminate the security risks associated with outsourcing IC fabrication. In split manufacturing, a design house (with a low-end, in-house, trusted foundry) fabricates the Front End Of Line (FEOL) layers (transistors and lower metal layers) in advanced technology nodes at an untrusted high-end foundry. The Back End Of Line (BEOL) layers (higher metal layers) are then fabricated at the design house's trusted low-end foundry. Split manufacturing is considered secure (prevents reverse engineering and IC piracy) as it hides the BEOL connections from an attacker in the FEOL foundry. We show that an attacker in the FEOL foundry can exploit the heuristics used in typical floorplanning, placement, and routing tools to bypass the security afforded by straightforward split manufacturing. We developed an attack where an attacker in the FEOL foundry can connect 96% of the missing BEOL connections correctly. To overcome this security vulnerability in split manufacturing, we developed a fault analysis-based defense. This defense improves the security of split manufacturing by deceiving the FEOL attacker into making wrong connections.
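The attack described above exploits the fact that placement and routing tools keep connected pins close together. A hedged sketch of that proximity heuristic (hypothetical pin names and coordinates, not the paper's benchmarks or exact algorithm): the FEOL attacker guesses each missing BEOL connection by pairing every dangling output pin with its nearest dangling input pin.

```python
import math

def proximity_attack(out_pins, in_pins):
    """Greedily match each dangling output pin to the nearest unused input pin.

    out_pins / in_pins: dict mapping pin name -> (x, y) FEOL coordinates.
    Returns a dict of guessed BEOL connections: output pin -> input pin.
    """
    remaining = dict(in_pins)
    guesses = {}
    for name, (x, y) in out_pins.items():
        nearest = min(remaining,
                      key=lambda n: math.hypot(remaining[n][0] - x,
                                               remaining[n][1] - y))
        guesses[name] = nearest
        del remaining[nearest]
    return guesses

# Because the router minimized wirelength, the nearest-pin guess recovers
# the hidden connections in this toy layout.
outs = {"u1.out": (0, 0), "u2.out": (10, 10)}
ins = {"u3.in": (1, 1), "u4.in": (9, 9)}
guesses = proximity_attack(outs, ins)
```

The paper's fault-analysis defense works against exactly this heuristic: by perturbing the FEOL layout so that the nearest candidate pin is deliberately the wrong one.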
One of the growing issues in IC design is how to establish the trustworthiness of chips fabricated by untrusted vendors. This process, often called Trojan detection, is challenging since the specifics of hardware Trojans inserted by intelligent adversaries are difficult to predict and most Trojans do not affect the logic behavior of the circuit unless they are activated. Also, Trojan detection via parametric measurements becomes increasingly difficult with increasing levels of process variations. In this paper we propose a method that maximizes the resolution of each path delay measurement, in terms of its ability to detect the targeted Trojan. In particular, for each Trojan, our approach accentuates the Trojan's impact by generating a vector that sensitizes the shortest path passing through the Trojan's site. We estimate the minimum number of chips to which each vector must be applied to detect the Trojan with sufficient confidence for a given level of process variations. Finally, we demonstrate the significant improvements in effectiveness and cost provided by our approach under high levels of process variations. Experimental results on several benchmark circuits show that our approach achieves a dramatic reduction in test cost compared to classical path delay testing.
Keywords - Hardware Trojan; security; parametric test.
The vulnerability of modern integrated circuits (ICs) to hardware Trojans has been increasing considerably due to the globalization of semiconductor design and fabrication processes. The large number of parts and the decreased controllability and observability of complex IC internals make it difficult to perform Trojan detection efficiently using typical structural tests such as path latency and leakage power. In this paper, we present new accurate Trojan detection methods based on post-silicon multimodal thermal and power characterization. Our approach first estimates the detailed post-silicon spatial power consumption from thermal maps of the IC, then applies 2DPCA to extract features of the spatial power consumption, and finally uses statistical tests against the features of authentic ICs to detect the Trojan. To characterize real-world ICs accurately, we perform our experiments in the presence of 20%-40% CMOS process variation. Our results reveal that the new methodology can detect Trojans whose power consumption is 3-4 orders of magnitude smaller than the total power usage of the chip, and that it scales very well thanks to the spatial view into the IC's internals provided by the thermal mapping.
Integrated circuits (ICs) are now designed and fabricated in a globalized multi-vendor environment making them vulnerable to malicious design changes, the insertion of hardware trojans/malware and intellectual property (IP) theft. Algorithmic reverse engineering of digital circuits can mitigate these concerns by enabling analysts to detect malicious hardware, verify the integrity of ICs and detect IP violations. In this paper, we present a set of algorithms for the reverse engineering of digital circuits starting from an unstructured netlist and resulting in a high-level netlist with components such as register files, counters, adders and subtracters. Our techniques require no manual intervention and experiments show that they determine the functionality of more than 51% and up to 93% of the gates in each of the practical test circuits that we examine.
This work identifies a new formal basis for hardware information flow security by providing a method to separate timing flows from other flows of information. By developing a framework for identifying these different classes of information flow at the gate-level, one can either confirm or rule out the existence of such flows in a provable manner. To demonstrate the effectiveness of our presented model, we discuss its usage on a practical example: a CPU cache in a MIPS processor written in Verilog HDL and simulated in a scenario which accurately models previous cache-timing attacks. We demonstrate how our framework can be used to isolate the timing channel used in these attacks.
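Gate-level information flow tracking of the kind described above augments each signal with a taint bit and each gate with shadow logic. A minimal sketch (a standard precise GLIFT-style construction for one gate, not the paper's full framework): for a 2-input AND, the output is tainted only if a tainted input can actually change the output value.

```python
def glift_and(a, ta, b, tb):
    """Return (output, output_taint) for an AND gate with per-input taint bits.

    A tainted input propagates taint only when the other input does not
    already force the output: b's taint matters only if a is 1 (or a is
    itself tainted), and symmetrically for a.
    """
    out = a & b
    t_out = (a & tb) | (b & ta) | (ta & tb)
    return out, t_out

# A tainted b cannot influence the output when the untainted a is 0 ...
masked = glift_and(0, 0, 1, 1)
# ... but it can when a is 1, so the output carries the taint.
leaks = glift_and(1, 0, 1, 1)
```

Composing such shadow gates over a whole netlist is what makes flow (and, in the paper's refinement, timing-only flow) provable at the gate level rather than merely simulated.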
With continued scaling of NAND flash memory process technology and multiple bits programmed per cell, NAND flash reliability and endurance are degrading. Understanding, characterizing, and modeling the distribution of the threshold voltages across different cells in a modern multi-level cell (MLC) flash memory can enable the design of more effective and efficient error correction mechanisms to combat this degradation. We show the first published experimental measurement-based characterization of the threshold voltage distribution of flash memory. To accomplish this, we develop a testing infrastructure that uses the read-retry feature present in some 2Y-nm (i.e., 20-24nm) flash chips. We devise a model of the threshold voltage distributions taking into account program/erase (P/E) cycle effects, analyze the noise in the distributions, and evaluate the accuracy of our model. A key result is that the threshold voltage distribution can be modeled, with more than 95% accuracy, as a Gaussian distribution with additive white noise, which shifts to the right and widens as P/E cycles increase. The novel characterization and models provided in this paper can enable the design of more effective error tolerance mechanisms for future flash memories.
Index Terms - NAND Flash, Memory Reliability, Memory Signal Processing, Threshold Voltage Distribution, Read Retry
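The key result lends itself to a toy instance (all constants below are made up for illustration, not the paper's measured parameters): model one program state's threshold-voltage distribution as a Gaussian whose mean shifts right and whose spread widens as program/erase cycles accumulate.

```python
import math

def vth_distribution(pe_cycles, mu0=1.0, sigma0=0.08,
                     shift=0.02, widen=0.01):
    """Hypothetical (mu, sigma) of a program state's V_th distribution.

    mu0/sigma0 are the fresh-chip parameters; shift/widen control how the
    Gaussian drifts right and broadens with P/E cycling (logarithmic wear
    is an assumption of this sketch, not a claim from the paper).
    """
    mu = mu0 + shift * math.log1p(pe_cycles)
    sigma = sigma0 + widen * math.log1p(pe_cycles)
    return mu, sigma

fresh = vth_distribution(0)
worn = vth_distribution(3000)
```

The practical point is that an ECC or read-reference designer who knows `(mu, sigma)` as a function of wear can place read thresholds where adjacent states' Gaussians overlap least.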
Massively repeated structures such as SRAM cells usually require extremely low failure rates. This poses a challenge for Monte Carlo based statistical yield analysis, as a huge number of samples has to be drawn in order to observe a single failure. Fast Monte Carlo methods, e.g. importance sampling, remain quite expensive when the anticipated failure rate is very low. In this paper, a new method is proposed to tackle this issue. The key idea is to improve the traditional importance sampling method with an efficient online surrogate model. The proposed method improves the performance of both stages of importance sampling: finding the distorted probability density function, and the distorted sampling itself. Experimental results show that the proposed method is 1e2X~1e5X faster than the standard Monte Carlo approach and achieves a 5X~22X speedup over existing state-of-the-art techniques without sacrificing estimation accuracy.
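The importance-sampling baseline the paper accelerates can be demonstrated self-contained (a toy one-dimensional threshold model, not the paper's SRAM circuit or surrogate): estimate the rare failure probability P(X > 3) for a standard normal by sampling from a distorted density centred on the failure threshold and reweighting each sample by the likelihood ratio.

```python
import math
import random

def is_failure_rate(threshold, n_samples, seed=0):
    """Mean-shift importance-sampling estimate of P(X > threshold), X ~ N(0,1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)          # distorted (shifted) density
        if x > threshold:                      # failure indicator
            # likelihood ratio p(x)/q(x) for N(0,1) vs N(threshold,1)
            total += math.exp(threshold * threshold / 2.0 - threshold * x)
    return total / n_samples

# True value is 1 - Phi(3) ~= 1.35e-3; crude Monte Carlo would need on the
# order of a million samples to see a thousand failures, while the shifted
# sampler makes failures common and reweights them back.
estimate = is_failure_rate(3.0, 50_000)
```

The two stages the abstract names map directly onto this sketch: choosing the shift (here hard-coded to the threshold) is "finding the distorted density", and the loop is "the distorted sampling"; the paper's surrogate model makes both cheaper when each sample is an expensive circuit simulation.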
Recent synchronizer metastability measurements indicate degradation of MTBF with technology scaling, calling for measurement and calibration circuits in 65nm and below. Degradation of parameters can be even worse if the system is operated at extreme supply voltages and temperature conditions. In this work we study the behavior of synchronizers in a broad range of supply voltage and temperature corners. A digital on-chip measurement system is presented that helps to characterize synchronizers in future technologies and a new calibrating system is shown that accounts for changes in delay values due to supply voltage and temperature changes. We present a detailed comparison of measurements and simulations for a fabricated 65nm bulk CMOS circuit and discuss implications of the measurements for synchronization systems in 65nm and beyond. We propose an adaptive self-calibrating synchronizer to account for supply voltage, temperature, global process variations and DVFS.
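The quantity such measurement and calibration circuits ultimately protect is the textbook synchronizer failure model, MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), where t_r is the time allowed for metastability resolution, tau the resolution time constant, T_w the metastability window, and f_clk, f_data the clock and data rates. The values below are illustrative, not the paper's 65nm measurements.

```python
import math

def synchronizer_mtbf(t_r, tau, t_w, f_clk, f_data):
    """Mean time between synchronization failures, in seconds (standard model)."""
    return math.exp(t_r / tau) / (t_w * f_clk * f_data)

# tau degrading from 20 ps to 40 ps (e.g. at a low-voltage, high-temperature
# corner) collapses MTBF exponentially: from millions of years to hours.
good = synchronizer_mtbf(1e-9, 20e-12, 20e-12, 1e9, 100e6)
bad = synchronizer_mtbf(1e-9, 40e-12, 20e-12, 1e9, 100e6)
```

The exponential dependence on tau is why the abstract argues for on-chip measurement and self-calibration across supply-voltage and temperature corners rather than designing to a single nominal tau.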
Scaling of device dimensions toward the nano-scale regime has made it essential to devise novel design techniques for improving circuit robustness. This work proposes an implementation of an adaptive proactive reconfiguration methodology that first monitors process variability and BTI aging among 6T SRAM memory cells and then applies a recovery mechanism to extend the SRAM lifetime. Our proposed technique can extend the memory lifetime by 2X to 4.5X with a silicon area overhead of around 10% for the monitoring units, in a 1kB 6T SRAM memory chip.
Optimal utilization of a multi-channel memory, such as Wide IO DRAM, as shared memory in multi-processor platforms depends on the mapping of memory clients to the memory channels, the granularity at which the memory requests are interleaved in each channel, and the bandwidth and memory capacity allocated to each memory client in each channel. Firm real-time applications in such platforms impose strict requirements on shared memory bandwidth and latency, which must be guaranteed at design-time to reduce verification effort. However, there is currently no real-time memory controller for multichannel memories, and there is no methodology to optimally configure multi-channel memories in real-time systems. This paper has four key contributions: (1) A real-time multi-channel memory controller architecture with a new programmable Multi-Channel Interleaver unit. (2) A novel method for logical-to-physical address translation that enables interleaving memory requests across multiple memory channels at different granularities. (3) An optimal algorithm based on an Integer Linear Program (ILP) formulation to map memory clients to memory channels considering their communication dependencies, and to configure the memory controller for minimum bandwidth utilization. (4) We experimentally evaluate the run-time of the algorithm and show that an optimal solution can be found within 15 minutes for realistically sized problems. We also demonstrate configuring a multi-channel Wide IO DRAM in a High-Definition (HD) video and graphics processing system to emphasize the effectiveness of our approach.
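Contribution (2) above, logical-to-physical translation with configurable interleaving granularity, can be sketched minimally (the round-robin layout and names are assumptions of this sketch, not the paper's Multi-Channel Interleaver unit): consecutive chunks of `granularity` bytes are striped across the channels in turn.

```python
def translate(logical_addr, num_channels, granularity):
    """Map a logical byte address to (channel, physical offset within channel).

    Chunk i of `granularity` bytes lands in channel i % num_channels; within
    a channel, chunks pack densely in the order they arrive there.
    """
    chunk = logical_addr // granularity
    channel = chunk % num_channels
    offset = (chunk // num_channels) * granularity + logical_addr % granularity
    return channel, offset

# With 2 channels and 4-byte granularity, bytes 0-3 land in channel 0,
# bytes 4-7 in channel 1, bytes 8-11 back in channel 0, and so on.
```

Changing `granularity` is exactly the knob the configuration flow optimizes: fine interleaving spreads one client's request over all channels for bandwidth, while coarse interleaving isolates clients onto separate channels for predictable latency.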
Hierarchical scheduling of periodic resources has been increasingly applied to a wide variety of real-time systems due to its ability to accommodate various applications on a single system through strong temporal isolation. This leads to the question of how one can optimize over the resource parameters while satisfying the timing requirements of real-time applications. A great deal of research has been devoted to deriving the analytic model for the bounds on the design parameter of a single resource as well as its optimization. The optimization for multiple periodic resources, however, requires a holistic approach due to the conflicting requirements of the limited computational capacity of a system among resources. Thus, this paper addresses a holistic optimization of multiple periodic resources with regard to minimum system utilization. We extend the existing analysis of a single resource in order for the variable interferences among resources to be captured in the resource bound, and then solve the problem with Geometric Programming (GP). The experimental results show that the proposed method can find a solution very close to the one optimized via an exhaustive search and that it can explore more solutions than a known heuristic method.
Model-based design using synchronous reactive (SR) models is widespread for the development of embedded control software. SR models ease verification and validation, and enable the automatic generation of implementations. In SR models, synchronous finite state machines (FSMs) are commonly used to capture changes of the system state under trigger events. The implementation of a synchronous FSM may be improved by using multiple software tasks instead of the traditional single-task solution. In this work, we propose methods to quantitatively analyze task implementations with respect to a breakdown factor that measures the timing robustness, and an action extensibility metric that measures the capability to accommodate upgrades. We propose an algorithm to generate a correct and efficient task implementation of synchronous FSMs for these two metrics, while guaranteeing the schedulability constraints.
We consider software transactional memory (STM) concurrency control for embedded multicore real-time software, and present a novel contention manager for resolving transactional conflicts, called FBLT. We upper bound transactional retries and task response times under FBLT, and identify when FBLT has better real-time schedulability than the previous best contention manager, PNF. Our implementation in the Rochester STM framework reveals that FBLT yields shorter or comparable retry costs than competitor methods.
In today's real-time system design, a virtual prototype can help to increase both design speed and quality. Developing a virtual prototyping platform requires realistic modeling of the HW system, accurate simulation of the real-time SW, and integration with a reactive real-time environment. Such a VP simulation platform is often difficult to develop. In this paper, we present a case study of an autonomous two-wheeled robot to show how to rapidly develop a virtual prototyping platform in SystemC/TLM that adequately aids the design of this unstable system with hard real-time constraints. Our approach integrates four major model components. First, an accurate physical model of the robot is provided. Second, a virtual world is modeled in Java that offers a 3D environment for the robot to move in. Third, the embedded control SW is developed. Finally, the overall HW system is modeled in SystemC at transaction level. This HW model wraps the physical model, interacts with the virtual world, and simulates the real-time SW by integrating an Instruction Set Simulator of the embedded CPU. By integrating these components into a platform, designers can efficiently optimize the embedded SW architecture, explore the design space, and check real-time conditions for different system parameters such as buffer sizes, CPU frequency, or cache size.
Keywords - Virtual Prototyping, Transaction Level Modeling, Real-time Constraints, Embedded Systems
Engine control units in the automotive industry are particularly challenging real-time systems with regard to their real-time analysis. Some of the tasks of such an engine control unit are triggered by the engine itself, i.e. the faster the angular velocity of the engine, the more frequently the tasks are executed. Furthermore, the execution time of a task may vary with the angular velocity of the engine. As a result, the worst case does not necessarily occur when all tasks are activated simultaneously; hence this behavior cannot be addressed appropriately with the currently available real-time analysis methods. In this paper we take a first step towards a real-time analysis for an engine control unit. We present a sufficient real-time analysis assuming that the angular velocity of the engine is arbitrary but fixed.
Chip-level microscale liquid cooling reduces thermal resistance and improves datacenter efficiency: higher coolant temperatures eliminate chillers and allow thermal energy re-use in cold climates. Liquid cooling also enables an unprecedented density in future computers, approaching that of a human brain. This is mediated by a dense 3D architecture for interconnects, fluid cooling, and power delivery via energetic chemical compounds transported in the same fluid. Vertical integration improves memory proximity, and electrochemical power delivery creates valuable space for communication. This strongly improves large-system efficiency, thereby allowing computers to grow beyond exa-scale.
Keywords - datacenter; energy; reuse; packaging; cooling; power-supply; stacking
Server consolidation plays a key role in mitigating the continuous power increase of datacenters. The recent advent of scale-out applications (e.g., web search, MapReduce, etc.) necessitates revisiting existing server consolidation solutions, since their characteristics are distinctively different from traditional high-performance computing (HPC): they are user-interactive, latency-critical, and operate on large data sets split across a number of servers. This paper presents a power saving solution for datacenters that specifically targets the distinctive characteristics of scale-out applications. More specifically, we take into account correlation information of core utilization among virtual machines (VMs) in server consolidation to lower the actual peak server utilization. We then utilize this reduction to achieve further power savings by aggressively yet safely lowering the server operating voltage and frequency level. We have validated the effectiveness of the proposed solution using 1) multiple clusters of real-life scale-out application workloads based on web search and 2) utilization traces obtained from real datacenter setups. According to our experiments, the proposed solution provides up to 13.7% power savings with up to 15.6% improvement of Quality-of-Service (QoS) compared to existing correlation-aware VM allocation schemes for datacenters.
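Why correlation information lowers the "actual peak" can be shown with toy utilization traces (our own numbers, not the paper's datacenter data): when co-located VMs are anti-correlated, the peak of the *sum* of their traces is far below the sum of their individual peaks, so the host can be provisioned, and its voltage/frequency level set, for the former rather than the latter.

```python
def sum_of_peaks(traces):
    """Naive provisioning target: each VM contributes its worst-case load."""
    return sum(max(t) for t in traces)

def peak_of_sum(traces):
    """Correlation-aware target: the worst simultaneous load actually observed."""
    return max(sum(samples) for samples in zip(*traces))

# Two VMs whose load peaks never coincide: naive provisioning would budget
# for 1.6 cores' worth of load, but the host never sees more than 0.9.
vm_a = [0.8, 0.2, 0.7, 0.1]
vm_b = [0.1, 0.7, 0.2, 0.8]
```

The gap between the two numbers is the headroom the paper converts into voltage/frequency reduction while still meeting QoS.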
An increasing amount of information technology services and data are now hosted in the cloud, primarily due to the cost and scalability benefits for both the end-users and the operators of the warehouse-scale datacenters (DCs) that host cloud services. Hence, it is vital to continuously improve the capabilities and efficiency of these large-scale systems. Over the past ten years, capability has improved by increasing the number of servers in a DC and the bandwidth of the network that connects them. Cost and energy efficiency have improved by eliminating the high overheads of the power delivery and cooling infrastructure. To achieve further improvements, we must now examine how well we are utilizing the servers themselves, which are the primary determinant for DC performance, cost, and energy efficiency. This is particularly important since the semiconductor chips used in servers are now energy limited and their efficiency does not scale as fast as in the past. This paper motivates the need for resource efficient computing in large-scale datacenters and reviews the major challenges and research opportunities.
General purpose graphics processing units (GPGPUs) have recently been explored as a new computing paradigm for accelerating compute-intensive EDA applications. Such massively parallel architectures have been applied in accelerating the simulation of digital designs during several phases of their development - corresponding to different abstraction levels, specifically: (i) gate-level netlist descriptions, (ii) register-transfer level and (iii) transaction-level descriptions. This embedded tutorial presents a comprehensive analysis of the best results obtained by adopting GP-GPUs in all these EDA applications.
Many applications are inherently resilient to inexactness or approximations in their underlying computations. Approximate circuit design is an emerging paradigm that exploits this inherent resilience to realize hardware implementations that are highly efficient in energy or performance. In this work, we propose Substitute-And-SIMplIfy (SASIMI), a new systematic approach to the design and synthesis of approximate circuits. The key insight behind SASIMI is to identify signal pairs in the circuit that assume the same value with high probability, and substitute one for the other. While these substitutions introduce functional approximations, if performed judiciously they allow some logic to be eliminated from the circuit while also enabling downsizing of gates on critical paths (simplification), resulting in significant power savings. We propose an automatic synthesis framework that performs substitution and simplification iteratively, while ensuring that a user-specified quality constraint is satisfied. We extend the proposed framework to perform automatic synthesis of quality-configurable circuits that can dynamically operate at different accuracy levels depending on application requirements. We used SASIMI to automatically synthesize approximate and quality-configurable implementations of a wide range of arithmetic units (adders, multipliers, MAC), complex data paths (SAD, FFT butterfly, Euclidean distance) and ISCAS85 benchmarks, using various error metrics such as error rate and average error magnitude. The synthesized approximate circuits demonstrate power improvements of 10%-28% for tight error constraints, and 30%-60% for relaxed error constraints. The quality-configurable circuits obtain a 14%-40% improvement in energy in the approximate mode, while incurring no energy overheads in the accurate mode.
Index Terms - Low Power Design, Approximate Computing, Approximate Circuits, Logic Synthesis
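The substitution criterion at the heart of such an approach can be sketched on a toy circuit. The two signals below are invented for illustration; the actual SASIMI heuristic additionally weighs the downstream error impact and the logic that each substitution frees up:

```python
import itertools

# Toy combinational netlist: two internal signals over 2-bit inputs a and b.
def sig_carry(a, b):    # carry out of the low bits
    return a[0] & b[0]

def sig_and_hi(a, b):   # AND of the high bits (candidate substitute)
    return a[1] & b[1]

def agreement_probability(f, g, width=2):
    """Fraction of all input vectors on which signals f and g agree."""
    same = total = 0
    for bits in itertools.product([0, 1], repeat=2 * width):
        a, b = bits[:width], bits[width:]
        same += f(a, b) == g(a, b)
        total += 1
    return same / total

p = agreement_probability(sig_carry, sig_and_hi)
# If p clears a quality threshold, a SASIMI-style pass would substitute one
# signal for the other and let logic optimization delete the freed-up gates.
print(p)  # 0.625
```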
System reliability is a crucial concern, especially in multicore systems, which tend to have high power density and hence high temperature. Existing reliability-aware methods are either slow and non-adaptive (offline techniques) or do not use task assignment and scheduling to compensate for uneven core wear states (online techniques). In this article, we present a dynamically-activated task assignment and scheduling algorithm, based on theoretical results, that explicitly optimizes system lifetime. We also propose a data distillation method that dramatically reduces the size of the thermal profiles to make full system reliability analysis viable online. Simulation results show that our algorithm yields a 27-291% improvement in system lifetime compared to existing techniques for four-core systems.
By combining analytical and numerical simulation techniques, this work develops a hybrid thermal simulator, NUMANA, which can effectively deal with complicated material structures, to estimate the temperature profile of a 3-D IC. Compared with a commercial tool, ANSYS, its maximum relative error is only 1.84%. Compared with a well-known linear system solver, SuperLU, it achieves orders of magnitude speedup.
The high heat flux and compact structure of three-dimensional integrated circuits (3D ICs) make conventional air-cooled devices more susceptible to overheating. Liquid cooling is an alternative that can improve heat dissipation and reduce thermal issues. Fast and accurate thermal models are needed to appropriately dimension the cooling system at design time. Several models have been proposed to study different designs, but generally with low simulation performance. In this paper, we present an efficient model of the transient thermal behaviour of liquid-cooled 3D ICs. In our experiments, our approach is 60 times faster and uses 600 times less memory than state-of-the-art models, while maintaining the same level of accuracy.
Index Terms - 3D ICs, Liquid-cooling, Compact Thermal Model, Finite Difference Method
Dark silicon is an emerging problem in multi-core processors, where it is not possible to enable all cores simultaneously because of either insufficient parallelism in software applications or high spatial power densities that generate hot-spot constraints. Superlattice-based thermoelectric cooling (TEC) is a promising technology that offers large heat pumping capability and the ability to target hot spots of each core independently. In this paper, we devise novel system-level methods that address the two main sources of dark silicon using superlattice TECs. Our methods leverage the TECs in conjunction with dynamic voltage and frequency scaling and thread-count control to maximize the performance of a multi-core processor under thermal and power constraints. Using an experimental setup based on a quad-core processor, we provide an evaluation of the trade-offs among performance, temperature and power consumption arising from the use of superlattice-based TECs. Our results demonstrate the potential of this emerging cooling technology in mitigating dark silicon problems and in improving the performance of multi-core processors.
Many-core architectures use large numbers of small temperature sensors to detect thermal gradients and guide thermal management schemes. In this paper, a technique is described to identify thermal sensors that are operating outside a required accuracy. Unlike previous on-chip temperature estimation approaches, our algorithms are optimized to run on-line while thermal management decisions are being made. The accuracy of a sensor is determined by comparing its readings to expected values from a probability distribution function determined from surrounding sensors. Experiments show that a sensor operating outside a desired accuracy can be identified with a detection rate of over 90% and an average false alarm rate of < 6%, at a confidence level of 90%. The run time of our method is shown to be around 3x lower than that of a recently-published temperature estimation method, enhancing its suitability for runtime implementation.
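A minimal sketch of this kind of check, assuming a normal distribution fitted to the neighboring sensors; the readings and the 90% confidence multiplier below are hypothetical, and the paper's method is more involved:

```python
import statistics

def sensor_suspect(reading, neighbor_readings, confidence_k=1.645):
    """Flag a sensor whose reading falls outside a confidence band derived
    from a normal distribution fitted to the surrounding sensors.
    confidence_k = 1.645 corresponds to a two-sided 90% band."""
    mu = statistics.mean(neighbor_readings)
    sigma = statistics.stdev(neighbor_readings)
    return abs(reading - mu) > confidence_k * sigma

# Hypothetical readings (deg C) from the sensors adjacent to the one under test.
neighbors = [61.0, 62.5, 60.8, 61.7, 62.1]
print(sensor_suspect(61.5, neighbors))   # False: consistent with neighbors
print(sensor_suspect(70.0, neighbors))   # True: likely out of calibration
```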
Verification benefits from removing logic that is not relevant for a proof. Techniques for doing this are known as localization abstraction. Abstraction is often performed by selecting a subset of gates to be included in the abstracted model; the signals feeding into this subset become unconstrained cut-points. In this paper, we propose several improvements to substantially increase the scalability of automated abstraction. In particular, we show how a better integration between the BMC engine and the SAT solver is achieved, resulting in a new hybrid abstraction engine that is faster and uses less memory. This engine speeds up computation by constant propagation and circuit-based structural hashing while collecting UNSAT cores for the intermediate proofs in terms of a subset of the original variables. Experimental results show improvements in the abstraction depth and size.
Craig interpolation has become a powerful and universal tool in the formal verification domain, where it is used not only for Boolean systems, but also for timed systems, hybrid systems, and software programs. The latter systems demand interpolation for fragments of first-order logic. When it comes to model checking, the structural compactness of interpolants is necessary for efficient algorithms. In this paper, we present a method to reduce the size of interpolants derived from proofs of unsatisfiability produced by SMT (Satisfiability Modulo Theory) solvers. Our novel method uses structural arguments to modify the proof in such a way that the resulting interpolant is guaranteed to be smaller. To show the effectiveness of our approach, we apply it to an extensive set of formulas from symbolic hybrid model checking.
Automatic abstraction is an important component of modern formal verification flows. A number of effective SAT-based automatic abstraction methods use unsatisfiable cores to guide the construction of abstractions. In this paper we analyze the impact of unsatisfiable core minimization, using state-of-the-art algorithms for the computation of minimally unsatisfiable subformulas (MUSes), on the effectiveness of a hybrid (counterexample-based and proof-based) abstraction engine. We demonstrate empirically that core minimization can lead to a significant reduction in the total verification time, particularly on difficult testcases. However, the resulting abstractions are not necessarily smaller. We notice that by varying the minimization effort the abstraction size can be controlled in a non-trivial manner. Based on this observation, we achieve a further reduction in the total verification time.
This paper addresses the problem of reducing the size of Craig interpolants generated within inner steps of SAT-based Unbounded Model Checking. Craig interpolants are obtained from refutation proofs of unsatisfiable SAT runs, in terms of and/or circuits of linear size w.r.t. the proof. Existing techniques address proof reduction, whereas interpolant compaction is typically considered as an implementation problem, tackled using standard logic synthesis techniques. We propose an integrated three-step process, in which we: (1) exploit an existing technique to detect and remove redundancies in refutation proofs, (2) apply combinational logic reductions (constant propagation, ODC-based simplifications, and BDD-based sweeping) directly on the proof graph data structure, (3) finally apply ad hoc combinational logic synthesis steps on interpolant circuits. The overall procedure is novel (as well as parts of the above listed steps), and represents an advance w.r.t. the state of the art. The paper includes an experimental evaluation, showing the benefits of the proposed technique, on a set of benchmarks from the Hardware Model Checking Competition 2011.
The accuracy of control systems analysis is of paramount importance as even minor design flaws can lead to disastrous consequences in this domain. This paper provides a higher-order-logic theorem proving based framework for the formal analysis of steady state errors in feedback control systems. In particular, we present the formalization of control system foundations, like transfer functions, summing junctions, feedback loops and pickoff points, and steady state error models for the step, ramp and parabola cases. These foundations can be built upon to formally specify a wide range of feedback control systems in higher-order logic and reason about their steady state errors within the sound core of a theorem prover. The proposed formalization is based on the complex number theory of the HOL-Light theorem prover. For illustration purposes, we present the steady state error analysis of a solar tracking control system.
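For reference, the classical steady-state error results that such a formalization must capture follow from the final value theorem. For a unity-feedback system with open-loop transfer function $G(s)$ and reference input $R(s)$, the error signal satisfies:

```latex
e_{ss} \;=\; \lim_{t \to \infty} e(t) \;=\; \lim_{s \to 0} \frac{s\,R(s)}{1 + G(s)},
\qquad
\begin{aligned}
\text{step } R(s)=\tfrac{1}{s}:\;\; & e_{ss} = \frac{1}{1+K_p}, & K_p &= \lim_{s\to 0} G(s)\\
\text{ramp } R(s)=\tfrac{1}{s^2}:\;\; & e_{ss} = \frac{1}{K_v}, & K_v &= \lim_{s\to 0} s\,G(s)\\
\text{parabola } R(s)=\tfrac{1}{s^3}:\;\; & e_{ss} = \frac{1}{K_a}, & K_a &= \lim_{s\to 0} s^2 G(s)
\end{aligned}
```

These are the standard textbook forms for the step, ramp and parabola cases; the paper's contribution is proving such results within the sound core of HOL-Light rather than the results themselves.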
Currently, BDD packages such as CUDD depend on chained hash tables. Although these are efficient in terms of memory usage, they exhibit poor cache performance due to dynamic allocation and indirection of data. Moreover, they are less appealing for concurrent environments as they need thread-safe garbage collectors. Furthermore, to take advantage of multi-core platforms, it is best to re-engineer the underlying algorithms and reconsider whether traditional depth-first search (DFS) construction, breadth-first search (BFS) construction, or a hybrid of BFS with DFS works best. In this paper, we introduce a novel BDD package, friendly to multi-core platforms, that builds on a number of heuristics. Firstly, we re-structure the Unique Table (UT) using concurrency-friendly hopscotch hashing to improve caching performance. Secondly, we re-engineer the BFS queues with hopscotch hashing. Thirdly, we propose a novel technique that lets the BFS queues simultaneously serve as a Computed Table (CT). Finally, we propose a novel incremental mark-sweep garbage collector (GC). We report results for both BFS and hybrid BFS-DFS construction methods. With these techniques, even with a single-threaded BDD, we were able to achieve a speedup of up to 8x compared to a conventional single-threaded CUDD package. When two threads are launched, another 1.5x speedup is obtained.
A low-power and low-voltage BBPLL-based sensor interface for resistive sensors in Wireless Sensor Networks is presented. The interface is optimized towards low power, fast start-up time and fast conversion time, making it primarily useful in autonomous wireless sensor networks. The interface is time/frequency-based, making it less sensitive to lower supply voltages and other analog non-idealities, whereas conventional amplitude-based interfaces suffer significantly from these non-idealities, especially in smaller CMOS technologies. The sensor-to-digital conversion is based on the locking behavior of a digital PLL, which also includes transient behavior after startup. Several techniques such as VDD scaling, coarse and fine tuning, and pulse-width modulated feedback are implemented to decrease the transient and acquisition time and the power, in order to optimize the total energy consumption. In this way, the sensor interface consumes only 61μW from a 0.8V DC power supply with a one-sample conversion time of less than 20μs worst-case. The sensor interface is designed and implemented in UMC130 CMOS technology and outputs 8 bits in parallel with 7.72 ENOB. Due to its fast start-up time, fast conversion time and low power consumption, it consumes only 5.79 pJ/bit-conversion, a state-of-the-art energy efficiency compared to recent resistive sensor interfaces.
We propose a methodology for reachability analysis of nonlinear analog circuits to verify safety properties. Our iterative reachable set reduction algorithm initially considers the entire state space as reachable. The algorithm iteratively determines which regions in the state space are unreachable and removes those unreachable regions from the over-approximated reachable set. We use the State Partitioning Tree (SPT) algorithm to recursively partition the reachable set into convex polytopes. We determine the reachability of adjacent neighbor polytopes by analyzing the direction of state space trajectories at the common faces between two adjacent polytopes. We model the direction of the trajectories as a reachability decision function that we solve using a sound root counting method. Our analysis remains faithful to the nonlinearities of the system. We demonstrate the memory efficiency of our algorithm through computation of the reachable set of a Van der Pol oscillator circuit.
A fast technique for proving steady-state analog circuit operation constraints is described. Based on SAT, the technique is applicable to practical circuit design and modeling scenarios as it does not require algebraic device models. Despite the complexity of representing accurate transistor I/V characteristics, run-time and problem scaling behavior is excellent.
Index Terms - Analog Verification, Discrete Representation, Circuit Modeling, SAT
This paper presents a technique for automatically extracting analytical behavioral models from the netlist of a nonlinear analog circuit. Subsequent snapshots of the internal circuit Jacobian are sampled during time-domain analysis and are then processed into Transfer Function Trajectories (TFTs). The TFT data project the nonlinear dynamics of the system onto a hyperplane in the mixed state-space/frequency domain. Next, a Recursive Vector Fitting (RVF) algorithm is used to extract an analytical Hammerstein model from the TFT data in an automated fashion. The resulting RVF model equations are implemented as an accurate nonlinear behavioral model in the time domain. The model is guaranteed stable by construction and can trade off complexity for accuracy. The technique is validated on a high-speed analog buffer circuit containing 70 linear and nonlinear components, showing a 7X speedup.
A statistical extension of the ultra-compact Virtual Source (VS) MOSFET model is developed here for the first time. The characterization uses a statistical extraction technique based on the backward propagation of variance (BPV) with variability parameters derived directly from the nominal VS model. The resulting statistical VS model is extensively validated using Monte Carlo simulations, and the statistical distributions of several figures of merit for logic and memory cells are compared with those of a BSIM model from a 40-nm CMOS industrial design kit. The comparisons show almost identical distributions with distinct run time advantages for the statistical VS model. Additional simulations show that the statistical VS model accurately captures non-Gaussian features that are important for low-power designs.
Flexible electronics are a promising alternative for portable consumer applications, with many advantages. However, circuit design for flexible electronics is still challenging, especially for sensitive analog circuits. Significant parameter variations and bending effects of flexible TFTs further increase the difficulties for circuit designers. In this paper, an automatic circuit sizing technique is proposed for analog circuits with flexible TFTs. The process variations and bending effects of flexible TFTs are considered simultaneously in the optimization flow. As shown in the experimental results, the proposed approach can further improve the design yield and significantly reduce the design overhead.
Functional testing of embedded processors is a challenging task, and additional constraints are imposed when a functional test procedure has to be executed online. In the latter case, a significant fraction of the processor faults cannot be detected, either because they relate to the debug/test circuitry or because of memory configuration constraints. In this paper, we identify several sources of on-line functional untestability and propose a set of techniques to exactly measure their impact on the fault coverage. Experimental results on an industrial case study are reported, showing that the fault coverage loss due to the considered untestability sources may exceed 13%.
Soft error rates are estimated based on the worst-case architectural vulnerability factor (AVF). Tracking accurate AVF in real time is therefore very attractive to computer designers: more accurate AVF numbers allow turning on more features at runtime while keeping the promised SDC and DUE rates. This paper presents a hardware mechanism based on linear regression to estimate the AVF (SDC and DUE) of the register file for out-of-order cores. Our results show that we are able to achieve a high correlation factor at low cost.
This paper presents an innovative approach to detect soft errors in Ternary Content Addressable Memories (TCAMs) based on the use of Bloom Filters. The proposed approach is described in detail and its performance results are presented. The advantages of the proposed method are that no modifications to the TCAM device are required, the checking is done on-line and the approach has low power and area overheads.
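The underlying idea can be sketched as follows: a Bloom filter shadows the TCAM contents so that lookups can be cross-checked without modifying the TCAM device. The entry format and parameters below are invented for illustration, and the paper's actual checking scheme differs in detail:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over TCAM entry strings (illustrative only)."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _indexes(self, item):
        # Derive k bit positions from salted hashes of the entry.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def maybe_contains(self, item):
        # No false negatives: a filter miss proves the entry is absent.
        return all(self.bits[idx] for idx in self._indexes(item))

# Shadow the TCAM contents in the filter; a lookup that "hits" in the TCAM
# but misses in the filter indicates a soft-error-corrupted entry.
bf = BloomFilter()
bf.add("prefix:10.0.0.0/8")
print(bf.maybe_contains("prefix:10.0.0.0/8"))  # True
```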
We propose an AVF-driven parity selection method for protecting modern microprocessor in-core memory arrays against multi-bit upsets (MBUs). As MBUs constitute more than 50% of the upsets in the latest technologies, error correcting codes or physical interleaving are typically employed to effectively protect out-of-core memory structures, such as caches. However, such methods are not applicable to high-performance in-core arrays, due to computational complexity, high delay and area overhead. To this end, we revisit parity as an effective mechanism to detect errors and we resort to pipeline flushing and checkpointing for correction. We demonstrate that optimal parity tree construction for MBU detection is a computationally complex problem, which we then formulate as an integer-linear-program (ILP). Experimental results on Alpha 21264 and Intel P6 in-core memory arrays demonstrate that optimal parity tree selection can achieve substantial vulnerability reduction, even when a small number of bits are added to the parity trees, compared to simple heuristics. Furthermore, the ILP formulation allows us to find better solutions by effectively exploring the solution space in the presence of multiple parity trees; results show that the presence of 2 parity trees offers a vulnerability reduction of more than 50% over a single parity tree.
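Why the bit-to-tree assignment matters for MBUs can be seen in a small sketch (the 8-bit word below is arbitrary; the paper solves the assignment optimally via ILP rather than by fixed interleaving):

```python
from functools import reduce
from operator import xor

def parity(bits):
    """Even parity of a list of bits."""
    return reduce(xor, bits, 0)

word = [1, 0, 1, 1, 0, 0, 1, 0]
stored_p = parity(word)

# Single-bit upset: detected by one parity tree over the whole word.
flipped = word.copy(); flipped[2] ^= 1
print(parity(flipped) != stored_p)   # True: mismatch detected

# Adjacent double-bit upset (MBU) under the same tree: the flips cancel.
mbu = word.copy(); mbu[2] ^= 1; mbu[3] ^= 1
print(parity(mbu) != stored_p)       # False: MBU goes undetected

# Two trees over interleaved bits (even/odd positions) catch the same MBU,
# because each tree sees exactly one of the two flipped bits.
even_p, odd_p = parity(word[0::2]), parity(word[1::2])
print(parity(mbu[0::2]) != even_p or parity(mbu[1::2]) != odd_p)  # True
```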
Die stacking based on Through-Silicon Vias (TSVs) is considered an efficient way to reduce power consumption and form factor. At the current stage, the failure rate of TSVs is still high, so some type of defect tolerance scheme is required. Meanwhile, the double-via concept, normally used in traditional layer-to-layer interconnection, can be one feasible tolerance scheme. The double-via/TSV has a benefit compared to TSV repair: it eliminates the fuse configuration procedure as well as the fuse layer. However, the double-TSV has a problem of signal degradation and leakage caused by short defects. In this work, an enhanced scheme for the double-TSV is proposed to solve the short-defect problem through signal path division and VDD isolation. Results show that the enhanced double-TSV can tolerate both open and short defects, with reasonable area and timing overhead.
Keywords - TSV; 3D-IC; open-defect; short-defect; defect tolerance; yield improvement
Advances in memory compiler technology have helped accelerate the integration of hundreds of unique embedded memory macros in contemporary low-power, high-speed SoCs. The heavy use of compiled memories poses multiple challenges on the characterization, validation, and reliability fronts. This motivates solutions that can reduce overall cost, time, and risk to certify memories through the identification of a reduced set of "fundamental" memory macros that can be used one or more times to realize all memory instances in the design. This paper describes MemPack, a fast, general method based upon the classical change-making algorithm for the identification of such fundamental memory macros. By relaxing the need for exact realization of memories and tolerating wastage within the context of change-making, MemPack enables tradeoffs between memory capacity and reduction in the number of fundamental macros. It also controls multiplexing and instantiation costs, minimizing the impact on critical path delay and address line loading. Results on industrial and synthetic benchmarks for three different optimization objectives (performance, balance, and minimization) show that MemPack is effective in identifying fundamental sets that are as much as 16x smaller than the original set for 0.8-4.7% wasted bits.
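The change-making core of such an approach can be sketched with a dynamic program. The macro sizes and capacities below are invented, and MemPack's real cost model also accounts for multiplexing and instantiation overheads:

```python
def min_macros_exact(capacity, macro_sizes):
    """Classic change-making DP: fewest macros whose sizes sum exactly to
    `capacity` (float('inf') if no exact composition exists)."""
    INF = float("inf")
    best = [0] + [INF] * capacity
    for total in range(1, capacity + 1):
        for size in macro_sizes:
            if size <= total and best[total - size] + 1 < best[total]:
                best[total] = best[total - size] + 1
    return best[capacity]

def min_macros_relaxed(capacity, macro_sizes, max_waste):
    """MemPack-style relaxation: tolerate up to `max_waste` unused bits,
    trading wasted capacity for fewer fundamental macros."""
    return min(min_macros_exact(c, macro_sizes)
               for c in range(capacity, capacity + max_waste + 1))

sizes = [16, 48]                          # hypothetical fundamental macros (kb)
print(min_macros_exact(80, sizes))        # 3: 48 + 16 + 16, no wastage
print(min_macros_relaxed(80, sizes, 16))  # 2: 48 + 48, wasting 16 kb
```

Relaxing exactness cuts the macro count from three to two at the price of 16 wasted kilobits, which is exactly the capacity/macro-count trade-off the abstract describes.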
Register renaming is a widely used technique to remove false dependencies in contemporary superscalar microprocessors. A register alias table (RAT) is formed to hold the current locations of the values that correspond to the architectural registers. Some recently designed processors take a copy of the rename table at each branch instruction, in order to recover its contents when a misspeculation occurs. In this paper, we first investigate the RAT's vulnerability to transient errors. Then we analyze the vulnerability of RAT checkpoints and propose two techniques for soft error detection and correction utilizing redundantly taken copies of the entries whose content is the same as in the previous and/or next checkpoints. Simulation results on the SPEC 2006 benchmarks reveal that on average RAT vulnerability is 25% and checkpoint vulnerability is 6%. Results also reveal that redundancy exists across sequential checkpoint copies and can be used for error detection and correction purposes. We propose techniques that exploit this redundancy and show that faults in 41% of all checkpoints and 44% of rolled-back checkpoints can be detected, and errors in 33% of the rolled-back checkpoints can be corrected. Since we exploit the already available storage, the proposed error detection and correction techniques can be implemented with minimal hardware overhead.
Keywords - Microprocessors, Register Rename, Checkpoint, RAT Vulnerability, Soft Error, Error Detection and Correction
Many-core architectures used in embedded systems will contain hundreds of processors in the near future. Already, it is necessary to study how to manage such systems when dynamically scheduling applications with different phases of parallelism and resource demands. A recent research area called invasive computing proposes a decentralized workload management scheme for such systems: applications may dynamically claim additional processors during execution and release them again. In this paper, we study how to apply the concepts of invasive computing to realize decentralized core allocation schemes in homogeneous many-core systems, with the goal of maximizing the average speedup of running applications at any point in time. A theoretical analysis based on game theory shows that it is possible to define a core allocation scheme that uses only local information exchange between applications, but is still able to provably converge to optimal results. The experimental evaluation demonstrates that this allocation scheme reduces the overhead in terms of exchanged messages by up to 61.4% and even the convergence time by up to 13.4% compared to an allocation scheme where all applications exchange information globally with each other.
Cluster-based architectures are increasingly being adopted to design embedded many-cores. These platforms can deliver very high peak performance within a contained power envelope, provided that programmers can make effective use of the available parallel cores. This is becoming an extremely difficult task, as embedded applications are growing in complexity and exhibit irregular and dynamic parallelism. The OpenMP tasking extensions represent a powerful abstraction to capture this form of parallelism. However, efficiently supporting them on cluster-based embedded SoCs is not easy, because the fine-grained parallel workload present in embedded applications cannot tolerate high memory and run-time overheads. In this paper, we present our design of the runtime support layer for OpenMP tasking on an embedded shared-memory cluster, identifying key aspects to achieving performance and discussing important architectural support for removing major bottlenecks.
Embedded architectures are moving to multi-core and many-core concepts in order to sustain ever-growing computing requirements within complexity and power budgets. Programming many-core architectures not only requires parallel programming skills, but also efficient exploitation of fine-grained parallelism at both the architecture and runtime levels. Scheduler reactivity becomes increasingly important as task granularity is reduced, in order to keep the scheduling overhead to a minimum. This paper presents a lightweight fork-join framework for scheduling fine-grained parallel tasks on embedded many-core systems. The asynchronous nature of the fork-join model used in this framework dramatically decreases its scheduling overhead. Experiments conducted in this paper show that the overhead induced by this framework is 33 cycles per scheduled task. We also show that near-ideal speedup can be obtained by the ARTM framework for data-parallel applications and that ARTM achieves better results than other state-of-the-art parallelization techniques.
We present the novel concept of Pipelets: self-organizing stages of software pipelines that monitor their computational demands and communication patterns and interact to optimize the performance of the application they belong to. They enable dynamic task remapping and exploit application-specific properties. Our experiments show that they improve performance by up to 31.2% compared to state-of-the-art when resource demands of applications alter at runtime as is the case for many complex applications.
As the semiconductor process is scaled down, the endurance of NAND flash memory greatly deteriorates. To overcome such a poor endurance characteristic and to provide a reasonable storage lifetime, system-level endurance enhancement techniques are rapidly adopted in recent NAND flash-based storage devices like solid-state drives (SSDs). In this paper, we propose an integrated lifetime management approach for SSDs. The proposed lifetime management technique combines several lifetime-enhancement schemes, including lossless compression, deduplication, and performance throttling, in an integrated fashion so that the lifetime of SSDs can be maximally extended. By selectively disabling less effective lifetime-enhancement schemes, the proposed technique achieves both high performance and high energy efficiency while meeting the required lifetime. Our evaluation results show that the proposed technique, over the SSDs with no lifetime management schemes, improves write performance by up to 55% and reduces energy consumption by up to 43% while satisfying a 5-year lifetime warranty.
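How such write-reduction stages compose can be sketched as follows. The class and numbers are illustrative only; real SSD firmware operates on fixed-size pages, adds performance throttling, and selectively disables stages as the abstract describes:

```python
import hashlib
import zlib

class LifetimeAwareStore:
    """Sketch of write-reduction stages an SSD might chain:
    deduplication first, then lossless compression (illustrative only)."""
    def __init__(self):
        self.blocks = {}          # fingerprint -> compressed payload
        self.physical_bytes = 0   # bytes actually written to flash

    def write(self, data: bytes):
        fp = hashlib.sha256(data).hexdigest()
        if fp in self.blocks:     # duplicate block: no flash write at all
            return fp
        payload = zlib.compress(data)
        self.blocks[fp] = payload
        self.physical_bytes += len(payload)
        return fp

store = LifetimeAwareStore()
page = b"log entry " * 400       # highly compressible 4000-byte page
store.write(page)
store.write(page)                # deduplicated: stored only once
print(store.physical_bytes)      # far below the 8000 logical bytes written
```

Fewer physical bytes written per logical write directly translates into fewer program/erase cycles, which is the lifetime lever the paper's integrated management exploits.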
Invited Speaker: Patrick Leduc
Panelists: Patrick Blouet, Brendan Farley, Anna Fontanelli, Dragomir Milojevic, Steve Smith
If asked "who needs faster planes?" the vast majority of the 2.75 billion airline passengers (source: IATA 2011) would say that they do need faster planes, and that they need them right now. Still, the commercial aircrafts cruising speed has remained the same - 800 km/hour - over the last 50+ years, and after the sad end of the Concorde project, neither Airbus nor Boeing are seriously working on the topic. Along the same lines, when asked "who needs 3D-IC?", most IC designers say that they desperately need 3D-IC to keep advancing electronic products performance, whilst addressing the needs of higher bandwidth, lower cost, heterogeneous integration, and power constraints. Still, 3D-IC continues to be the technology of the future. What are the road blocks towards 3D-IC adoption? Is it process technology, foundry or OSAT commercial offering, or EDA, or the business economics that is holding 3D-IC on the ground? In the introductory presentation of this panel session, LETI Patrick Leduc will illustrate the state-of-the-art of commercial, mainstream 3D-IC. EPFL Professor Giovanni de Micheli will then moderate an industry and research panel, to understand what are the key factors preventing 3D-IC from becoming the technology of today
The developments in micro-nano-electronics, biology and neurosciences make it possible to imagine a new world where vital signs can be monitored continuously, artificial organs can be implanted in human bodies, and interfaces between the human brain and the environment can extend human capabilities, thus making the dream of Dr. Frankenstein come true. This paper surveys some of the most innovative implantable devices and offers some perspectives on the ethical issues that come with the introduction of this technology.
The cost of healthcare is increasing worldwide. Without disruptive changes, a large part of the population in many developed countries will no longer be able to afford healthcare by 2040. Part of the solution will come from focusing on prevention. Having personal tools at everyone's disposal, which will help people to monitor their health and to change their behavior, can enable disease prevention. Managing weight and managing stress are two societal challenges where a behavioral change can have huge cost savings. In this paper, it is shown how wearable sensor devices are able to detect energy expenditure as well as monitor stress levels. System aspects and validation are discussed. Because convenience and user acceptance are key for making these tools a success, smaller form factors and more convenient sensor locations on the body are required.
Keywords - wireless sensors, body-area networks, healthcare
A power delivery system for implantable biosensors is presented. The system, embedded into a skin patch and located directly over the implantation area, is able to transfer up to 15 mW wirelessly through the body tissues by means of an inductive link. The inductive link is also used to achieve bidirectional data communication with the implanted device. Downlink communication (ASK) is performed at 100 kbps; uplink communication (LSK) is performed at 66.6 kbps. The received power is managed by an integrated system including a voltage rectifier, an amplitude demodulator and a load modulator. The power management system is presented and evaluated by means of simulations.
Index Terms - Remote powering, inductive link, energy harvesting, implantable biosensors, lactate measurement.
Keywords - neural engineering, MEMS, BMI, neural interfaces
This paper focuses on the resource sharing problem in high-level synthesis. It argues that the conventionally accepted synthesis flow, in which resource sharing is done after scheduling, is sub-optimal because it cannot account for timing penalties from resource merging. The paper describes a competitive approach in which resource sharing and scheduling are performed simultaneously. It provides a quantitative evaluation of both approaches and shows that performing sharing during scheduling outperforms the conventional approach in terms of quality of results.
This paper describes a system-level approach to improve the latency of FPGA designs by performing optimization of the design specification on a functional level prior to high-level synthesis. The approach uses Taylor Expansion Diagrams (TEDs), a functional graph-based design representation, as a vehicle to optimize the dataflow graph (DFG) used as input to the subsequent synthesis. The optimization focuses on critical path compaction in the functional representation before translating it into a structural DFG representation. Our approach engages several passes of a traditional high-level synthesis (HLS) process in a simulated annealing-based loop to make efficient cost trade-offs. The algorithm is time efficient and can be used for fast design space exploration. The results indicate a latency performance improvement of 22% on average versus HLS with the initial DFG for a series of designs mapped to Altera Stratix II devices.
As the number of embedded applications increases,
companies are launching new platforms within short periods
of time to efficiently execute software with the lowest possible
energy consumption. However, for each new platform
deployment, new tool chains, with additional libraries,
debuggers and compilers must come along, breaking binary
compatibility. This strategy implies high hardware and
software redesign costs. In this scenario, we propose the
exploitation of Custom Reconfigurable Arrays for
Multiprocessor Systems (CReAMS). CReAMS is composed of
multiple adaptive reconfigurable processors that
simultaneously exploit Instruction and Thread Level
Parallelism. It works in a transparent fashion, so binary
compatibility is maintained, with no need to change the
software development process or environment. We also show
that CReAMS delivers higher performance per watt in
comparison to a 4-issue Superscalar processor, when the same
power budget is considered for both designs.
Keywords - reconfigurable system, multiprocessor, embedded systems
The fast evolving applications in modern digital signal processing have an increasing demand for components which have high computational power and energy efficiency without compromising the flexibility. Embedded FPGA, which is the customized FPGA with heterogeneous fine-grained application specific operations and routing resources, has shown significantly improved efficiency in terms of throughput, power dissipation and chip area for the target application domain. On the other hand, the complexity of such architecture makes it difficult to perform an efficient architecture exploration and application synthesis without tool support. In this work, we propose a framework for the design of embedded FPGA (eFPGA) architectures, which is extended from an existing framework for Coarse-Grained Reconfigurable Architectures (CGRAs). The framework is composed of a high-level modeling formalism for eFPGAs to explore the mapping space, and a retargetable application synthesis flow. To enable fast design space exploration, a force-directed placement algorithm is proposed. Finally, we demonstrate the efficacy of this framework with demanding application kernels.
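The force-directed placement idea mentioned in this abstract can be illustrated with a minimal sketch (not the paper's algorithm; the netlist and all names are hypothetical): each movable cell is iteratively moved to its zero-force point, i.e. the centroid of the cells it connects to.

```python
# Minimal force-directed placement sketch: each movable cell is iteratively
# moved to the centroid of the cells it connects to (its "zero-force" point).
def force_directed_place(positions, nets, fixed, iterations=50):
    """positions: dict cell -> (x, y); nets: list of cell lists;
    fixed: set of cells that must not move."""
    # Build adjacency: cell -> cells sharing a net with it
    neighbors = {c: [] for c in positions}
    for net in nets:
        for c in net:
            neighbors[c].extend(n for n in net if n != c)
    for _ in range(iterations):
        for cell, nbrs in neighbors.items():
            if cell in fixed or not nbrs:
                continue
            # Zero-force target: average position of connected cells
            x = sum(positions[n][0] for n in nbrs) / len(nbrs)
            y = sum(positions[n][1] for n in nbrs) / len(nbrs)
            positions[cell] = (x, y)
    return positions

# Two fixed pads at (0,0) and (4,0); a movable cell connected to both
# settles at their midpoint.
pos = force_directed_place({'a': (0, 0), 'b': (4, 0), 'c': (9, 9)},
                           nets=[['a', 'c'], ['b', 'c']],
                           fixed={'a', 'b'})
print(pos['c'])  # -> (2.0, 0.0)
```

A real placer would add overlap removal and legalization on top of this relaxation loop; the sketch only shows the force-directed core.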
Classical techniques for register allocation and binding require the definition of the program execution order, since a partial ordering relation between operations must be induced to perform liveness analysis through data-flow equations. In High Level Synthesis (HLS) flows this is commonly obtained through the scheduling task. For some HLS approaches, however, such a relation can be difficult to compute, or not statically computable at all, and adopting conventional register binding techniques, even when feasible, cannot guarantee maximum performance. To overcome these issues we introduce a novel scheduling-independent liveness analysis methodology, suitable for dynamic scheduling architectures. This liveness analysis is exploited in register binding using standard graph coloring techniques, and unlike other approaches it avoids the insertion of structural dependencies, which are otherwise introduced to prevent run-time resource conflicts in dynamic scheduling environments. The absence of additional dependencies avoids performance degradation and makes parallelism exploitation independent of the register binding task, while, as shown by the experimental results, not impacting area on average.
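To illustrate the register-binding step this abstract builds on, here is a toy sketch of interference-graph coloring over live ranges (a greedy variant with made-up live ranges, not the paper's scheduling-independent analysis):

```python
def bind_registers(live_ranges):
    """Greedy interference-graph coloring: values whose live ranges
    overlap cannot share a register. live_ranges: dict value -> (start, end)."""
    def interfere(a, b):
        sa, ea = live_ranges[a]
        sb, eb = live_ranges[b]
        return sa < eb and sb < ea  # half-open interval overlap

    binding = {}
    # Assign each value the smallest register index not already used
    # by an interfering, previously bound value.
    for v in sorted(live_ranges):
        used = {binding[u] for u in binding if interfere(u, v)}
        r = 0
        while r in used:
            r += 1
        binding[v] = r
    return binding

# t0 and t1 overlap; t2 starts after t0 dies and can reuse its register.
b = bind_registers({'t0': (0, 3), 't1': (1, 5), 't2': (3, 6)})
print(b)  # -> {'t0': 0, 't1': 1, 't2': 0}
```

The paper's contribution is computing the interference information without a fixed schedule; once an interference relation exists, a coloring pass like the one above performs the actual binding.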
While Coarse-grained Reconfigurable Architectures (CGRAs) are very efficient at handling regular, compute-intensive loops, their weakness at control-intensive processing and the need for frequent reconfiguration require another processor, for which usually a main processor is used. To minimize the overhead arising in such collaborative execution, we integrate a dedicated sequential processor (SP) with a reconfigurable array (RA), where the crucial problem is how to share the memory between the SP and the RA while keeping the SP's memory access latency very short. We present a detailed architecture, control, and program example of our approach, focusing on our optimized on-chip shared memory organization between the SP and the RA. Our preliminary results demonstrate that our optimized memory architecture is very effective in reducing kernel execution times (23.5% compared to a more straightforward alternative), and that our approach can significantly reduce the RA control overhead and other sequential code execution time in kernels, resulting in up to a 23.1% reduction in kernel execution time compared to the conventional system using the main processor for sequential code execution.
Predication is an essential technique to accelerate
kernels with control flow on CGRAs. While state-based full
predication (SFP) removes the wasteful power that conventional full
predication spends on issuing and decoding instructions,
generating code for SFP is challenging for general CGRAs,
especially when there are multiple conditionals to be handled
due to exploiting data level parallelism. In this paper, we present
a novel compiler framework addressing central issues such as
how to express the parallelism between multiple conditionals, and
how to allocate resources to them to maximize the parallelism.
In particular, by separating the handling of control flow and
data flow, our framework can be integrated with conventional
mapping algorithms for mapping data flow. Experimental results
demonstrate that our framework can find and exploit parallelism
between multiple conditionals, thereby leading to 2.21 times
higher performance on average than a naive approach.
Index Terms - CGRA; reconfigurable architecture; predication; predicated execution; conditionals; compilation
Energy efficiency of the underlying communication framework plays a major role in the performance of multicore systems. NoCs with buffer-less routing are gaining popularity due to simplicity in the router design, low power consumption, and load balancing capacity. With minimal number of buffers, deflection routers evenly distribute the traffic across links. In this paper, we propose an adaptive deflection router, DeBAR, that uses a minimal set of central buffers to accommodate a fraction of mis-routed flits. DeBAR incorporates a hybrid flit ejection mechanism that gives the effect of dual ejection with a single ejection port, an innovative adaptive routing algorithm, and a selective flit buffering based on flit marking. Our proposed router design reduces the average flit latency and the deflection rate, and improves the throughput with respect to the existing minimally buffered deflection routers without any change in the critical path.
Optical networks-on-chip (ONoCs) are currently still in the concept stage, and would benefit from explorative studies capable of bridging the gap between abstract analysis frameworks and the constraints and challenges posed by the physical layer. This paper aims to go beyond the traditional comparison of wavelength-routed ONoC topologies based only on their abstract properties, and for the first time assesses their physical implementation efficiency in a homogeneous experimental setting of practical relevance. As a result, the paper can demonstrate the significant and different deviation of topology layouts from their logic schemes under the effect of placement constraints on the target system. This then becomes the preliminary step for the accurate characterization of technology-specific metrics, such as the insertion-loss critical path, and for deriving the ultimate impact on the power efficiency and feasibility of each design.
Routing algorithms for NoCs have been extensively studied over the last 12 years, and proposals for algorithms targeting some cost function, such as latency reduction or congestion avoidance, abound in the literature. Fault-tolerant routing algorithms have also been proposed, with the table-based approach being the most widely adopted method. Considering SoCs with hundreds of cores in the near future, features such as scalability, reachability, and fault assumptions should be considered in fault-tolerant routing methods. However, current proposals have some limitations: (1) cost that increases with the NoC size, compromising scalability; (2) some healthy routers may not be reached even if a source-target path exists; (3) some algorithms restrict the number of faults and their locations in order to operate correctly. The present work presents a method, inspired by VLSI routing algorithms, to search for the path between source-target pairs while abstracting the network topology. Results present the routing path for different topologies (mesh, torus, Spidergon and Hierarchical-Spidergon) in the presence of faulty routers. The silicon area overhead and the total execution time of the path computation are small, demonstrating that the proposed method may be adopted in NoC designs.
While Networks-on-Chip (NoCs) have been increasing in popularity in industry and academia, they are threatened by the decreasing reliability of aggressively scaled transistors. In this paper, we address the problem of faulty elements by means of routing algorithms. Commonly, fault-tolerant algorithms are complex because they must support different fault models while preventing deadlock. When moving from a 2D to a 3D network, the complexity increases significantly due to the possibility of creating cycles within and between layers. In this paper, we take advantage of the Hamiltonian path to tolerate faults in the network. The presented approach is not only very simple but also able to tolerate almost all single faulty unidirectional links in 2D and 3D NoCs.
Advances in technology scaling increasingly make Network-on-Chips (NoCs) more susceptible to failures that cause various reliability challenges. With increasing area occupied by different on-chip memories, strategies for maintaining fault-tolerance of distributed on-chip memories become a major design challenge. We propose a system-level design methodology for scalable fault-tolerance of distributed on-chip memories in NoCs. We introduce a novel reliability clustering model for fault-tolerance analysis and shared redundancy management of on-chip memory blocks. We perform extensive design space exploration applying the proposed reliability clustering on a block-redundancy fault-tolerant scheme to evaluate the tradeoffs between reliability, performance, and overheads. Evaluations on a 64-core chip multiprocessor (CMP) with an 8x8 mesh NoC show that distinct strategies of our case study may yield up to 20% improvement in performance and 25% improvement in energy savings across different benchmarks, and uncover interesting design configurations.
Modern systems-on-a-chip are equipped with power architectures that allow the consumption of individual components or subsystems to be controlled. These mechanisms are governed by a power-management policy, often implemented in the embedded software with hardware support. Today's circuits have a significant static power consumption, which low-power designs must address with techniques like DVFS or power-gating. Correct and efficient management of these mechanisms is therefore becoming nontrivial. The effect of the power-management policy must be validated very early in the design cycle, as part of the architecture exploration activity. High-level models of the hardware must be annotated with consumption information. Temperature must also be taken into account, since leakage current increases exponentially with it. Existing annotation techniques applied to loosely-timed or temporally-decoupled models would create bad simulation artifacts in the temperature profile (e.g. unrealistic peaks). This paper addresses the instrumentation of a timed transaction-level model of the hardware with information on the power consumption of the individual components. It can cope not only with power-state models but also with Joule-per-bit traffic models, and it avoids simulation artifacts when used in a functional/power/temperature co-simulation.
Backend wearout mechanisms are major reliability
concerns for modern microprocessors. In this paper, a
framework which contains modules for backend time-dependent
dielectric breakdown (BTDDB), electromigration (EM), and
stress-induced voiding (SIV) is proposed to analyze circuit layout
geometries and interconnects to accurately estimate state-of-the-art
microprocessor lifetime due to each mechanism. Our
methodology incorporates the detailed electrical stress,
temperature, linewidth and cross-sectional areas of each
interconnect within the microprocessor system. We analyze
several layouts using our methodology and highlight the lifetime-limiting
wearout mechanisms, along with the reliability-critical
microprocessor functional units, using standard benchmarks.
Keywords - Wearout Mechanisms; Microprocessor; Reliability; EM; SIV; SM; TDDB; Aging
Success tree analysis is a well-known method to quantify the dependability features of many systems. This paper presents a system-level methodology to automatically generate a success tree from a given embedded system implementation and subsequently analyze its reliability based on a state-of-the-art Monte Carlo simulation. This enables the efficient analysis of transient as well as permanent faults while considering methods such as task and resource redundancy to compensate for them. As a case study, the proposed technique is compared with two analysis techniques successfully applied at system level: (1) a BDD-based reliability analysis technique and (2) a SAT-assisted approach, both suffering from exponential complexity in either space or time. Experimental results on an extensive test suite show that: (a) as opposed to the Success Tree (ST) and SAT-assisted approaches, the BDD-based approach is prone to exhausting the available memory during construction for moderate and large test cases; (b) the proposed ST technique is competitive with the SAT-assisted analysis in analysis speed and accuracy, while being the only technique that can also handle large and complex system implementations in which permanent and transient faults may occur concurrently.
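As an illustration of Monte Carlo evaluation of a success tree, the following toy sketch samples component survival and evaluates an AND/OR tree; the tree shape, encoding, and survival probabilities are all illustrative, not the paper's implementation.

```python
import random

# Success tree: the system is up if the root gate evaluates true over
# the sampled up/down states of its leaf components.
def evaluate(node, state):
    kind, arg = node
    if kind == 'leaf':
        return state[arg]
    vals = [evaluate(child, state) for child in arg]
    return all(vals) if kind == 'and' else any(vals)

def mc_reliability(tree, survival_prob, trials=100_000, seed=1):
    rng = random.Random(seed)
    up = 0
    for _ in range(trials):
        # Sample each component's survival independently
        state = {c: rng.random() < p for c, p in survival_prob.items()}
        up += evaluate(tree, state)
    return up / trials

# Processor AND (either of two redundant memories):
# analytic reliability R = Rp * (1 - (1 - Rm)^2) = 0.95 * 0.99 = 0.9405
tree = ('and', [('leaf', 'cpu'),
                ('or', [('leaf', 'mem1'), ('leaf', 'mem2')])])
r = mc_reliability(tree, {'cpu': 0.95, 'mem1': 0.9, 'mem2': 0.9})
print(round(r, 2))  # close to the analytic 0.9405
```

The OR gate here models the resource-redundancy compensation mentioned in the abstract: the system survives as long as at least one redundant memory does.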
This paper presents a novel modeling technique for multicore embedded systems, called
Hybrid Prototyping. The fundamental idea is to simulate a design with multiple cores
by creating an emulation kernel in software on top of a single physical instance of the core.
The emulation kernel switches between tasks mapped to different cores and manages the
logical simulation times of the individual cores. As a result, we can achieve fast and
cycle-accurate simulation of symmetric multicore designs, thereby overcoming the accuracy
concerns of virtual prototyping and the scalability issues of physical prototyping. Our
experiments with industrial multicore designs show that the simulation time with hybrid
prototyping grows only linearly with the number of cores and the inter-core communication
traffic, while providing 100% accuracy.
Keywords - Embedded systems; Validation; Multicore design; Virtual prototyping; FPGA prototyping
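The emulation-kernel idea can be sketched in a few lines (a hypothetical encoding; the paper's kernel is cycle-accurate and far more detailed): tasks mapped to logical cores are coroutines, and the kernel always resumes the least-advanced core so events are processed in logical-time order.

```python
import heapq

# Sketch of an emulation kernel: tasks mapped to logical cores run as
# generators yielding their consumed cycle counts; the kernel always
# resumes the core with the smallest logical time (conservative order).
def run_emulation(tasks):
    """tasks: dict core_id -> generator yielding cycle costs."""
    logical_time = {core: 0 for core in tasks}
    ready = [(0, core) for core in tasks]
    heapq.heapify(ready)
    trace = []
    while ready:
        t, core = heapq.heappop(ready)  # least-advanced core runs next
        try:
            cycles = next(tasks[core])
            logical_time[core] = t + cycles
            trace.append((core, logical_time[core]))
            heapq.heappush(ready, (logical_time[core], core))
        except StopIteration:
            pass  # task on this core has finished

    return trace

def task(costs):
    for c in costs:
        yield c

# Core 0 runs two 3-cycle chunks, core 1 three 2-cycle chunks;
# the kernel interleaves them in logical-time order.
trace = run_emulation({0: task([3, 3]), 1: task([2, 2, 2])})
print(trace)  # -> [(0, 3), (1, 2), (1, 4), (0, 6), (1, 6)]
```

Because all logical cores multiplex onto one physical instance, simulation cost grows with the total work and communication rather than exponentially with the core count, which is the scaling behavior the abstract reports.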
Shrinking transistor geometries, aggressive voltage scaling and higher operating frequencies have negatively impacted the dependability of embedded multicore systems. Most existing research on fault tolerance has focused on transient and permanent faults of cores. Intermittent faults are a separate class of defects resulting from on-chip temperature, pressure and voltage variations and lasting for a few cycles to several seconds or more. Operations of cores impacted by intermittent faults are suspended during these cycles, but the cores come back alive when conditions become favorable. This paper proposes a technique to model the availability of multiprocessor systems-on-chip (MPSoCs) with intermittent and repairable device defects. The model is based on a Markov chain with a stochastic fault distribution and can be applied even to permanent faults. Based on this model, a design space pruning technique is proposed to select a set of task mappings (with variable resource usage) that minimizes the task communication energy while satisfying the MPSoC availability constraint. Moreover, the task migration overhead is also minimized, which is an important consideration for frequently occurring intermittent and temperature-related faults, where prolonged system downtime during task re-mapping is not desired. Experiments conducted with real-life and synthetic application task graphs demonstrate that the proposed technique reduces communication energy by 30% and migration overhead by 50% compared to existing approaches.
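The Markov-chain availability idea can be sketched for a single core with three states: operational, intermittent fault (self-recovering), and under repair. All transition probabilities below are hypothetical per-time-step values chosen for illustration, not numbers from the paper.

```python
# Steady-state availability of a core modeled as a discrete-time Markov
# chain. States: OK (operational), INT (intermittent fault, self-recovers),
# REP (under repair). Per-step probabilities are illustrative only.
P = {
    'OK':  {'OK': 0.97, 'INT': 0.02, 'REP': 0.01},
    'INT': {'OK': 0.50, 'INT': 0.50, 'REP': 0.00},
    'REP': {'OK': 0.10, 'INT': 0.00, 'REP': 0.90},
}

def steady_state(P, steps=10_000):
    states = list(P)
    pi = {s: 1.0 / len(states) for s in states}  # uniform start
    for _ in range(steps):
        # One step of the chain: pi' = pi * P
        pi = {t: sum(pi[s] * P[s][t] for s in states) for t in states}
    return pi

pi = steady_state(P)
availability = pi['OK']  # long-run fraction of time the core is usable
print(round(availability, 3))  # -> 0.877 (solving the balance equations gives 1/1.14)
```

Solving the balance equations by hand confirms the fixed point: pi_INT = 0.04 pi_OK and pi_REP = 0.1 pi_OK, so pi_OK = 1 / 1.14 ≈ 0.877. A mapping-selection step such as the one in the paper would compare candidate mappings against an availability constraint derived from such a model.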
Mesh NoCs are the most widely-used fabric in high-performance many-core chips today. They are, however, becoming increasingly power-constrained with the higher on-chip bandwidth requirements of high-performance SoCs. In particular, the physical datapath of a mesh NoC consumes significant energy. Low-swing signaling circuit techniques can substantially reduce the NoC datapath energy, but existing low-swing circuits involve huge area footprints, unreliable signaling or considerable system overheads such as an additional supply voltage, so embedding them into a mesh datapath is not attractive. In this paper, we propose a novel low-swing signaling circuit, a self-resetting logic repeater (SRLR), to meet these design challenges. The SRLR enables single-ended low-swing pulses to be asynchronously repeated, and hence, consumes less energy than differential, clocked low-swing signaling. To mitigate global process variations while delivering high energy efficiency, three circuit techniques are incorporated. Fabricated in 45nm SOI CMOS, our 10mm SRLR-based low-swing datapath achieves 6.83Gb/s/μm bandwidth density with 40.4fJ/bit/mm energy at 4.1Gb/s data rate at 0.8V.
A 3D reconfigurable power switch network is introduced to optimally provide demand-supply matching between on-chip multi-output power converters and many-core microprocessors. For effective DVFS power management of many cores by area-efficient on-chip power converters, the reconfigurable power switch network supports space- and time-multiplexed access between power converters and cores. An integer linear program is used to find a space-time multiplexing configuration that matches supply and demand with balanced utilization. The overall power management system is verified in SystemC-AMS based models. Experimental results show that the proposed design achieves 35.36% power saving on average compared to a design without the proposed power management.
The increased power densities of deep submicron process technologies have made on-chip temperature a critical design issue for high-performance integrated circuits. In this paper, we address the datapath merging problem faced during the design of coarse-grained reconfigurable processors from a thermal-aware perspective. Assuming a reconfigurable processor able to execute a sequence of datapath configurations, we formulate and efficiently solve the thermal-aware datapath merging problem as a minimum cost network flow. In addition, we integrate floorplan awareness of the underlying reconfigurable processor into the merging decision to account also for the effects of heat diffusion. Extensive experimentation regarding different configuration scenarios, technology nodes and clock frequencies showed that the adoption of the proposed thermal-aware methodology delivers up to 8.27K peak temperature reductions and achieves better temperature flattening in comparison to a low power but thermal-unaware approach.
This paper presents an efficient algorithm for the placement of power supply pads in flip-chip packaging for high-performance VLSI circuits. The placement problem is formulated as a mixed-integer linear program (MILP), subject to the constraints on mean-time-to-failure (MTTF) for the pads and the voltage drop in the power grid. To improve the performance of the optimizer, the pad placement problem is solved based on the divide-and-conquer principle, and the locality properties of the power grid are exploited by modeling the distant nodes and sources coarsely, following the coarsening stage in a multigrid-like approach. An accurate electromigration (EM) model that captures current crowding and Joule heating effects is developed and integrated with our C4 placement approach. The effectiveness of the proposed approach is demonstrated on several designs adapted from publicly released benchmarks.
The floating random walk (FRW) algorithm is an important field-solver algorithm for capacitance extraction, which has several merits compared with other boundary element method (BEM) based algorithms. In this paper, the FRW algorithm is accelerated with modern graphics processing units (GPUs). We propose an iterative GPU-based FRW algorithm flow and a technique using an inverse cumulative probability array (ICPA) to reduce the divergence among walks and the global-memory accesses. A variant FRW scheme is proposed to utilize the benefit of the ICPA, so that it accelerates the extraction of multi-dielectric structures. A technique for extracting multiple nets concurrently is also discussed. Numerical results show that our GPU-based FRW brings over 20X speedup for various test cases with a 0.5% convergence criterion over the CPU counterpart. For the extraction of multiple nets, our GPU-based FRW outperforms the CPU counterpart by up to 59X.
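The floating-random-walk principle can be sketched with a walk-on-spheres estimate of the electrostatic potential inside a unit square (a toy Laplace problem, not the paper's GPU extractor; the boundary conditions are illustrative): each walk repeatedly jumps to the surface of the largest empty circle around the current point until it is absorbed at the boundary.

```python
import math
import random

# Walk-on-spheres sketch of the floating random walk idea: the potential
# at a point equals the average over random walks that jump to the surface
# of the largest empty circle until they are absorbed at the boundary.
def potential(x, y, walks=20_000, eps=1e-3, seed=7):
    """Unit square; left edge held at 1 V, the other three edges at 0 V."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        px, py = x, y
        while True:
            # Radius of the largest circle centered here inside the square
            r = min(px, 1 - px, py, 1 - py)
            if r < eps:  # close enough to a boundary: absorb the walk
                total += 1.0 if px < eps else 0.0
                break
            theta = rng.uniform(0, 2 * math.pi)
            px += r * math.cos(theta)  # jump uniformly onto the circle
            py += r * math.sin(theta)
    return total / walks

# By symmetry, the exact potential at the center is 0.25 V.
v = potential(0.5, 0.5)
print(round(v, 2))
```

Capacitance extraction applies the same estimator to Gaussian-surface integrals around conductors; the GPU work in the paper is about running many such independent walks in parallel with low divergence.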
Jitter measurement is an essential part for testing high speed digital I/O and
clock distribution networks. Precise jitter characterization of signals at critical
internal nodes provides valuable information for hardware fault diagnosis and next
generation design. Recently, incoherent undersampling has been proposed as a low-cost
solution for signal integrity characterization at high data rate. Incoherent undersampling
drastically reduces the sampling rate compared to Nyquist rate sampling without relying on
the availability of a data synchronous clock. In this paper, we propose a jitter decomposition
and characterization method based on incoherent undersampling. Associated fundamental period
estimation techniques along with properties of incoherent undersampling, are used to isolate
the effects of periodic jitter and crosstalk jitter. Mathematical analysis and hardware
experiments using commercial off-the-shelf components are performed to prove the viability
of the proposed method.
Keywords - Incoherent Undersampling; Jitter Separation; Periodic Jitter; Bounded Uncorrelated Jitter; Crosstalk Jitter
In 3D VLSI, through-silicon vias (TSVs) are relatively large and closely spaced. This results in a situation in which noise on one or more TSVs may deteriorate the delay and signal integrity of neighboring TSVs. In this paper, we first quantify the parasitics in contemporary TSVs, and then come up with a classification of crosstalk sequences as 0C, 1C, ..., 8C sequences. Next, we present inductive approaches to quantify the exact overhead of 8C, 6C and 4C crosstalk avoidance codes (CACs) for a 3xn mesh arrangement of TSVs. These overheads for different CACs for a 3xn mesh arrangement of TSVs are used to calculate lower bounds on the corresponding overheads for an nxn mesh arrangement of TSVs. We also discuss an efficient way to implement the coding and decoding (CODEC) circuitry for limiting the maximum crosstalk to 6C. Our experimental results show that, for a TSV mesh arrangement driven by inverters implemented in a 22nm technology, the coding-based approaches yield improvements which are in line with the theoretical predictions.
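The 0C..8C classification can be illustrated with a simplified model that counts, for a victim TSV in a 3x3 neighborhood, how many of its 8 neighbors switch in the opposite direction while the victim switches. This is only a toy approximation of the idea; the paper's exact class definition may differ (e.g. in how diagonal versus orthogonal neighbors are weighted).

```python
# Illustrative classifier for TSV crosstalk classes (0C..8C): count the
# neighbors of the center (victim) TSV that switch opposite to it.
def crosstalk_class(prev, curr):
    """prev, curr: 3x3 grids of bits; the victim is the center TSV."""
    dv = curr[1][1] - prev[1][1]
    if dv == 0:
        return 0  # quiet victim: no switching-induced coupling counted here
    k = 0
    for i in range(3):
        for j in range(3):
            if (i, j) == (1, 1):
                continue
            dn = curr[i][j] - prev[i][j]
            if dn == -dv:  # neighbor switches opposite to the victim
                k += 1
    return k

prev = [[0, 0, 0],
        [0, 0, 0],
        [1, 1, 1]]
curr = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]  # victim rises while the bottom row falls
print(crosstalk_class(prev, curr))  # -> 3, i.e. a 3C transition
```

A crosstalk avoidance code then restricts the set of legal codeword transitions so that no transition ever exceeds the target class (e.g. 6C), trading extra TSVs for bounded coupling noise.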
Realizable power grid reduction is becoming key to the efficient design and verification of today's large-scale power delivery networks (PDNs). Existing state-of-the-art realizable reduction techniques for interconnect circuits, such as the TICER algorithm, are not well suited for effective power grid reduction, since reducing mesh-structured power grids with TICER's nodal elimination scheme may introduce an excessive number of new edges in the reduced grids, which can be even harder to solve than the original grid due to the drastically increased sparse matrix density. In this work, we present a novel geometric-template-based reduction technique for reducing large-scale flip-chip power grids. Our method first creates a geometric template according to the original power grid topology and then performs novel iterative grid corrections to improve the accuracy by matching the electrical behaviors of the reduced template grid with the original grid. Our experimental results show that the proposed reduction method can reduce industrial power grid designs by up to 95% with very satisfactory solution quality.
Transistor aging due to bias temperature instability (BTI) is a major reliability concern in sub-32nm technology. Aging decreases the performance of digital circuits over the entire IC lifetime. To compensate for aging, designs now typically apply adaptive voltage scaling (AVS) to mitigate performance degradation by elevating the supply voltage. Varying the supply voltage of a circuit using AVS also causes the BTI degradation to vary over the lifetime. This presents a new challenge for margin reduction in the conventional signoff methodology, which characterizes timing libraries based on transistor models with pre-calculated BTI degradations for a given IC lifetime. Many works have separately addressed predictive models of BTI and the analysis of AVS, but there is no published work that considers BTI-aware signoff accounting for the use of AVS during the IC lifetime. This motivates us to study how the presence of AVS should affect aging-aware signoff. In this paper, we first simulate and analyze circuit performance degradation due to BTI in the presence of AVS. Based on our observations, we propose a rule-of-thumb for chip designers to characterize an aging-derated standard-cell timing library that accounts for the impact of AVS. According to our experimental results, this aging-aware signoff approach avoids both overestimation and underestimation of aging - either of which results in a power or area penalty - in AVS-enabled systems.
Efficient analysis of massive on-chip power delivery networks is among the most challenging problems facing the EDA industry today. Due to Joule heating effect and the temperature dependence of resistivity, temperature is one of the most important factors that affect IR drop and must be taken into account in power grid analysis. However, the sheer size of modern power delivery networks (comprising several thousands or millions of nodes) usually forces designers to neglect thermal effects during IR drop analysis in order to simplify and accelerate simulation. As a result, the absence of accurate estimates of Joule heating effect on IR drop analysis introduces significant uncertainty in the evaluation of circuit functionality. This work presents a new approach for fast electrical-thermal co-simulation of large-scale power grids found in contemporary nanometer-scale ICs. A state-of-the-art iterative method is combined with an efficient and extremely parallel preconditioning mechanism, which enables harnessing the computational resources of massively parallel architectures, such as graphics processing units (GPUs). Experimental results demonstrate that the proposed method achieves a speedup of 66.1X for a 3.1M-node design over a state-of-the-art direct method and a speedup of 22.2X for a 20.9M-node design over a state-of-the-art iterative method when GPUs are utilized.
This paper proposes a new model of functional units for variation-induced timing
errors due to PVT variations and device Aging (PVTA). The model takes into account
PVTA parameter variations, clock frequency, and the physical details of Placed-and-Routed
(P&R) functional units in 45nm TSMC analysis flow. Using this model and PVTA monitoring
circuits, we propose Hierarchically Focused Guardbanding (HFG) as a method to adaptively
mitigate PVTA variations. We demonstrate the effectiveness of HFG on GPU architecture at
two granularities of observation and adaptation: (i) fine-grained instruction-level; and
(ii) coarse-grained kernel-level. Using coarse-grained PVTA monitors with kernel-level
adaptation, the throughput increases by 70% on average. By comparison, the instruction-by-instruction
monitoring and adaptation enhances throughput by a factor of 1.8x-2.1x depending on the
configuration of PVTA monitors and the type of instructions executed in the kernels.
Keywords - adaptive guardbanding; PVT variation; aging; GPU;
In this paper, we propose a framework that automatically generates a power network for a given placed design and verifies, with a commercial tool, that the network has no IR drop or Electro-Migration (EM) violations. Our framework integrates synthesis, optimization and analysis of the power network. A deterministic method is proposed to decide the number and location of power stripes based on clustering analysis. After an initial power network is synthesized, we build a sensitivity matrix Gs, which captures the correlation between updates in stripe resistance and nodal voltage. An optimization scheme based on Sequential Linear Programming (SLP) is applied to iteratively adjust the power network to satisfy a given IR drop constraint. The proposed framework constantly updates the voltage distribution in response to incremental changes in the power network. To accurately capture the voltage distribution on a given chip, our power network model includes every existing power stripe and via resistance on each layer. Experimental results demonstrate that our power network analysis can accurately capture the voltage distribution on a given chip and effectively minimize the power network area. The proposed methodology is evaluated on two real designs in TSMC 90nm and UMC 90nm technology, respectively, and achieves a 9%-32% reduction in power network area compared with the results from a modern commercial PG synthesizer.
In this paper, we propose a power density mitigation algorithm for the post-placement stage. Our framework first identifies clusters of bins with high temperature, then propagates power density away from the high-temperature region by balancing regional power density. The balancing problem is modeled as a supply-demand problem and solved with minimal displacement of cells. An analytical temperature profiling algorithm is tightly integrated within the framework to constantly update the temperature profile in response to incremental perturbations of the placement. Our approach effectively reduces the maximum temperature compared to previous works on temperature mitigation.
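The supply-demand view of power-density balancing can be sketched in one dimension: bins over a density cap push their excess to the nearest bins with headroom, preferring small displacement. This greedy `balance_power` helper is an invented simplification of the paper's cell-movement formulation:

```python
def balance_power(density, cap):
    """Greedy 1-D supply-demand balancing: bins above `cap` push excess
    power to the nearest bins with headroom (toy model; real placers
    move cells, not raw power values)."""
    d = list(density)
    n = len(d)
    for i in range(n):
        while d[i] > cap + 1e-12:
            # find the nearest bin with headroom (smallest displacement)
            best, best_dist = None, None
            for j in range(n):
                if d[j] < cap - 1e-12:
                    dist = abs(i - j)
                    if best is None or dist < best_dist:
                        best, best_dist = j, dist
            if best is None:
                break  # no headroom anywhere: cannot balance further
            move = min(d[i] - cap, cap - d[best])
            d[i] -= move
            d[best] += move
    return d

out = balance_power([5.0, 1.0, 1.0, 5.0], cap=3.0)
```

Total power is conserved; only its spatial distribution changes, mirroring the fact that placement perturbation cannot remove power, only spread it.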
Smooth approximations to half-perimeter wirelength are being investigated actively because of the recent increase in interest in analytical placement. It is necessary not just to provide smooth approximations but also to provide error analysis and convergence properties of these approximations. We present a new approximation scheme that uses a non-recursive approximation to the max function, and we establish its convergence properties and error bounds. The accuracy of our proposed scheme is better than those of the popular Logarithm-Sum-Exponential (LSE) wirelength model and the recently proposed Weighted Average (WA) wirelength model. We also experimentally validate the comparison using global and detailed placements produced by NTU Placer on the ISPD 2004 benchmark suite. The experiments on these benchmarks confirm that the error bounds of our model are lower, with an average of 4% error in total wirelength.
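The two baseline models compared against here are standard in the placement literature; a minimal one-dimensional sketch of both (the `gamma` smoothing parameter trades smoothness for accuracy):

```python
import math

def hpwl_1d(xs):
    """Exact one-dimensional half-perimeter wirelength: max - min."""
    return max(xs) - min(xs)

def lse_1d(xs, gamma):
    """Log-Sum-Exp smooth approximation; overestimates HPWL, with error
    bounded by 2 * gamma * log(len(xs))."""
    return gamma * math.log(sum(math.exp(x / gamma) for x in xs)) \
         + gamma * math.log(sum(math.exp(-x / gamma) for x in xs))

def wa_1d(xs, gamma):
    """Weighted-Average smooth approximation; underestimates HPWL."""
    e_pos = [math.exp(x / gamma) for x in xs]
    e_neg = [math.exp(-x / gamma) for x in xs]
    smax = sum(x * w for x, w in zip(xs, e_pos)) / sum(e_pos)
    smin = sum(x * w for x, w in zip(xs, e_neg)) / sum(e_neg)
    return smax - smin

pins = [0.0, 3.0, 7.0, 10.0]
exact = hpwl_1d(pins)   # 10.0
```

Both models converge to the exact HPWL as `gamma` shrinks, but from opposite sides, which is why error-bound analysis of the kind described above matters for placement convergence.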
The growing variability and complexity of advanced
CMOS technologies makes the physical design of clocked logic in
large Systems-on-Chip more and more challenging.
Asynchronous logic has been studied for many years and has become
an attractive solution for a broad range of applications, from
massively parallel multi-media systems to systems with ultra-low
power & low-noise constraints, like cryptography, energy
autonomous systems, and sensor-network nodes. The objective of
this embedded tutorial is to give a comprehensive and recent
overview of asynchronous logic. The tutorial will cover the basic
principles and advantages of asynchronous logic, some insights
on new research challenges, and will present the GALS scheme as
an intermediate design style with recent results in asynchronous
Network-on-Chip for future Many Core architectures. Regarding
industrial acceptance, recent asynchronous logic applications
within the microelectronics industry will be presented, with a
main focus on the commercial CAD tools available today.
Keywords - asynchronous design, handshake circuits, GALS, CAD flow
The complex interactions between electric mobility on a large scale and the electric distribution grid constitute a considerable challenge regarding the feasibility, the efficiency and the stability of smart electric distribution grids. On the one hand, the steadily increasing share of decentralized power generation from renewable sources entails a move away from electro-mechanical generators with huge inertia towards systems with distributed small- and medium-scale generators which are coupled to the grid via inverters. On the other hand, large-scale electric mobility which interacts with such a decentralized grid will have a huge impact on the power generation, storage potential and consumption patterns of a grid. Grid infrastructure simulations which take into account the details of these interactions and which are backed by comprehensive demonstrators may help to shed light on crucial aspects of both energy and information exchange between the traffic and the electric energy infrastructure regime. Selected topics will illustrate the scope and the challenges inherent in this area of simulation.
The stochastic nature of renewable energy sources
will no doubt place strain upon the electrical distribution
networks as power generation is converted to environmentally
friendly methods. The use of energy storage technologies could
significantly improve the usability of these energy sources. A
domestic installation, based on a 4 kWh energy storage unit, is
under development and modeling shows that the proposed unit
would improve the energy autonomy of a household.
Keywords - Battery energy storage, photo-voltaics, smart grid.
This paper discusses novel communication network topologies and components and
describes an evolutionary path of bringing Ethernet into automotive applications
with focus on electric mobility. For next generation in-vehicle networking, the
automotive industry identified Ethernet as a promising candidate besides CAN and
FlexRay. Ethernet is an IEEE standard and is broadly used in consumer and industry
domains. It will bring a number of changes for the design and management of in-vehicle
networks and provides significant re-use of components, software, and tools. Ethernet
is intended to connect high-speed sub-systems inside the vehicle, such as Advanced
Driver Assistance Systems (ADAS), navigation and positioning, multimedia,
and connectivity systems. For hybrid (HEVs) or electric vehicles (EVs), Ethernet will
be a powerful part of the communication architecture layer that enables the link between
the vehicle electronics and the Internet where the vehicle is a part of a typical Internet
of Things (IoT) application. Using Ethernet for vehicle connectivity will effectively manage
the huge amount of data to be transferred between the outside world and the vehicle through
vehicle-to-x (V2V and V2I or V2I+I) communication systems and cloud-based services for
advanced energy management solutions. Ethernet is an enabling technology for introducing
advanced features into the automotive domain and needs further optimizations in terms of
scalability, cost, power, and electrical robustness in order to be adopted and widely used by the industry.
Keywords - Ethernet; automotive; electric vehicle; smart grid; EV communication architecture; domain based communication; in-vehicle networking; vehicle network topology
This paper provides an overview of facts and trends in the introduction of the connected electric
vehicle (EV) and discusses how and to what extent electric mobility will be integrated into the
Internet of Energy (IoE) and Smart grid infrastructure to provide novel energy management solutions.
In this context the EVs are evolving from mere transportation media to advanced mobile connectivity ecosystem platforms.
Keywords - electric vehicle; Internet of Energy; in-vehicle communication; telematics;connected vehicle
This paper provides an overview of the introduction of electric vehicles (EVs) and discusses how electric
mobility will influence the developments in automotive industry by integrating the EVs into the Internet
of Energy (IoE) and Smart grid infrastructure by providing novel business models and requiring new semiconductor
devices and modules. In this context the EVs are evolving from mere transportation media to advanced mobile
connectivity ecosystem platforms.
Keywords - electric vehicle; Internet of Energy; in-vehicle communication; telematics;connected vehicle
This paper provides an overview of the latest
developments in semiconductor devices for the
implementation of electronic modules for EVs and HEVs and the
implementation of charging stations and the interface with the
smart grid infrastructure. The design choices are influenced by
the power level of the different applications.
Keywords - electric vehicle; Internet of Energy; semiconductor technologies; MOS; IGBT;
In this paper we outline a novel way to 1) predict the revenue associated with a wafer, 2) maximize the projected revenue through unconventional yield enhancement techniques, and 3) produce dice from the same mask that may have different performances and selling prices. Unlike speed binning, such heterogeneity is intentional by design. To achieve these goals we overturn the traditional concepts of redundancy and present a novel design flow for yield enhancement called "Reduced Redundancy Insertion", where spares can potentially have less area and lower performance than their parents. We develop a model for the revenue associated with the new design methodology that integrates system configuration and leverages yield, area and performance. The primary metric used in this model is termed "Expected Performance per Area", which is a measure that can be reliably estimated for different system architectures, and can be maximized by using algorithms proposed in this paper. We present theoretical models and case studies that characterize our designs, and experimental results that validate our prediction. We show that using Reduced Redundancy can improve wafer revenue by 10-30%.
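The "Expected Performance per Area" idea can be illustrated with a toy Poisson yield model, comparing a full-size spare against a reduced one. All functions and numbers here are invented assumptions, not the paper's model:

```python
import math

def yield_poisson(area, d0):
    """Poisson yield model: probability a block of `area` is defect-free."""
    return math.exp(-d0 * area)

def epa_no_spare(area, perf, d0):
    """Expected Performance per Area without redundancy."""
    return yield_poisson(area, d0) * perf / area

def epa_with_spare(area, a_spare, perf, perf_spare, d0):
    """With one spare: full perf if the primary works, degraded perf if
    only the (possibly smaller and slower) spare works. Defects are
    assumed independent -- a toy simplification."""
    y_main = yield_poisson(area, d0)
    y_spare = yield_poisson(a_spare, d0)
    expected_perf = y_main * perf + (1 - y_main) * y_spare * perf_spare
    return expected_perf / (area + a_spare)

d0 = 1.0  # defects per unit area (high, to make redundancy pay off)
base = epa_no_spare(1.0, 1.0, d0)
full = epa_with_spare(1.0, 1.0, 1.0, 1.0, d0)     # classic full-size spare
reduced = epa_with_spare(1.0, 0.5, 1.0, 0.8, d0)  # smaller, slower spare
```

At this defect density the reduced spare beats both the full spare and no redundancy, mirroring the abstract's argument that spares need not match their parents.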
State-of-the-art reliability optimization schemes deploy spatial or temporal redundancy for the complete functionality. This introduces significant performance/area overhead, which is often prohibitive within the stringent design constraints of embedded systems. This paper presents a novel scheme for selective software reliability optimization under a user-provided tolerable performance overhead constraint. To enable this scheme, statistical models for quantifying software resilience and error masking properties at function and instruction level are proposed. These models enable a whole new range of reliability optimizations. Given a tolerable performance overhead, our scheme selectively protects the reliability-wise most important instructions based on their masking probability, vulnerability, and redundancy overhead. Compared to the state-of-the-art, our scheme provides a 4.84X reliability improvement at a 50% tolerable performance overhead constraint.
In this paper, we approach embedded systems design from a new angle that considers not only quality of service but also security as part of the design process. Moreover, we also take into consideration the dynamic aspect of modern embedded systems in which the number and nature of active tasks are variable during run-time. In this context, providing both high quality of service and guaranteeing the required level of security becomes a difficult problem. Therefore, we propose a novel secure embedded systems design framework that efficiently solves the problem of runtime quality optimization with security constraints. Experiments demonstrate the efficiency of our proposed techniques.
In this paper we present a novel approach for mapping interconnected software components onto cores of homogenous MPSoC architectures. The analytic mapping process considers shared memory communication as well as the routing algorithm controlling packet-based communication. The software components are mapped with the constraints of avoiding communication conflicts as well as access conflicts to shared memory resources. The core of the elaborated approach consists of an algorithm for software mapping which is inspired by force-directed scheduling from high-level synthesis. Experimental results show that the presented approach increases the overall system performance by 22% while reducing the average communication latency by 35%. For presenting the major advantages of the developed solution, we optimized an advanced driver assistance system on the Tilera TILEPro64 processor.
The advantages of moving from 2-Dimensional Networks-on-Chip (NoCs) to 3-Dimensional NoCs for any application must be justified by the improvements in performance, power, latency, and overall system cost, especially the cost of Through-Silicon Vias (TSVs). The trade-off between the number of TSVs and 3D NoC system performance has become one of the most critical design issues. In this paper, we present a fast and optimized task allocation method for 3D NoC based many-core systems with low vertical link density (TSV count); in comparison to classic methods such as the Genetic Algorithm (GA) and Simulated Annealing (SA), our method saves a considerable amount of design time. We take several state-of-the-art benchmarks and the generic scalable pseudo application (GSPA) with different network scales to simulate the designs achieved by our method; compared to the designs achieved by the GA and SA methods, our technique achieves better performance and lower cost. All experiments have been done in the GSNOC framework (written in SystemC-RTL), which offers cycle accuracy and good flexibility.
Modern System-on-Chip (SoC) design relies heavily on efficient interconnects like Networks-on-Chip (NoCs). They provide an effective, flexible and cost efficient way of communication exchange between the individual processing elements of the SoC. Therefore, the choice of topology and design of the NoC itself plays a crucial role in the performance of the system. Depending on the field of application, standard topologies like meshes, fat-trees, and tori might be suboptimal in terms of power consumption, latency and area. This calls for a custom topology design methodology, which is based on the requirements imposed by the application, function and the use-cases of the SoC in question. This work proposes a fast approach, which uses spectral clustering and cluster ensembles to partition the system using normalized cuts and insert the necessary routers. Then, by using delay-constrained minimum spanning trees, links between the individual routers are created, such that any present latency constraints are satisfied at minimum cost. Results from applying the methodology to a smartphone SoC are presented.
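The normalized-cut partitioning step can be sketched with a plain two-way spectral bipartition. The full methodology uses cluster ensembles and delay-constrained spanning trees; this `spectral_bipartition` helper shows only the core clustering idea:

```python
import numpy as np

def spectral_bipartition(W):
    """Two-way normalized-cut partition: split by the sign of the Fiedler
    vector of the symmetric normalized Laplacian
    L = I - D^(-1/2) W D^(-1/2)."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - d_inv_sqrt @ W @ d_inv_sqrt
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = vecs[:, 1]             # eigenvector of 2nd-smallest eigenvalue
    return fiedler >= 0

# Two 3-node communication cliques joined by one weak link:
# the normalized cut should separate them at the weak link.
W = np.array([
    [0, 5, 5, 1, 0, 0],
    [5, 0, 5, 0, 0, 0],
    [5, 5, 0, 0, 0, 0],
    [1, 0, 0, 0, 5, 5],
    [0, 0, 0, 5, 0, 5],
    [0, 0, 0, 5, 5, 0],
], dtype=float)
part = spectral_bipartition(W)
```

On this toy adjacency matrix the sign split recovers the two clusters, which would then each receive a router in the topology-construction step.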
This paper presents the first parameterized, SPICE-compatible compact model of a Graphene Nano-Ribbon Field-Effect Transistor (GNRFET) with doped reservoirs that also supports process variation. The current and charge models closely match numerical TCAD simulations. In addition, process variation in transistor dimension, edge roughness, and doping level in the reservoir are accurately modeled. Our model provides a means to analyze delay and power of graphene-based circuits under process variation, and offers design and fabrication insights for graphene circuits in the future. We show that edge roughness severely degrades the advantages of GNRFET circuits; however, GNRFET is still a good candidate for low-power applications.
Nanomagnet Logic (NML) is an emerging device architecture that performs logic operations through fringing field interactions between nano-scale magnets. The design space for NML circuits is large and so far there exists no systematic approach for determining the parameter values (e.g., device-to-device spacings, clocking field strength etc.) to generate a predictable design solution. This paper presents a formal methodology for designing NML circuits that marshals the design parameters to generate a layout that is guaranteed to evolve correctly in time at 0K. The approach is further augmented to identify functional design targets when considering thermal noise associated with higher temperatures. The approach is applied to identify layouts for a 2-input AND gate, a "corner turn," and a 3-input majority gate. Layouts are verified through simulations both at 0K and room temperature (300K).
Crossbar-based architectures are promising for future nanoelectronic systems. However, due to the inherent unreliability of nanoscale devices, the implementation of any logic function relies on aggressive defect-tolerant schemes applied at the post-manufacturing stage. Most such defect-tolerant approaches explore mapping choices between logic variables/products and crossbar vertical/horizontal wires. In this paper, we develop a new approach, namely fine-grained logic hardening, based on the idea of adding redundancy to a logic function so as to boost the success rate of logic implementation. We propose an analytical framework to evaluate and fine-tune the amount and location of redundancy to be added for a given logic function. Furthermore, we devise a method to optimally harden the logic function so as to maximize the defect tolerance capability. Simulation results show that the proposed logic hardening scheme boosts defect tolerance significantly in terms of yield improvement, compared to mapping-only schemes with the same hardware cost.
Power consumption has become one of the primary challenges in meeting Moore's law. Fortunately, the Single-Electron Transistor (SET) at room temperature has been demonstrated as a promising device for extending Moore's law due to its ultra-low power consumption during operation. An automated mapping approach for the SET architecture has been proposed recently for facilitating design realization. In this paper, we propose an enhanced approach consisting of variable reordering, product term reordering, and mapping constraint relaxation techniques to minimize the area of mapped SET arrays. The experimental results show that our enhanced approach, on average, saves 40% in area and 17% in mapping time compared to the state-of-the-art approach on a set of MCNC and IWLS 2005 benchmarks.
This paper proposes a non-volatile cache architecture utilizing a novel DRAM / MRAM cell-level hybrid structured memory (D-MRAM) that enables effective power reduction for high-performance mobile SoCs without area overhead. The key to reducing active power is an intermittent refresh process in DRAM mode. D-MRAM reduces static power consumption compared to conventional SRAM because there are no static leakage paths in the D-MRAM cell and no supply voltage is needed when cells are used in MRAM mode. Moreover, with advanced perpendicular magnetic tunnel junctions (p-MTJ), which decrease write energy and latency without shortening retention time, D-MRAM can reduce power by replacing traditional SRAM caches. In a 65-nm CMOS technology, the access latencies of a 1MB memory macro are 2.2 ns / 1.5 ns for read / write in DRAM mode and 2.2 ns / 4.5 ns in MRAM mode, while those of SRAM are 1.17 ns. The SPEC CPU2006 benchmarks reveal that the energy per instruction (EPI) of the total cache memory can be reduced dramatically, by 71% on average, while the instructions per cycle (IPC) of the D-MRAM cache architecture degrades by only approximately 4% on average in spite of its latency overhead.
Circuit reliability in the presence of variability is a major concern for SRAM designers. With memory sizes ever increasing, Monte Carlo simulations have become too time-consuming for margining and yield evaluation. In addition, dynamic write-ability metrics have an advantage over static metrics because they take timing constraints into account; however, these metrics are much more expensive in terms of runtime. Statistical blockade is one method that reduces the number of simulations by filtering out non-tail samples, but the total number of simulations required still remains relatively large. In this paper, we present a method that uses sensitivity analysis to provide a total speedup of ~112X compared with recursive statistical blockade, with only a 3% average loss in accuracy. In addition, we show how this method can be used to calculate dynamic VMIN and to evaluate several write-assist methods.
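The blockade idea, filtering Monte Carlo samples with a cheap pre-screen so that only likely tail points receive expensive simulation, can be sketched as follows. The metrics and thresholds here are invented; real statistical blockade trains a classifier on an initial sample batch:

```python
import random

def expensive_sim(x):
    """Stand-in for a costly circuit simulation (invented metric)."""
    return x * x + 0.1 * x

def cheap_proxy(x):
    """Inexpensive surrogate used to pre-screen Monte Carlo samples."""
    return x * x

def blockade_tail(samples, tail_q=0.99, guard=0.97):
    """Blockade-style filtering: simulate only samples whose proxy value
    lies above a guard-banded threshold, then read off the population
    tail quantile from the simulated subset."""
    proxies = sorted(cheap_proxy(x) for x in samples)
    cut = proxies[int(guard * len(samples))]
    candidates = [x for x in samples if cheap_proxy(x) >= cut]
    sims = sorted(expensive_sim(x) for x in candidates)
    k = int(tail_q * len(samples))   # population rank of the quantile
    return sims[k - (len(samples) - len(sims))], len(candidates)

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(10_000)]
est, n_sims = blockade_tail(xs)
brute = sorted(expensive_sim(x) for x in xs)[9900]  # simulate everything
```

Here the filter prunes about 97% of the expensive simulations yet recovers the same 99th-percentile estimate, because the guard band keeps every true tail point among the candidates.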
Spin-based memories are promising candidates for future on-chip memories due to their high density, non-volatility, and very low leakage. However, the high energy and latency of write operations in these memories is a major challenge. In this work, we explore a new approach - shift based write - that offers a fast and energy-efficient alternative to performing writes in spin-based memories. We propose DWM-TAPESTRI, a new all-spin cache design that utilizes Domain Wall Memory (DWM) with shift based writes at all levels of the cache hierarchy. The proposed write scheme enables DWM to be used, for the first time, in L1 caches and in tag arrays, where the inefficiency of writes in spin memories has traditionally precluded their use. At the circuit level, we propose bit-cell designs utilizing shift-based writes, which are tailored to the differing requirements of different levels in the cache hierarchy. We also propose pre-shifting as an architectural technique to hide the latency of shift operations that is inherent to DWM. We performed a systematic device-circuit-architecture evaluation of the proposed design. Over a wide range of SPEC 2006 benchmarks, DWM-TAPESTRI achieves an 8.2X improvement in energy and a 4X improvement in area, with virtually identical performance, compared to an iso-capacity SRAM cache. Compared to an iso-capacity STT-MRAM cache, the proposed design achieves around 1.6X improvement in both area and energy under iso-performance conditions.
Although intentional clock skew can be
utilized to reduce the clock period, its application in gated
clock designs has not been well studied. A gated clock
design includes both data paths and clock control paths, but
conventional clock skew scheduling focuses only on data
paths. Based on that observation, in this paper, we propose
an approach to perform the co-synthesis of data paths and
clock control paths in a nonzero skew gated clock design.
Our objective is to minimize the required inserted delay for
working with the lower bound of the clock period (under
clocking constraints of both data paths and clock control
paths). Different from previous works, our approach can
guarantee no clocking constraint violation in the presence of
clock gating. Experimental results show our approach can
effectively enhance the circuit speed with almost no penalty
on the power consumption.
Keywords - Clock Period Minimization, Delay Insertion, Clock Gating, Data Path Synthesis.
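Skew feasibility at a candidate clock period reduces to a system of difference constraints, checkable with Bellman-Ford; a binary search over the period then finds the minimum. This is the standard zero-clock-gating formulation, not the paper's co-synthesis algorithm, and the delay numbers are invented:

```python
def feasible(T, n, setup, hold):
    """Difference-constraint feasibility via Bellman-Ford.
    setup: (i, j, d_max) -> s_j - s_i <= T - d_max
    hold:  (i, j, d_min) -> s_i - s_j <= d_min   (hold time taken as 0)
    Feasible iff the constraint graph has no negative cycle."""
    edges = [(i, j, T - dmax) for i, j, dmax in setup] + \
            [(j, i, dmin) for i, j, dmin in hold]
    dist = [0.0] * n
    for _ in range(n + 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return True
    return False  # still relaxing after n+1 passes: negative cycle

def min_period(n, setup, hold, lo=0.0, hi=100.0, eps=1e-6):
    """Binary search for the smallest feasible clock period."""
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if feasible(mid, n, setup, hold):
            hi = mid
        else:
            lo = mid
    return hi

# Two registers in a loop with max path delays 6 and 2: zero skew needs
# T >= 6, but skew scheduling balances the loop to (6 + 2) / 2 = 4.
setup = [(0, 1, 6.0), (1, 0, 2.0)]
hold = [(0, 1, 2.0), (1, 0, 1.0)]   # min-delays keep hold met at T = 4
T = min_period(2, setup, hold)
```

The paper's contribution is handling the additional clock-control-path constraints introduced by gating, which this sketch omits.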
In this paper we propose a flexible slack budgeting
approach for post-placement multi-bit flip-flop (MBFF) merging.
Our approach considers existing wiring topology and flip-flop
delay changes for achieving more accurate slack budgeting.
Besides, we propose a slack-to-length conversion approach that
translates timing slack into equivalent wire length to
simplify the merging process. We also develop a merging
method to evaluate our slack budgeting approach. Our slack
budgeting and MBFF merging programs are fully integrated into
an industrial design flow. Experimental results show that our
approach on average achieves 3.4% area saving, 50% clock tree
power saving, and 5.3% total power saving.
Keywords - Multi-bit flip-flop; slack budgeting; low power
A methodology to optimize the area of a fixed non-slicing floorplan is presented in this paper. The areas of transistors, capacitors, and resistors are formulated as convex functions, and the total area is minimized by solving a sequence of convex problems. The methodology remains practical even with many components and variants. Moreover, symmetry constraints are satisfied during optimization.
This paper presents an agile hierarchical synthesis framework for analog circuits. To address the limitations of a given analog circuit topology, this hierarchical synthesis work proposes a performance exploration technique and a non-uniform-step simulation process. Beyond spec-targeted designs, the proposed approach can search for solutions better than designers' expectations. A parallel genetic algorithm (PAGE) method is employed for performance exploration. Unlike other evolution-based topology explorations, this is the first method that regards performance constraints as the input genome for evolution and resolves the multi-objective problem with a multiple-population feature. Populations of selected performance points are transferred to device variables by a re-targeting technique. Based on a normalization of the device variable distribution, a probabilistic stochastic simulation significantly reduces the convergence time to find the global optimum of circuit performance. Experimental results on a radio-frequency distributed amplifier (RFDA) and a folded-cascode operational amplifier (Op-Amp) in different technologies show that our approach obtains better runtime and higher quality in analog synthesis.
Discrete gate sizing has attracted a lot of attention recently as the EDA industry faces the challenge of optimizing large standard-cell-based circuits. The discreteness of the problem, along with complex timing models, stringent constraints, and ever-increasing circuit sizes, makes the problem very difficult to tackle. Lagrangian Relaxation is an effective technique for handling complex constrained optimization problems and has therefore been used for gate sizing. In this paper, we propose an improved Lagrangian Relaxation formulation for leakage power minimization that accounts for maximum gate input slew and maximum gate output capacitance in addition to the circuit timing constraints. We also present a fast topological greedy heuristic to solve the Lagrangian Relaxation subproblem and a complementary procedure to fix the few remaining slew and capacitance violations. The experimental results, generated using the ISPD 2012 Discrete Gate Sizing Contest infrastructure, show that our technique is able to optimize a circuit with up to 959K gates within only 51 minutes. Compared to the ISPD Contest top three teams, our technique obtained on average 18.9%, 16.7%, and 43.8% less leakage power, while being 38, 31, and 39 times faster.
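The heart of the Lagrangian Relaxation decomposition is that, for fixed multipliers, each gate can pick its size independently. A toy sketch: the option tables and chain structure are invented, and the paper's heuristic works topologically on a full timing graph rather than gate by gate in isolation:

```python
def lr_size(gates, lam):
    """Lagrangian-relaxation subproblem for a chain of gates: with a
    fixed multiplier `lam` on delay, each gate independently picks the
    option minimizing leakage + lam * delay."""
    total_leak = total_delay = 0.0
    choices = []
    for options in gates:                  # options: [(leakage, delay), ...]
        leak, delay = min(options, key=lambda o: o[0] + lam * o[1])
        choices.append((leak, delay))
        total_leak += leak
        total_delay += delay
    return total_leak, total_delay, choices

# Three gates, each with a small / medium / large variant.
gates = [[(1.0, 5.0), (2.0, 3.0), (4.0, 2.0)]] * 3

leak_lo, delay_lo, _ = lr_size(gates, lam=0.1)   # power-driven choice
leak_hi, delay_hi, _ = lr_size(gates, lam=10.0)  # timing-driven choice
```

Sweeping `lam` traces the leakage/delay trade-off; the outer LR loop adjusts the multipliers until the timing constraints are met at minimum leakage.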
Accurate estimators of key design metrics (power, area, delay, etc.) are increasingly required to achieve IC cost reductions through optimizations spanning system level to physical layout. At the same time, identifying physical or analytical models of design metrics has become very challenging due to interactions among many parameters that span technology, architecture and implementation. Metamodeling techniques can simplify this problem by deriving surrogate models from samples of actual implementation data. However, the use of metamodeling techniques in IC design estimation is still in its infancy, and practitioners need a more systematic understanding. In this work, we study the accuracy of metamodeling techniques across several axes: (1) low- and high-dimensional estimation problems, (2) sampling strategies, (3) sample sizes, and (4) accuracy metrics. To help obtain more general conclusions, we study these axes for three very distinct chip design estimation problems: (1) area and power of networks-on-chip routers, (2) delay and output slew of standard cells under power delivery network noise, and (3) wirelength and buffer area of clock trees. Our results show that (1) adaptive sampling can effectively reduce the sample size required to derive surrogate models by up to 64% (or, increase estimation accuracy by up to 77%) compared with Latin hypercube sampling; (2) for low-dimensional problems, Gaussian process-based models can be 1.5x more accurate than tree-based models, whereas for high-dimensional problems, tree-based models can be up to 6x more accurate than Gaussian process-based models; and (3) a variant of weighted surrogate modeling, which we call hybrid surrogate modeling, can improve estimation accuracy by up to 3x. Finally, to aid architects, design teams, and CAD developers in selection of the appropriate metamodeling techniques, we propose guidelines based on the insights gained from our studies.
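Latin hypercube sampling, the baseline sampling strategy in the comparison above, stratifies each dimension so that every stratum is sampled exactly once; a minimal sketch:

```python
import random

def latin_hypercube(n_samples, n_dims, seed=0):
    """Latin hypercube sample in [0, 1)^d: each dimension's range is
    split into n_samples equal strata, and each stratum is used exactly
    once per dimension (stratum order is randomized independently)."""
    rng = random.Random(seed)
    points = [[0.0] * n_dims for _ in range(n_samples)]
    for d in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)
        for i in range(n_samples):
            # place the point uniformly within its assigned stratum
            points[i][d] = (strata[i] + rng.random()) / n_samples
    return points

pts = latin_hypercube(8, 3)
```

Adaptive sampling, by contrast, places later samples where the surrogate's uncertainty or error is largest, which is what yields the sample-size reductions reported above.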
This paper presents a new flexible quadratic and partitioning-based global placement approach which is able to optimize a wide class of objective functions, including linear, sub-quadratic, and quadratic net lengths, as well as positive linear combinations of them. Based on iteratively re-weighted quadratic optimization, our algorithm extends previous linearization techniques. If l is the length of some connection, most placement algorithms try to optimize l1 or l2. We show that optimizing lp with 1 < p < 2 helps to improve even linear connection lengths. With this new objective, our new version of the flow-based partitioning placement tool BonnPlace is able to outperform the state-of-the-art force-directed algorithms SimPL, RQL, and ComPLx, and closes the gap to MAPLE in terms of (linear) HPWL.
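Optimizing lp lengths with 1 < p < 2 by iteratively re-weighted quadratic optimization can be sketched in one dimension: each pass solves a weighted least-squares problem whose weights come from the previous solution. This is a toy version of the general idea, not BonnPlace's algorithm:

```python
def lp_point(anchors, p=1.5, iters=100, eps=1e-8):
    """Find x minimizing sum(|x - a|^p) by iteratively re-weighted least
    squares: each pass is a weighted quadratic problem with weights
    |x - a|^(p - 2) taken from the previous iterate (1-D toy)."""
    x = sum(anchors) / len(anchors)        # the p = 2 (quadratic) optimum
    for _ in range(iters):
        # clamp tiny distances to avoid division blow-up at an anchor
        w = [abs(x - a) ** (p - 2) if abs(x - a) > eps else eps ** (p - 2)
             for a in anchors]
        x = sum(wi * a for wi, a in zip(w, anchors)) / sum(w)
    return x

anchors = [0.0, 1.0, 10.0]
mean = sum(anchors) / len(anchors)
```

As p moves from 2 toward 1 the optimum shifts from the mean toward the median, which is why sub-quadratic objectives improve linear net lengths even though each inner solve stays quadratic.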
For the last several technology generations, VLSI designs in new technology nodes have had to confront the challenges associated with reduced scaling in wire delays. The solution from industrial back-end-of-line process has been to add more and more thick metal layers to the wiring stacks. However, existing physical synthesis tools are usually not effective in handling these new thick layers for design closure. To fully leverage these degrees of freedom, it is essential for the design flow to provide better communication among the timer, the router, and different optimization engines. This work proposes a new algorithm, CATALYST, to perform congestion- and timing-aware layer directive assignment. Our flow balances routing resources among metal stacks so that designs benefit from the availability of thick metal layers by achieving improved timing and buffer usage reduction while maintaining routability. Experiments demonstrate the effectiveness of the proposed algorithm.
The need to use feedback to devise context-dependent and workload-aware strategies for runtime power and thermal management (PTM) in high-end and mobile processors has been advocated since the early 2000s. Two seminal papers that appeared in 2002 defined a framework for the use of feedback mechanisms for power and temperature control. In the first, the focus was on power management, with the goal of extending battery life on the AMD Mobile Athlon. This was one of the earliest papers to use DVFS settings as actuators to guarantee a given energy level in the battery at the end of a given time interval. The controller was implemented using a combination of OS files and Linux kernel modules. Almost simultaneously, the second paper posed the dynamic thermal management task as a formal control-theoretic problem requiring thermal modeling of the processor and the use of established control structures from classical feedback theory. Its defining features include the development of layout-based thermal RC models for the processor; the use of an architecturally driven control mechanism, namely the instruction fetching rate; and the use of the SPEC2000 benchmarks to illustrate temperature control action under various workloads. The controller is a Proportional-Integral-Derivative (PID) structure whose input is the deviation of the sensed temperature from the target temperature and whose output is the toggle rate of the instruction fetching mechanism.
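The closed loop described here, a PID controller driving the fetch toggle rate against a thermal RC model, can be sketched as follows. All constants are invented toy values, not those of the cited work:

```python
def simulate_pid_dtm(t_target=80.0, t_ambient=45.0, steps=600,
                     kp=0.05, ki=0.01, kd=0.02, dt=0.1):
    """PID dynamic thermal management on a one-node thermal RC model.
    The actuator is the instruction-fetch toggle rate in [0, 1], which
    scales processor power. All parameters are illustrative."""
    r_th, c_th = 0.5, 2.0          # thermal resistance (K/W), capacitance (J/K)
    p_max = 120.0                  # power at full fetch rate (W)
    temp, integ, prev_err = t_ambient, 0.0, 0.0
    for _ in range(steps):
        err = t_target - temp
        integ += err * dt
        deriv = (err - prev_err) / dt
        prev_err = err
        # PID output clamped to a physical toggle rate in [0, 1]
        rate = min(1.0, max(0.0, kp * err + ki * integ + kd * deriv))
        # one-node RC plant: C * dT/dt = P - (T - T_amb) / R
        temp += dt * (rate * p_max - (temp - t_ambient) / r_th) / c_th
    return temp

final = simulate_pid_dtm()
```

The integral term settles the fetch rate at the value whose steady-state power holds the die at the target temperature, regardless of the (here constant) ambient.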