12.4 Design and Optimization for Low-Power Applications


Date: Thursday, March 28, 2019
Time: 16:00 - 17:30
Location / Room: Room 4

Alberto Nannarelli, DTU, DK, Contact Alberto Nannarelli

Paolo Amato, Micron, IT, Contact Paolo Amato

This session explores low-power design from different points of view, from neural-network-based scheduling of multicores and image processing, to ultra-low power for near-threshold computing and continuous-monitoring IoT sensors.

Ann Gordon-Ross, University of Florida, US
Ayobami Edun, Ruben Vazquez, Ann Gordon-Ross and Greg Stitt, University of Florida, US
Heterogeneous multicore systems help meet design goals by using different architectural components that are suitable for different application needs. The individual cores may also have different tunable architectural parameters for additional specialization. However, this creates a challenge in mapping applications to cores that contain the best configuration based on an application's needs. This decision can be made by performing a sample run of the application on each core type and configuration, or by using heuristics to explore the design space; however, for complex systems, these methods may be infeasible. In this paper, we present a methodology for dynamic scheduling of applications on heterogeneous multicore systems using predictive methods for reduced energy consumption. We use an artificial neural network (ANN) to train our predictive model using hardware counters in the system. The trained network can then predict the best configuration. Our scheduler uses this prediction to schedule the application to the best core (the core that offers the best configuration) and configures that core to the best configuration. If the best core is busy, alternative idle cores are considered for scheduling, or the application is stalled. This decision is made based on which option meets the energy advantage considerations. Our experiments show that the total energy of a system can be reduced by 28% on average as compared to a system that uses the same fixed cache configuration for all cores.
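The flow described in the abstract — a trained network maps hardware-counter readings to a predicted best configuration, and the scheduler then places the application on an idle core offering that configuration or stalls — can be sketched as follows. Everything here (counter set, network shape, weights, the core/configuration encoding) is a hypothetical illustration, not the paper's actual model.

```python
import numpy as np

# Hypothetical sketch of the counter-based predictive scheduler:
# a small feedforward network scores each core configuration from
# hardware-counter readings; the scheduler then picks an idle core
# in (or reconfigured to) the best-scoring configuration.

rng = np.random.default_rng(0)

N_COUNTERS = 4   # e.g. cache misses, IPC, branch mispredicts, stalls (assumed)
N_CONFIGS = 3    # number of tunable core configurations (assumed)

# Toy "pre-trained" weights; the paper trains on sample runs instead.
W1 = rng.normal(size=(N_COUNTERS, 8))
W2 = rng.normal(size=(8, N_CONFIGS))

def predict_best_config(counters):
    """Return the index of the predicted lowest-energy configuration."""
    h = np.maximum(counters @ W1, 0.0)   # ReLU hidden layer
    scores = h @ W2                      # lower score = lower predicted energy
    return int(np.argmin(scores))

def schedule(counters, idle_cores, core_config):
    """Map an application to a core; core_config[i] is core i's setting.

    Returns (core, stalled): stalled=True means no core was acceptable.
    """
    best = predict_best_config(counters)
    # Prefer an idle core already in the best configuration.
    for c in idle_cores:
        if core_config[c] == best:
            return c, False
    # Otherwise reconfigure some idle core (energy advantage assumed to hold).
    if idle_cores:
        return idle_cores[0], False
    return None, True  # stall until the best core frees up

core, stalled = schedule(np.array([0.2, 1.1, 0.05, 0.4]),
                         idle_cores=[1, 2], core_config={1: 0, 2: 2})
```

In the paper, the stall-versus-reconfigure decision is itself driven by an energy-advantage comparison; the unconditional reconfiguration above is a simplification.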
Sami Salamin, Karlsruhe Institute of Technology (KIT), DE
Sami Salamin, Hussam Amrouch and Joerg Henkel, Karlsruhe Institute of Technology, DE
Near-Threshold Computing (NTC) has recently emerged as an attractive paradigm as it allows devices to operate close to their optimal energy point (OEP). This work demonstrates, for the first time, that determining where the OEP of a processor lies is challenging because standard cells, forming the processor's netlist, unevenly profit w.r.t. power and also unevenly degrade w.r.t. delay when the voltage approaches the near-threshold region. To precisely explore, at design time, where the OEP is, we create voltage-aware cell libraries that enable designers to seamlessly employ the standard tool flows, even though they were not designed for that purpose, to perform voltage-aware timing and power analysis. Besides determining where the OEP is, we also demonstrate how providing logic synthesis tool flows with voltage-aware cell libraries results in a 35% higher performance at NTC. In addition, we investigate how the performance loss at NTC can be compensated through parallelized computing, demonstrating, for the first time, that the OEP moves far from NTC as the number of cores increases. Our proposed methodology enables designers to jointly select the maximum number of cores along with the optimal operating voltage such that a specific power budget is fulfilled. Finally, we show how voltage-aware design for parallelized NTC provides a [40%-50%] performance increase compared to traditional (i.e., voltage-unaware design) parallelized NTC.
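The notion of an optimal energy point can be illustrated with a toy energy-per-operation model: dynamic energy falls with the square of the supply voltage, while leakage energy grows as the clock period stretches near threshold, so their sum has a minimum. The constants and the alpha-power-style delay law below are illustrative assumptions, not the paper's characterized data.

```python
import numpy as np

# Illustrative energy-per-operation model (not the paper's data):
# E_total(V) = E_dynamic(V) + E_leakage(V), minimized at the OEP.

V = np.linspace(0.3, 1.0, 200)                 # supply-voltage sweep (V)
VTH = 0.35                                     # assumed threshold voltage

E_dyn = V ** 2                                 # normalized C * V^2
delay = 1.0 / np.maximum(V - VTH, 0.05) ** 1.5 # alpha-power-law-style delay
E_leak = 0.02 * V * delay                      # leakage power * cycle time

E_total = E_dyn + E_leak
oep_voltage = float(V[np.argmin(E_total)])     # lands in the NT region
```

The paper's point is that real cells deviate unevenly from such a single aggregate curve, which is why voltage-aware libraries are needed to locate the OEP accurately.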
Maxime Feyerick, ESAT-MICAS, KU Leuven, BE
Maxime Feyerick, Jaro De Roose and Marian Verhelst, KU Leuven, BE
A standard cell library targeting always-on operation at 1 kHz is designed at circuit level. This paper proposes a design methodology to achieve robust operation with minimum energy. Such minimum energy per operation for always-on systems is achieved by one specific combination of supply and threshold voltage Vth. As Vth is discrete in a practical bulk technology, this minimum can, however, not be achieved through simple voltage tuning. In the considered 90 nm CMOS technology, Vth is too low, resulting in leakage-dominated systems and preventing the system from attaining the minimum energy point in subthreshold. Three circuit techniques are optimally combined to fight leakage: stacking, reverse body biasing, and optimal transistor dimensioning relying on second-order effects of the dimensions on Vth. They jointly allow logic gates to achieve the best balance between dynamic and leakage power. Moreover, the paper presents modified flip-flop topologies that also reliably operate at 0.27 V along with the gates. Benefits of the improved logic gates and flip-flops are demonstrated on a small always-on feature-extraction system calculating running average and variance on a 1 Ksample/s data stream. The resulting system consumes 162 pW in simulation: two orders of magnitude less than a commercial library at its 1 V nominal voltage, and one order of magnitude less than the commercial library at the same 0.27 V operating voltage.
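The demonstrator's workload, a running average and variance over a sample stream, can be sketched in software with a standard online update (Welford's algorithm). The paper's hardware datapath almost certainly uses its own fixed-point formulation; this is only a functional reference for what the feature extractor computes.

```python
# Functional sketch of the always-on feature extraction: running mean
# and variance over a sample stream, via Welford's online update.

def running_stats(stream):
    """Yield (mean, variance) after each incoming sample."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        yield mean, m2 / n  # population variance

stats = list(running_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
final_mean, final_var = stats[-1]  # mean 5.0, variance 4.0
```

The single-pass update matters for always-on hardware: each sample is consumed once, with constant state (n, mean, m2), so no sample buffer is needed at the 1 Ksample/s rate.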
Antonio Cipolletta, Politecnico di Torino, IT
Valentino Peluso1, Antonio Cipolletta1, Andrea Calimera1, Matteo Poggi2, Fabio Tosi2 and Stefano Mattoccia2
1Politecnico di Torino, IT; 2Università di Bologna, IT
This work deals with the implementation of energy-efficient monocular depth estimation using a low-cost CPU for low-power embedded systems. The paper first describes the PyD-Net depth estimation network, which consists of a lightweight CNN able to approach state-of-the-art accuracy with ultra-low resource usage. Then it proposes an accuracy-driven complexity reduction strategy based on a hardware-friendly fixed-point quantization. Finally, it introduces the low-level optimization enabling effective use of integer neural kernels. The objective is threefold: (i) prove the efficiency of the new quantization flow on a depth estimation network, that is, the capability of retaining the accuracy reached by floating-point arithmetic using 16- and 8-bit integers, (ii) demonstrate the portability of the quantized model onto a general-purpose 32-bit RISC architecture of the ARM Cortex family, (iii) quantify the accuracy-energy tradeoff of unsupervised monocular estimation to establish its use in the embedded domain. The experiments have been run on a Raspberry Pi board powered by a Broadcom BCM2837 chipset. A parametric analysis conducted over the KITTI dataset shows marginal accuracy loss with 16-bit (8-bit) integers and energy savings up to 6.55x (9.23x) w.r.t. floating-point. Compared to high-end CPUs and GPUs, the proposed solution improves scalability.
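The core idea of mapping floating-point weights and activations to 16- or 8-bit integers can be sketched with a simple symmetric fixed-point scheme. The scale derivation and rounding policy below are generic assumptions for illustration; the paper's accuracy-driven flow selects quantization parameters per layer against a depth-accuracy target.

```python
import numpy as np

# Minimal sketch of symmetric fixed-point quantization to signed
# integers, as used conceptually for 16-/8-bit integer neural kernels.

def quantize(x, bits):
    """Quantize array x to signed `bits`-bit integers plus a scale factor."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8-bit
    scale = np.max(np.abs(x)) / qmax      # per-tensor scale (assumed scheme)
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate floating-point tensor."""
    return q.astype(np.float32) * scale

w = np.array([0.75, -0.5, 0.124, 0.01], dtype=np.float32)
q8, s8 = quantize(w, 8)    # integer weights for 8-bit kernels
w8 = dequantize(q8, s8)    # reconstruction error bounded by scale/2
```

With per-tensor symmetric scales like this, the multiply-accumulate inner loops run entirely in integer arithmetic, which is what enables the NEON-style integer kernels on the 32-bit ARM Cortex target.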
17:30 End of session