10.3 Design Experiences for Multimedia and Communication Applications

Time

Label

Presentation Title
Authors

11:00

10.3.1

ENABLING THE HETEROGENEOUS ACCELERATOR MODEL ON ULTRA-LOW POWER MICROCONTROLLER PLATFORMS
Speaker:
Francesco Conti, Università di Bologna, IT
Authors:
Francesco Conti¹, Daniele Palossi², Andrea Marongiu¹, Davide Rossi¹ and Luca Benini¹
¹Università di Bologna, IT; ²ETH Zurich, CH
Abstract
The stringent power constraints of complex microcontroller-based devices (e.g., smart sensors for the IoT) represent an obstacle to the introduction of sophisticated functionality. Programmable accelerators would be extremely beneficial in providing the flexibility and energy efficiency required by fast-evolving IoT applications; however, the integration complexity and sub-10 mW power budgets have been considered insurmountable obstacles so far. In this paper we demonstrate the feasibility of coupling a low-power microcontroller unit (MCU) with a heterogeneous programmable accelerator to speed up computation-intensive algorithms within an ultra-low power (ULP) sub-10 mW budget. Specifically, we develop a heterogeneous architecture coupling a Cortex-M series MCU with PULP, a programmable accelerator for ULP parallel computing. Complex functionality is enabled by support for offloading parallel computational kernels from the MCU to the accelerator using the OpenMP programming model. We prototype this platform using an STM Nucleo board and a PULP FPGA emulator. We show that our methodology can deliver up to 60x gains in performance and energy efficiency on a diverse set of applications, opening the way for a new class of ULP heterogeneous architectures.

11:30

10.3.2

THERMAL OPTIMIZATION USING ADAPTIVE APPROXIMATE COMPUTING FOR VIDEO CODING
Speaker:
Muhammad Shafique, Karlsruhe Institute of Technology (KIT), DE
Authors:
Daniel Palomino¹, Muhammad Shafique², Altamiro Susin¹ and Jörg Henkel²
¹Universidade Federal do Rio Grande do Sul (UFRGS), BR; ²Karlsruhe Institute of Technology (KIT), DE
Abstract
This paper presents a thermal optimization technique that adaptively employs varying degrees of approximation at both the algorithm and data levels in order to reduce the temperature associated with the High Efficiency Video Coding process while maintaining good quality results. The technique evaluates, at run-time, the regions of a video sequence, frame by frame, in terms of their tolerance to imprecise computations. It adapts the amount of approximation error based on the video sequence properties and application-specific knowledge. The proposed technique adaptively controls the strength of approximations (at both the algorithm and data levels) depending upon the varying resilience properties of coding different regions with different texture/motion properties. Our content-driven approximate computing technique demonstrates the potential to improve the thermal profile of a chip. Experimental results show that our technique reduces the on-chip temperature by about 10 °C on average, while maintaining good quality results.

12:00

10.3.3

HIGH PERFORMANCE TIME-OF-FLIGHT AND COLOR SENSOR FUSION WITH IMAGE-GUIDED DEPTH SUPER RESOLUTION
Speaker:
Hannes Plank, Infineon Technologies Austria AG, AT
Authors:
Hannes Plank, Gerald Holweg, Thomas Herndl and Norbert Druml, Infineon Technologies Austria AG, AT
Abstract
In recent years, depth sensing systems have gained popularity and have begun to appear on the consumer market. Of these systems, PMD-based Time-of-Flight cameras are the smallest available and will soon be integrated into mobile devices such as smartphones and tablets. Like all other available depth sensing systems, PMD-based Time-of-Flight cameras do not produce perfect depth data. Because of the sensor's characteristics, the data is noisy and the resolution is limited. Fast movements cause motion artifacts, which are undefined depth values due to corrupted measurements. Combining the data of a Time-of-Flight camera and a color camera can compensate for these flaws and vastly improve depth image quality. This work uses color edge information as a guide so that the depth image is upscaled with resolution gain and lossless noise reduction. A novel depth upscaling method is introduced, combining the creation of high quality depth data with fast execution. A high-end smartphone development board, a color camera, and a Time-of-Flight camera are used to create a sensor fusion prototype. The complete processing pipeline is efficiently implemented on the graphics processing unit in order to maximize performance. The prototype proves the feasibility of our proposed fusion method on mobile devices. The result is a system capable of fusing color and depth data at interactive frame rates. When depth information is available for every color pixel, new possibilities in computer vision, augmented reality and computational photography arise. The evaluation shows that our sensor fusion solution provides depth images with upscaled resolution, increased sharpness, less noise, and fewer motion artifacts, while achieving high frame rates at the same time, thus significantly outperforming state-of-the-art solutions.

12:15

10.3.4

SATURATED MIN-SUM DECODING: AN "AFTERBURNER" FOR LDPC DECODER HARDWARE
Speaker:
Stefan Scholl, University of Kaiserslautern, DE
Authors:
Stefan Scholl, Philipp Schläfer and Norbert Wehn, University of Kaiserslautern, DE
Abstract
LDPC codes are usually decoded by iterative belief propagation. However, especially for small block lengths, conventional belief propagation exhibits significant losses in signal-to-noise ratio compared to maximum likelihood decoding. In this paper we propose combining a conventional min-sum decoder with an advanced decoding scheme that acts as a kind of "afterburner" to improve the frame error rate. We present hardware architectures and implementation results for a 28 nm ASIC technology. The new decoder has a slightly higher complexity, but provides a gain of up to 1.6 dB in signal-to-noise ratio over conventional belief propagation decoding for short block lengths. In addition, we show that the new decoder implementation can decrease the amount of dark silicon.
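As background for the conventional min-sum decoder mentioned in the abstract, the following is a minimal sketch of the standard min-sum check-node update only. It is not the saturated "afterburner" scheme proposed in the paper, whose details are not given here. The function name `check_node_update` and the use of plain Python lists of log-likelihood ratios (LLRs) are assumptions made for illustration.

```python
from typing import List


def check_node_update(llrs: List[float]) -> List[float]:
    """Standard min-sum check-node update (illustrative, hypothetical helper).

    For each edge i, the outgoing message is the product of the signs of all
    OTHER incoming LLRs times the minimum magnitude among those other LLRs.
    Requires at least two incoming messages.
    """
    if len(llrs) < 2:
        raise ValueError("a check node needs at least two incoming messages")

    outgoing = []
    for i in range(len(llrs)):
        others = llrs[:i] + llrs[i + 1:]
        sign = 1.0
        for value in others:
            if value < 0:
                sign = -sign
        outgoing.append(sign * min(abs(value) for value in others))
    return outgoing
```

A hardware decoder would typically track only the two smallest magnitudes and the overall sign parity instead of recomputing the minimum per edge; the quadratic loop above is kept for readability.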

12:30

IP5-3, 196

A DYNAMICALLY RECONFIGURABLE ECC DECODER ARCHITECTURE
Speaker:
Philippe Coussy, Universite Bretagne Sud / Lab-STICC, FR
Authors:
Awais Sani¹, Philippe Coussy² and Cyrille Chavet³
¹Universite de Bretagne-Sud, FR; ²Universite de Bretagne-Sud / Lab-STICC, FR; ³Lab-STICC / Université de Bretagne Sud, FR
Abstract
Due to their impressive error correction performance, Error Correcting Codes (ECC) are now widely used in communication systems. To meet high throughput requirements, ECC decoders are based on parallel architectures, which results in a major issue: memory access conflicts. In this paper, we introduce a new class of ECC decoder architectures that reconfigure dynamically by executing a memory mapping approach on-chip. For that purpose, a dedicated algorithm that takes network constraints into account is presented. A smart architecture based on a butterfly network and a reconfiguration unit is also proposed. Experimental results show that real-time reconfiguration at reasonable hardware cost is possible.

12:31

IP5-4, 530

RESISTIVE BLOOM FILTERS: FROM APPROXIMATE MEMBERSHIP TO APPROXIMATE COMPUTING WITH BOUNDED ERRORS
Speaker:
Abbas Rahimi, University of California, Berkeley, US
Authors:
Vahideh Akhlaghi¹, Abbas Rahimi² and Rajesh K. Gupta¹
¹University of California, San Diego, US; ²University of California, Berkeley, US
Abstract
Approximate computing provides an opportunity to exploit application characteristics to trade accuracy for gains in energy efficiency. However, such an opportunity must be able to bound the error that the system designer provides to the application developer. A space-efficient probabilistic data structure such as a Bloom filter can provide one such means. A Bloom filter supports approximate set membership queries with a tunable rate of false positives (i.e., errors) and no false negatives. We propose a resistive Bloom filter (ReBF) to approximate a function by tightly integrating it with a functional unit (FU) implementing the function. The ReBF approximately mimics partial functionality of the FU by recalling its frequent input patterns for computational reuse. The accuracy of the target FU is guaranteed by bounding the ReBF error behavior at design time. We further lower the energy consumption of an FU by designing its ReBF using low-power memristor arrays. The experimental results show that function approximation using ReBF for five image processing kernels running on the AMD Southern Islands GPU yields on average 24.1% energy saving in 45 nm technology compared to the exact computation.
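The set-membership property the abstract relies on (possible false positives, never false negatives) can be illustrated with an ordinary software Bloom filter. This is a minimal sketch and not the resistive, memristor-based ReBF described in the paper; the class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, and the SHA-256-based hashing are all assumptions chosen for illustration.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Illustrative software Bloom filter (hypothetical, not the paper's ReBF)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real design packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item: bytes) -> Iterator[int]:
        # Derive num_hashes bit positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(seed.to_bytes(4, "little") + item).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: bytes) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))
```

Increasing `num_bits` relative to the number of stored items lowers the false positive rate, which is the tunable knob the abstract refers to; an item that was added is always reported as present.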

12:32

IP5-5, 353

REAL-TIME SYSTEM-LEVEL IMPLEMENTATION OF A TELEPRESENCE ROBOT USING AN EMBEDDED GPU PLATFORM
Speaker:
Swathi Gurumani, Advanced Digital Sciences Center, SG
Authors:
Muhammad Teguh Satria¹, Swathi Gurumani¹, Wang Zheng², Keng Peng Tee², Augustine Koh¹, Pan Yu², Kyle Rupnow¹ and Deming Chen³
¹Advanced Digital Sciences Center, SG; ²Institute for Infocomm Research, SG; ³UIUC, US
Abstract
Real-time applications such as telepresence systems present an opportunity to use embedded GPUs for compute acceleration to meet platform goals. In this paper, we develop a prototype of a portable, standalone telepresence robot that performs real-time attention-directed control using an NVIDIA Jetson TK1 embedded platform. We perform platform-specific optimizations to improve thread occupancy, optimize the computation workload, and improve the accuracy of face detection on the embedded GPU. We achieve real-time performance of 30 frames per second on the Jetson TK1 and an overall speedup of 10x compared to the ARM CPU version.

12:30

End of session
Lunch Break in Großer Saal + Saal 1
Keynote Lecture in "Saal 2" 13:30 - 14:00

Visit us at DATE 2016