10.6 Compilers and Tools for GPUs and MPSoCs

Date: Thursday 17 March 2016
Time: 11:00 - 12:30
Location / Room: Konferenz 4

Chair:
Frank Hannig, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE

Co-Chair:
Lars Bauer, Karlsruhe Institute of Technology, DE

This session covers compiler optimisations and tools for efficient execution on GPUs and MPSoCs. The first paper presents a lightweight OpenMP implementation for parallel accelerators. The next two papers focus on GPU performance modelling and tuning. The final paper leverages approximation to improve the throughput of OpenCL programs on FPGAs. In addition, an interactive presentation deals with MATLAB-to-ASIP compilation.

Time | Label | Presentation Title / Authors
11:00 | 10.6.1 | AN OPTIMIZED TASK-BASED RUNTIME SYSTEM FOR RESOURCE-CONSTRAINED PARALLEL ACCELERATORS
Speaker:
Daniele Cesarini, Università di Bologna, IT
Authors:
Daniele Cesarini, Andrea Marongiu and Luca Benini, Università di Bologna, IT
Abstract
Manycore accelerators have recently proven a promising solution for increasingly powerful and energy-efficient computing systems. This raises the need for parallel programming models capable of effectively leveraging hundreds to thousands of processors. Programming approaches that put the burden of handling the complexity of performance scalability on application developers are bound to fail at a wide scale. Distributing parallel work efficiently to the available hardware resources should be controlled by system software libraries and runtime environments, while programmers should focus on expressing parallelism at the application level. Task-based parallelism has the potential to provide such features, offering flexible support for fine-grained and irregular parallelism. However, efficiently supporting this programming paradigm on resource-constrained parallel accelerators is a challenging task. In this paper, we present an optimized implementation of the OpenMP tasking model for embedded parallel accelerators, discussing the key design solutions that guarantee a small memory footprint and minimize performance overheads. We validate our design by comparing it against several state-of-the-art tasking implementations on the most representative parallelization patterns. The experimental results confirm that our proposal efficiently enables tasking on embedded parallel accelerators.
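
For readers unfamiliar with the tasking model the paper implements, the sketch below shows standard OpenMP tasking (OpenMP 3.0 and later) applied to a classic fine-grained, irregular workload. It illustrates the programming interface only, not the authors' runtime, which is the paper's actual contribution:

    #include <stdio.h>
    #include <omp.h>

    /* Recursive Fibonacci: a classic fine-grained, irregular task pattern.
     * Each recursive call becomes a task the runtime may run on any core. */
    static long fib(int n)
    {
        long x, y;
        if (n < 2)
            return n;

        #pragma omp task shared(x)   /* spawn child; runtime schedules it */
        x = fib(n - 1);

        #pragma omp task shared(y)
        y = fib(n - 2);

        #pragma omp taskwait         /* join both children before combining */
        return x + y;
    }

    int main(void)
    {
        long result;
        #pragma omp parallel         /* create the thread team */
        #pragma omp single           /* one thread seeds the task graph */
        result = fib(30);
        printf("fib(30) = %ld\n", result);
        return 0;
    }

On a resource-constrained accelerator, the interesting question is what sits beneath these pragmas: the task queues and descriptors the runtime allocates must fit in a small on-chip memory, which is the footprint problem the paper addresses.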

11:30 | 10.6.2 | A FINE-GRAINED PERFORMANCE MODEL FOR GPU ARCHITECTURES
Speaker:
Federico Busato, University of Verona, IT
Authors:
Nicola Bombieri, Federico Busato and Franco Fummi, University of Verona, IT
Abstract
The increasing programmability, performance, and cost-effectiveness of GPUs have led to widespread use of such many-core architectures to accelerate general-purpose applications. Nevertheless, tuning applications to efficiently exploit the GPU's potential is a very challenging task, especially for inexperienced programmers. This is due to the difficulty of developing software for the specific GPU architectural configuration, which includes managing the memory hierarchy and optimizing the execution of thousands of concurrent threads while maintaining the semantic correctness of the application. Even though several profiling tools exist that provide programmers with a large number of metrics and measurements, it is often difficult to interpret this information to tune the application effectively. This paper presents a performance model that accurately estimates the potential performance of the application under tuning on a given GPU device and, at the same time, provides programmers with interpretable profiling hints. The paper shows the results obtained by applying the proposed model to profile commonly used primitives and real codes.
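
The authors' model itself is not reproduced in the abstract, but fine-grained GPU performance models generally build on the occupancy calculation sketched below. The resource limits are NVIDIA Kepler-class values chosen for illustration, and the sketch ignores the allocation-granularity rounding that real hardware applies:

    #include <stdio.h>

    /* Occupancy sketch: the resource-limit arithmetic at the core of most
     * GPU performance models. Limits below are NVIDIA Kepler-class values
     * (e.g. GK110) and are illustrative only. */
    #define MAX_WARPS_PER_SM   64
    #define MAX_BLOCKS_PER_SM  16
    #define REGS_PER_SM        65536
    #define SMEM_PER_SM        49152    /* bytes */
    #define WARP_SIZE          32

    static double occupancy(int threads_per_block, int regs_per_thread,
                            int smem_per_block)
    {
        int warps_per_block = (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;

        /* Each resource independently caps resident blocks per SM. */
        int by_warps = MAX_WARPS_PER_SM / warps_per_block;
        int by_regs  = regs_per_thread
                     ? REGS_PER_SM / (regs_per_thread * threads_per_block)
                     : MAX_BLOCKS_PER_SM;
        int by_smem  = smem_per_block
                     ? SMEM_PER_SM / smem_per_block
                     : MAX_BLOCKS_PER_SM;

        int blocks = by_warps < by_regs ? by_warps : by_regs;
        if (by_smem < blocks) blocks = by_smem;
        if (blocks > MAX_BLOCKS_PER_SM) blocks = MAX_BLOCKS_PER_SM;

        /* Occupancy = resident warps / maximum resident warps. */
        return (double)(blocks * warps_per_block) / MAX_WARPS_PER_SM;
    }

    int main(void)
    {
        /* 256-thread blocks, 40 registers/thread, 8 KiB shared memory/block */
        printf("occupancy = %.2f\n", occupancy(256, 40, 8192));
        return 0;
    }

Each limiting resource in such a calculation is a natural place to attach the kind of interpretable profiling hint the abstract describes, for instance that register usage, rather than block size, is what caps concurrency in a given kernel.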

12:00 | 10.6.3 | CRITICAL POINTS BASED REGISTER-CONCURRENCY AUTOTUNING FOR GPUS
Speaker:
Ang Li, Eindhoven University of Technology, NL
Authors:
Ang Li1, Shuaiwen Leon Song2, Akash Kumar3, Eddy Z. Zhang4, Daniel Chavarria2 and Henk Corporaal1
1Eindhoven University of Technology, NL; 2Pacific Northwest National Lab, US; 3Technische Universität Dresden, DE; 4Rutgers University, US
Abstract
The unprecedented prevalence of GPGPU is largely attributed to its abundant on-chip register resources, which allow massively concurrent threads and extremely fast context switching. However, due to internal memory size constraints, there is a tradeoff between per-thread register usage and overall thread concurrency. This becomes a design problem in terms of performance tuning, since the performance "sweet spot", which can be significantly affected by these two factors, is generally unknown beforehand. In this paper, we propose an effective autotuning solution to quickly and efficiently select the optimal number of registers per thread for delivering the best GPU performance. Experiments on three generations of GPUs (NVIDIA Fermi, Kepler, and Maxwell) demonstrate that our simple strategy achieves an average performance improvement of 10%, and a maximum of 50%, over the original version without modifying the user code. Additionally, to reduce local cache misses due to register spilling and further improve performance, we explore three optimization schemes (bypassing L1 for global memory accesses, enlarging the local L1 cache, and spilling into shared memory) and discuss their impact on performance on a Kepler GPU.
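
For context, the knob such an autotuner turns is user-visible: nvcc's -maxrregcount flag (or the __launch_bounds__ qualifier in source) caps registers per thread, trading possible spills for higher concurrency. A naive exhaustive sweep over that flag, the kind of search the paper's critical-points analysis is designed to prune, might look like the following sketch, where kernel.cu, ./bench, and the benchmark's output format are placeholder assumptions:

    #include <stdio.h>
    #include <stdlib.h>

    /* Naive register-cap sweep: recompile the kernel with each register
     * limit and keep the fastest. "kernel.cu" and "./bench" are placeholder
     * names; ./bench is assumed to print its kernel time in milliseconds. */
    int main(void)
    {
        double best_time = 1e30;
        int best_cap = 0;
        char cmd[256];

        for (int cap = 16; cap <= 128; cap += 8) {
            /* -maxrregcount is the real nvcc flag capping registers/thread */
            snprintf(cmd, sizeof cmd,
                     "nvcc -O3 -maxrregcount=%d kernel.cu -o bench", cap);
            if (system(cmd) != 0)
                continue;                 /* skip caps that fail to build */

            FILE *p = popen("./bench", "r");
            if (!p)
                continue;
            double t;
            if (fscanf(p, "%lf", &t) == 1 && t < best_time) {
                best_time = t;
                best_cap  = cap;
            }
            pclose(p);
        }
        printf("best cap: %d regs/thread (%.3f ms)\n", best_cap, best_time);
        return 0;
    }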

12:15 | 10.6.4 | GRATER: AN APPROXIMATION WORKFLOW FOR EXPLOITING DATA-LEVEL PARALLELISM IN FPGA ACCELERATION
Speaker:
Abbas Rahimi, UC Berkeley, US
Authors:
Atieh Lotfi1, Abbas Rahimi2, Amir Yazdanbakhsh3, Hadi Esmaeilzadeh3 and Rajesh Gupta1
1UC San Diego, US; 2UC Berkeley, US; 3Georgia Institute of Technology, US
Abstract
Modern applications, including graphics, multimedia, web search, and data analytics, not only benefit from acceleration but also exhibit significant degrees of tolerance to imprecise computation. This amenability to approximation provides an opportunity to trade the quality of the results for higher performance and better resource utilization. Exploiting this opportunity is particularly important for FPGA accelerators, which are inherently subject to many resource constraints. To better utilize the FPGA resources, we devise GRATER, an automated design workflow for FPGA accelerators that leverages imprecise computation to increase data-level parallelism and achieve higher computational throughput. The core of our workflow is a source-to-source compiler that takes an input kernel and applies a novel optimization technique that selectively reduces the precision of the kernel's data and operations. Reducing the precision shrinks the area required to synthesize the kernels on the FPGA, allowing a larger number of operations and parallel kernels to be integrated in the fixed area of the FPGA. The larger number of integrated kernels provides more hardware contexts to better exploit data-level parallelism in the target applications. To effectively explore the design space of approximate kernels, we use a genetic algorithm to find a subset of safe-to-approximate operations and data elements and then tune their precision levels until the desired output quality is achieved. GRATER is a purely software technique and does not require any changes to the underlying FPGA hardware. We evaluate GRATER on a diverse set of data-intensive OpenCL benchmarks from the AMD SDK. The synthesis results on a modern Altera FPGA show that our approximation workflow yields 1.4×-3.0× higher throughput with less than 1% quality loss.
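
A hand-written illustration of the kind of source-to-source rewrite GRATER automates is sketched below in OpenCL C. The kernel is invented for illustration; in the real workflow the genetic search, not the programmer, decides which data are safe to narrow, and whether a given FPGA OpenCL toolchain supports half precision is itself an assumption here:

    /* Original kernel: full single precision throughout. */
    __kernel void scale_add(__global const float *a, __global const float *b,
                            __global float *out, float k)
    {
        int i = get_global_id(0);
        out[i] = k * a[i] + b[i];
    }

    /* Approximated variant: values judged safe-to-approximate are narrowed
     * to half precision (requires the cl_khr_fp16 extension). On an FPGA,
     * the narrower multiplier and adder occupy less area, so more copies
     * of the kernel fit on the device and run in parallel. */
    #pragma OPENCL EXTENSION cl_khr_fp16 : enable
    __kernel void scale_add_approx(__global const half *a,
                                   __global const half *b,
                                   __global half *out, half k)
    {
        int i = get_global_id(0);
        out[i] = k * a[i] + b[i];
    }

The area saved by the narrower datapath is what lets the synthesis tool replicate the kernel more times within the FPGA's fixed area, which is where the reported throughput gain comes from.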

12:30 | IP5-7, 426 | MATLAB TO C COMPILATION TARGETING APPLICATION SPECIFIC INSTRUCTION SET PROCESSORS
Speaker:
Francky Catthoor, Interuniversity Microelectronics Centre (IMEC), BE
Authors:
Ioannis Latifis1, Karthick Parashar2, Grigoris Dimitroulakos1, Hans Cappelle2, Christakis Lezos1, Konstantinos Masselos1 and Francky Catthoor2
1University of Peloponnese, GR; 2Interuniversity Microelectronics Centre (IMEC), BE
Abstract
This paper discusses a MATLAB-to-C compiler that exploits custom instructions, such as instructions for SIMD processing and for complex arithmetic, present in Application Specific Instruction Set Processors (ASIPs). The compiler generates ANSI C code in which the processor's special instructions are represented by specialized intrinsic functions, so the generated code can be used as input to any C/C++ compiler. The proposed compiler thus allows the specialized instruction set of the target processor to be described in a parameterized way, supporting any processor. The proposed compiler has been used to generate application code for an ASIP targeting DSP applications. Across six DSP benchmarks, the code generated by the proposed compiler achieves a speedup of 2x-30x on the targeted ASIP compared to the code generated by the MathWorks MATLAB-to-C compiler. The proposed compiler can thus be employed to reduce development time, effort, cost, and time to market by raising the abstraction level of application design in an embedded systems / system-on-chip development context while still improving implementation efficiency.
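
The abstract's central mechanism, ASIP instructions surfaced as intrinsic functions inside otherwise portable ANSI C, can be sketched as follows. The vec4_t type, the simd_mac4 intrinsic, and the ASIP_TOOLCHAIN macro are all invented for illustration; the actual compiler emits whatever intrinsics the processor description defines:

    #include <stdint.h>

    /* Hypothetical output of a MATLAB-to-C compiler for "y = a .* b + c".
     * vec4_t, simd_mac4 and ASIP_TOOLCHAIN are invented for illustration. */
    typedef struct { int16_t v[4]; } vec4_t;

    #ifdef ASIP_TOOLCHAIN
    /* On the ASIP, this intrinsic maps to one SIMD multiply-accumulate. */
    vec4_t simd_mac4(vec4_t a, vec4_t b, vec4_t c);
    #else
    /* Portable fallback: any ANSI C compiler builds the same source. */
    static vec4_t simd_mac4(vec4_t a, vec4_t b, vec4_t c)
    {
        vec4_t r;
        for (int i = 0; i < 4; i++)
            r.v[i] = (int16_t)(a.v[i] * b.v[i] + c.v[i]);
        return r;
    }
    #endif

    void vec_mac(const vec4_t *a, const vec4_t *b, const vec4_t *c,
                 vec4_t *y, int n)
    {
        for (int i = 0; i < n; i++)  /* one SIMD MAC per 4 elements */
            y[i] = simd_mac4(a[i], b[i], c[i]);
    }

The fallback branch is what makes the generated source acceptable to any C/C++ compiler, while the ASIP toolchain maps each simd_mac4 call to a single instruction.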

12:30 | End of session
Lunch Break in Großer Saal + Saal 1
Keynote Lecture in "Saal 2", 13:30 - 14:00