10.6 Compilers and Tools for GPUs and MPSoCs

Date: Thursday 17 March 2016
Time: 11:00 - 12:30
Location / Room: Konferenz 4

Chair:
Frank Hannig, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE

Co-Chair:
Lars Bauer, Karlsruhe Institute of Technology, DE

This session covers compiler optimisations and tools for efficient execution on GPUs and MPSoCs. The first paper presents a lightweight OpenMP implementation for parallel accelerators. The next two papers focus on GPU performance modelling and tuning. The final paper leverages approximation to improve the throughput of OpenCL programs on FPGAs. In addition, an interactive presentation deals with MATLAB-to-ASIP compilation.

Time | Label | Presentation Title / Authors
11:00 | 10.6.1 | AN OPTIMIZED TASK-BASED RUNTIME SYSTEM FOR RESOURCE-CONSTRAINED PARALLEL ACCELERATORS
Speaker:
Daniele Cesarini, Università di Bologna, IT
Authors:
Daniele Cesarini, Andrea Marongiu and Luca Benini, Università di Bologna, IT
Abstract
Manycore accelerators have recently proven a promising solution for increasingly powerful and energy-efficient computing systems. This raises the need for parallel programming models capable of effectively leveraging hundreds to thousands of processors. Programming approaches that put the burden of handling the complexity of performance scalability on application developers are bound to fail at a wide scale. Distributing parallel work efficiently to the available hardware resources should be controlled by system software libraries and runtime environments, while programmers should focus on expressing parallelism at the application level. Task-based parallelism has the potential to provide such features, offering flexible support for fine-grained and irregular parallelism. However, efficiently supporting this programming paradigm on resource-constrained parallel accelerators is a challenging task. In this paper, we present an optimized implementation of the OpenMP tasking model for embedded parallel accelerators, discussing the key design solutions that guarantee a small memory footprint and minimize performance overheads. We validate our design by comparing it against several state-of-the-art tasking implementations on the most representative parallelization patterns. The experimental results confirm that our proposal efficiently enables tasking on embedded parallel accelerators.
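
For readers unfamiliar with the tasking model the paper implements, the sketch below shows standard OpenMP tasking (OpenMP 3.0 and later) applied to a classic fine-grained, irregular workload. It illustrates the programming interface only, not the authors' runtime, which is the paper's actual contribution:

    #include <stdio.h>
    #include <omp.h>

    /* Recursive Fibonacci: a classic fine-grained, irregular task pattern.
     * Each recursive call becomes a task the runtime may run on any core. */
    static long fib(int n)
    {
        long x, y;
        if (n < 2)
            return n;

        #pragma omp task shared(x)   /* spawn child; runtime schedules it */
        x = fib(n - 1);

        #pragma omp task shared(y)
        y = fib(n - 2);

        #pragma omp taskwait         /* join both children before combining */
        return x + y;
    }

    int main(void)
    {
        long result;
        #pragma omp parallel         /* create the thread team */
        #pragma omp single           /* one thread seeds the task graph */
        result = fib(30);
        printf("fib(30) = %ld\n", result);
        return 0;
    }

On a resource-constrained accelerator, the interesting question is what sits beneath these pragmas: the task queues and descriptors the runtime allocates must fit in a small on-chip memory, which is the footprint problem the paper addresses.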

11:30 | 10.6.2 | A FINE-GRAINED PERFORMANCE MODEL FOR GPU ARCHITECTURES
Speaker:
Federico Busato, University of Verona, IT
Authors:
Nicola Bombieri, Federico Busato and Franco Fummi, University of Verona, IT
Abstract
The increasing programmability, performance, and cost-effectiveness of GPUs have led to widespread use of such many-core architectures to accelerate general-purpose applications. Nevertheless, tuning applications to efficiently exploit the GPU's potential is a very challenging task, especially for inexperienced programmers. This is due to the difficulty of developing software for the specific GPU architectural configuration, which includes managing the memory hierarchy and optimizing the execution of thousands of concurrent threads while maintaining the semantic correctness of the application. Even though several profiling tools exist that provide programmers with a large number of metrics and measurements, it is often difficult to interpret this information to tune the application effectively. This paper presents a performance model that accurately estimates the potential performance of the application under tuning on a given GPU device and, at the same time, provides programmers with interpretable profiling hints. The paper shows the results obtained by applying the proposed model to profile commonly used primitives and real codes.
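
The authors' model itself is not reproduced in the abstract, but fine-grained GPU performance models generally build on the occupancy calculation sketched below. The resource limits are NVIDIA Kepler-class values chosen for illustration, and the sketch ignores the allocation-granularity rounding that real hardware applies:

    #include <stdio.h>

    /* Occupancy sketch: the resource-limit arithmetic at the core of most
     * GPU performance models. Limits below are NVIDIA Kepler-class values
     * (e.g. GK110) and are illustrative only. */
    #define MAX_WARPS_PER_SM   64
    #define MAX_BLOCKS_PER_SM  16
    #define REGS_PER_SM        65536
    #define SMEM_PER_SM        49152    /* bytes */
    #define WARP_SIZE          32

    static double occupancy(int threads_per_block, int regs_per_thread,
                            int smem_per_block)
    {
        int warps_per_block = (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;

        /* Each resource independently caps resident blocks per SM. */
        int by_warps = MAX_WARPS_PER_SM / warps_per_block;
        int by_regs  = regs_per_thread
                     ? REGS_PER_SM / (regs_per_thread * threads_per_block)
                     : MAX_BLOCKS_PER_SM;
        int by_smem  = smem_per_block
                     ? SMEM_PER_SM / smem_per_block
                     : MAX_BLOCKS_PER_SM;

        int blocks = by_warps < by_regs ? by_warps : by_regs;
        if (by_smem < blocks) blocks = by_smem;
        if (blocks > MAX_BLOCKS_PER_SM) blocks = MAX_BLOCKS_PER_SM;

        /* Occupancy = resident warps / maximum resident warps. */
        return (double)(blocks * warps_per_block) / MAX_WARPS_PER_SM;
    }

    int main(void)
    {
        /* 256-thread blocks, 40 registers/thread, 8 KiB shared memory/block */
        printf("occupancy = %.2f\n", occupancy(256, 40, 8192));
        return 0;
    }

Each limiting resource in such a calculation is a natural place to attach the kind of interpretable profiling hint the abstract describes, for instance that register usage, rather than block size, is what caps concurrency in a given kernel.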

12:00 | 10.6.3 | CRITICAL POINTS BASED REGISTER-CONCURRENCY AUTOTUNING FOR GPUS
Speaker:
Ang Li, Eindhoven University of Technology, NL
Authors:
Ang Li1, Shuaiwen Leon Song2, Akash Kumar3, Eddy Z. Zhang4, Daniel Chavarria2 and Henk Corporaal1
1Eindhoven University of Technology, NL; 2Pacific Northwest National Lab, US; 3Technische Universität Dresden, DE; 4Rutgers University, US
Abstract
The unprecedented prevalence of GPGPU is largely attributed to its abundant on-chip register resources, which allow massively concurrent threads and extremely fast context switching. However, due to internal memory size constraints, there is a tradeoff between per-thread register usage and overall thread concurrency. This becomes a design problem in terms of performance tuning, since the performance "sweet spot", which can be significantly affected by these two factors, is generally unknown beforehand. In this paper, we propose an effective autotuning solution to quickly and efficiently select the optimal number of registers per thread for delivering the best GPU performance. Experiments on three generations of GPUs (NVIDIA Fermi, Kepler, and Maxwell) demonstrate that our simple strategy achieves an average performance improvement of 10%, and a maximum of 50%, over the original version without modifying the user code. Additionally, to reduce local cache misses due to register spilling and further improve performance, we explore three optimization schemes (bypassing L1 for global memory accesses, enlarging the local L1 cache, and spilling into shared memory) and discuss their impact on performance on a Kepler GPU.
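
For context, the knob such an autotuner turns is user-visible: nvcc's -maxrregcount flag (or the __launch_bounds__ qualifier in source) caps registers per thread, trading possible spills for higher concurrency. A naive exhaustive sweep over that flag, the kind of search the paper's critical-points analysis is designed to prune, might look like the following sketch, where kernel.cu, ./bench, and the benchmark's output format are placeholder assumptions:

    #include <stdio.h>
    #include <stdlib.h>

    /* Naive register-cap sweep: recompile the kernel with each register
     * limit and keep the fastest. "kernel.cu" and "./bench" are placeholder
     * names; ./bench is assumed to print its kernel time in milliseconds. */
    int main(void)
    {
        double best_time = 1e30;
        int best_cap = 0;
        char cmd[256];

        for (int cap = 16; cap <= 128; cap += 8) {
            /* -maxrregcount is the real nvcc flag capping registers/thread */
            snprintf(cmd, sizeof cmd,
                     "nvcc -O3 -maxrregcount=%d kernel.cu -o bench", cap);
            if (system(cmd) != 0)
                continue;                 /* skip caps that fail to build */

            FILE *p = popen("./bench", "r");
            if (!p)
                continue;
            double t;
            if (fscanf(p, "%lf", &t) == 1 && t < best_time) {
                best_time = t;
                best_cap  = cap;
            }
            pclose(p);
        }
        printf("best cap: %d regs/thread (%.3f ms)\n", best_cap, best_time);
        return 0;
    }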

12:15 | 10.6.4 | GRATER: AN APPROXIMATION WORKFLOW FOR EXPLOITING DATA-LEVEL PARALLELISM IN FPGA ACCELERATION
Speaker:
Abbas Rahimi, UC Berkeley, US
Authors:
Atieh Lotfi1, Abbas Rahimi2, Amir Yazdanbakhsh3, Hadi Esmaeilzadeh3 and Rajesh Gupta1
1UC San Diego, US; 2UC Berkeley, US; 3Georgia Institute of Technology, US
Abstract
Modern applications, including graphics, multimedia, web search, and data analytics, not only benefit from acceleration but also exhibit significant degrees of tolerance to imprecise computation. This amenability to approximation provides an opportunity to trade the quality of the results for higher performance and better resource utilization. Exploiting this opportunity is particularly important for FPGA accelerators, which are inherently subject to many resource constraints. To better utilize the FPGA resources, we devise GRATER, an automated design workflow for FPGA accelerators that leverages imprecise computation to increase data-level parallelism and achieve higher computational throughput. The core of our workflow is a source-to-source compiler that takes an input kernel and applies a novel optimization technique that selectively reduces the precision of the kernel's data and operations. Reducing the precision shrinks the area required to synthesize the kernels on the FPGA, allowing a larger number of operations and parallel kernels to be integrated in the fixed area of the FPGA. The larger number of integrated kernels provides more hardware contexts to better exploit data-level parallelism in the target applications. To effectively explore the design space of approximate kernels, we use a genetic algorithm to find a subset of safe-to-approximate operations and data elements and then tune their precision levels until the desired output quality is achieved. GRATER is a purely software technique and does not require any changes to the underlying FPGA hardware. We evaluate GRATER on a diverse set of data-intensive OpenCL benchmarks from the AMD SDK. The synthesis results on a modern Altera FPGA show that our approximation workflow yields 1.4×-3.0× higher throughput with less than 1% quality loss.
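
A hand-written illustration of the kind of source-to-source rewrite GRATER automates is sketched below in OpenCL C. The kernel is invented for illustration; in the real workflow the genetic search, not the programmer, decides which data are safe to narrow, and whether a given FPGA OpenCL toolchain supports half precision is itself an assumption here:

    /* Original kernel: full single precision throughout. */
    __kernel void scale_add(__global const float *a, __global const float *b,
                            __global float *out, float k)
    {
        int i = get_global_id(0);
        out[i] = k * a[i] + b[i];
    }

    /* Approximated variant: values judged safe-to-approximate are narrowed
     * to half precision (requires the cl_khr_fp16 extension). On an FPGA,
     * the narrower multiplier and adder occupy less area, so more copies
     * of the kernel fit on the device and run in parallel. */
    #pragma OPENCL EXTENSION cl_khr_fp16 : enable
    __kernel void scale_add_approx(__global const half *a,
                                   __global const half *b,
                                   __global half *out, half k)
    {
        int i = get_global_id(0);
        out[i] = k * a[i] + b[i];
    }

The area saved by the narrower datapath is what lets the synthesis tool replicate the kernel more times within the FPGA's fixed area, which is where the reported throughput gain comes from.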

12:30 | IP5-7, 426 | MATLAB TO C COMPILATION TARGETING APPLICATION SPECIFIC INSTRUCTION SET PROCESSORS
Speaker:
Francky Catthoor, Interuniversity Microelectronics Centre (IMEC), BE
Authors:
Ioannis Latifis1, Karthick Parashar2, Grigoris Dimitroulakos1, Hans Cappelle2, Christakis Lezos1, Konstantinos Masselos1 and Francky Catthoor2
1University of Peloponnese, GR; 2Interuniversity Microelectronics Centre (IMEC), BE
Abstract
This paper discusses a MATLAB-to-C compiler that exploits custom instructions, such as instructions for SIMD processing and for complex arithmetic, present in Application Specific Instruction Set Processors (ASIPs). The compiler generates ANSI C code in which the processor's special instructions are represented by specialized intrinsic functions, so the generated code can be used as input to any C/C++ compiler. The proposed compiler thus allows the specialized instruction set of the target processor to be described in a parameterized way, supporting any processor. The proposed compiler has been used to generate application code for an ASIP targeting DSP applications. Across six DSP benchmarks, the code generated by the proposed compiler achieves a speedup of 2x-30x on the targeted ASIP compared to the code generated by the MathWorks MATLAB-to-C compiler. The proposed compiler can thus be employed to reduce development time, effort, cost, and time to market by raising the abstraction level of application design in an embedded systems / system-on-chip development context while still improving implementation efficiency.
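
The abstract's central mechanism, ASIP instructions surfaced as intrinsic functions inside otherwise portable ANSI C, can be sketched as follows. The vec4_t type, the simd_mac4 intrinsic, and the ASIP_TOOLCHAIN macro are all invented for illustration; the actual compiler emits whatever intrinsics the processor description defines:

    #include <stdint.h>

    /* Hypothetical output of a MATLAB-to-C compiler for "y = a .* b + c".
     * vec4_t, simd_mac4 and ASIP_TOOLCHAIN are invented for illustration. */
    typedef struct { int16_t v[4]; } vec4_t;

    #ifdef ASIP_TOOLCHAIN
    /* On the ASIP, this intrinsic maps to one SIMD multiply-accumulate. */
    vec4_t simd_mac4(vec4_t a, vec4_t b, vec4_t c);
    #else
    /* Portable fallback: any ANSI C compiler builds the same source. */
    static vec4_t simd_mac4(vec4_t a, vec4_t b, vec4_t c)
    {
        vec4_t r;
        for (int i = 0; i < 4; i++)
            r.v[i] = (int16_t)(a.v[i] * b.v[i] + c.v[i]);
        return r;
    }
    #endif

    void vec_mac(const vec4_t *a, const vec4_t *b, const vec4_t *c,
                 vec4_t *y, int n)
    {
        for (int i = 0; i < n; i++)  /* one SIMD MAC per 4 elements */
            y[i] = simd_mac4(a[i], b[i], c[i]);
    }

The fallback branch is what makes the generated source acceptable to any C/C++ compiler, while the ASIP toolchain maps each simd_mac4 call to a single instruction.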

12:30 | End of session
Lunch Break in Großer Saal + Saal 1
Keynote Lecture in "Saal 2", 13:30 - 14:00