W8 3PMCES - Performance, Power and Predictability of Many-Core Embedded Systems


Agenda

08:30 – 08:45

Opening Session
Adam Morawiec, Tapani Ahonen, and Walter Stechele

08:45 – 09:30  

Keynote Presentation

Trustworthy Contract Computing through Seamless System Build and Operation
Tapani Ahonen, Tampere University of Technology, FI

Computing systems are built and operated in a way akin to manufacturing pipelines. In a pipeline organization, all stages need to operate simultaneously without significant disruptions. When one stage fails, the others come to a halt either immediately or after a short grace period enabled by a buffer queue. The ability to fix emerging issues as quickly as possible is therefore crucial for maintaining system functionality and throughput. Hierarchically organized management is designed to keep all the individual stages operating efficiently, yet such management organizations are prone to introducing inefficiencies. Hierarchies become deeper with growing system complexity: at higher levels of management the area of responsibility is wider, while the capability to directly control low-level operations is weaker. Information exchange and instruction chains are usually cumbersome when they span many levels of the organization, in part because operations reserve information for internal use only and in part because of restrictive exchange interfaces. Incoherency of the information available to different operations in the organization often leads to unnecessary functional redundancy and to indirect, wasteful ways of executing joint functions.

What purpose does it serve to encapsulate information in one operation or stage only? In a pure pipeline organization, each stage is supposed to execute a highly specialized sub-system consisting of fully independent operations that no other stage is capable of performing. Special tools and components for one operation are in the possession of one stage only. There it makes perfect sense to decline internal cooperation in order to ensure local control, as the other stages are neither needed nor helpful. However, this is not the case for computing systems. Software functions cannot be executed without supporting hardware, and most application code cannot be executed without supporting operating system functions. Design-time tools cannot produce good results without detailed information about the run-time environment.

09:30 – 10:00  

Session 1:

Predictable Portability of Parallel Code Based on Increased Semantic Information Sharing and Fork-Join Programming
Konstantin Popov, SICS Swedish ICT AB, SE

Heterogeneous parallel computing seems to be the way forward in all market segments where computers are used. Parallelism ranges from a few cores in small embedded systems to hundreds or even thousands in the HPC domain. Heterogeneity ranges from capability-heterogeneous systems to functionally heterogeneous systems with multiple completely different compute engines such as GPGPUs or hardware accelerators. This poses a tremendous challenge for software developers, who currently need to tailor their software to each individual platform. In this talk I present an approach to ease the programming burden by means of open semantic information interfaces across software abstraction layers and by using programming models that are amenable to analysis, modeling, and good run-time scheduling decisions.
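As a hedged illustration of the fork-join style the talk builds on (this is not code from the talk, and the talk does not prescribe OpenMP), the following minimal C sketch shows the structured parallelism that makes such programs amenable to analysis and scheduling: work is forked across cores and joined at a single point.

    /* Minimal fork-join sketch in C with OpenMP (illustrative only).
     * Compile with: cc -fopenmp forkjoin.c */
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N];
        double sum = 0.0;

        /* fork: loop iterations are distributed over the available cores */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = i * 0.5;
            sum += a[i];
        }
        /* join: all threads have finished before execution continues here */
        printf("sum = %f\n", sum);
        return 0;
    }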

10:00 – 11:00

Session 2: Managing Execution in Dynamic Environment

Efficient Leader Election for Synchronous Shared-Memory Systems
Vicent Sanz Marco, Raimund Kirner, and Michael Zolda, University of Hertfordshire, UK

Leader election is a frequent problem in systems where it is important to coordinate the activities of a group of actors. It has been extensively studied in the context of networked systems, but with the rise of many-core computer architectures it has also become important for shared-memory systems. In this paper we present an efficient leader election technique for synchronous shared-memory systems. Synchronous in our context means that the response time of code sections with relevant communication patterns is bounded. This makes our approach more efficient than leader election methods devised for asynchronous shared-memory systems. Our leader election method is used to help make the scheduling layer LPEL fault-tolerant.
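The paper's own technique is not reproduced in this abstract; as a generic illustration of shared-memory leader election, the C11 sketch below lets each participant race on an atomic compare-and-swap so that exactly one becomes leader (names and structure are our own, not the authors').

    /* Generic CAS-based leader election sketch (C11 atomics); this is
     * not the LPEL method, only an illustration of the problem. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define NO_LEADER (-1)

    static atomic_int leader = NO_LEADER;

    /* Each participant calls elect() with its own id; exactly one call
     * succeeds. In a synchronous system, the bounded response time lets
     * the losers observe the winner within a known delay. */
    bool elect(int my_id) {
        int expected = NO_LEADER;
        return atomic_compare_exchange_strong(&leader, &expected, my_id);
    }

    int current_leader(void) {
        return atomic_load(&leader);
    }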

A Proposal on Parallel Software Development for Network-on-Chip based Many-Core system (Short Paper)
Guoqing Zhang and Tapani Ahonen, Tampere University of Technology, FI

This paper gives a proposal on parallel software development for Network-on-Chip (NoC) based many-core systems. We recommend using SLURM, an open-source resource manager widely used in computer clusters, for hardware mapping and resource management, and the Message Passing Interface (MPI) for handling communication among cores in NoC systems. We suggest using SLURM's partition concept for hardware mapping, especially in safety-critical systems that require separate hardware mappings for safety-critical and non-safety-critical tasks. We also demonstrate a methodology for porting the Symmetric Multi-Processor (SMP) architecture of Linux on top of SLURM-and-MPI enabled NoC systems by employing the scheduling-domain concept of the Linux kernel. In this methodology, the cores of a NoC system are separated into partitions according to the hardware topology, handling I/O-bound processes and CPU-bound processes respectively.
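As a hedged sketch of the proposed stack (not the authors' code), the fragment below uses standard MPI calls for core-to-core communication; under SLURM it would be launched into a partition with, e.g., srun --partition=safety ./app, where the partition name is hypothetical.

    /* Minimal MPI sketch of core-to-core messaging (illustrative). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, token = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* core 0 hands a datum to every other core in the partition */
            for (int dst = 1; dst < size; dst++)
                MPI_Send(&token, 1, MPI_INT, dst, 0, MPI_COMM_WORLD);
        } else {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("core %d received %d\n", rank, token);
        }
        MPI_Finalize();
        return 0;
    }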

A Power-Aware Framework for Executing Streaming Programs on Networks-on-Chip (Short Paper)
Nilesh Karavadara, Simon Folie, Michael Zolda, Nga Nguyen, and Raimund Kirner, University of Hertfordshire, UK

Software developers are discovering that practices which have successfully served single-core platforms for decades no longer work for multi-cores. Stream processing is a parallel execution model that is well suited for architectures with multiple computational elements connected by a network. We propose a power-aware streaming execution layer for network-on-chip architectures that addresses the energy constraints of embedded devices. Our proof-of-concept implementation targets the Intel SCC processor, which connects 48 cores via a network-on-chip. We motivate our design decisions and describe the status of our implementation.

11:00 – 11:30

Coffee Break and Poster Session

11:30 – 12:00  

Invited Talk:

Performance Prediction and Software Development on Many-core Processor Platform
Benoit Dupont de Dinechin, Kalray, FR

Computation-intensive embedded applications are often constrained by their end-to-end latency. Classic platforms that support these applications rely on FPGAs or farms of digital signal processors running under the supervision of micro-controllers. There are significant advantages to hosting such applications on many-core platforms, in particular reducing the system size, weight, and power (SWaP) while improving programmability. However, existing many-core platforms based on CPUs or GPUs, and the associated software stacks, do not allow for predictable or even repeatable processing latencies.

In this keynote the architecture and programming models of the MPPA-256 processor will be presented. It integrates 256 processing engine (PE) cores and 32 resource management (RM) VLIW cores on a single 28nm CMOS chip. We discuss how this processor and its programming models are especially suited for computation-intensive applications under latency constraints. We also introduce the metalibm library generator for the Kalray VLIW cores, which automatically produces high-performance, correctly rounded implementations of application-specific specializations of the C99 libm and of the IEEE 754-2008 mathematical functions.

12:00 – 13:00  

Lunch and Poster Session

13:00 – 14:00

Session 3: Reliability and Safety of Multi-Core Platforms

Hardware/Software-based Runtime Control of Multicore Processors-on-Chip for Reliability Management under Aging Constraints
Walter Stechele and Erol Koser, Technical University Munich, DE

Multicore processors-on-chip are gaining interest in safety-critical applications such as aeronautics, automotive, and medical systems. Traditional methods for reliability, e.g. triple modular redundancy, might be too expensive; recent research is therefore turning towards reliability-aware runtime management of multicore processors, including dynamic voltage and frequency scaling and dynamic load distribution. We will present a case study that introduces a reliability layer between the multiprocessor hardware and the middleware to cope with aging-related degradation.

MPSoC performance degradation (due to aging) predictability at high abstraction level and Applications
Olivier Heron, CEA-LIST, FR

The shrinking size of transistors and nano-wires results in increased device density, increased speed, and reduced power consumption (Moore's Law). However, device reliability is reduced due to the non-ideal scaling of the supply voltage. We observe two trends in the semiconductor industry. Firstly, MOS technology scaling is still continuing (and will continue). The failure physics become more complex, and new failure modes, which were negligible in older technology nodes, now emerge. The semiconductor industry has solutions for reaching the ITRS requirements until the end of next year; after that, only interim solutions are available. Secondly, Multi-Processor SoCs (MPSoCs) offer high performance, are cheap, consume less power than high-end processors, and are able to support a large variety of applications. As MPSoCs are now applied in most market segments (automotive, consumer, HPC, etc.), both high performance and reliability become major concerns, even for non-safety-critical applications. Current academic and commercial CAD tools help the designer obtain reliability projections of the chip only in the very last design cycles before tape-out. However, MPSoC design and verification raise new challenges. They require new methodologies and CAD tools able to capture both architecture design and reliability at higher abstraction levels, such as the transactional level. In the first development cycles, design space exploration is necessary to analyze different MPSoC configurations (memory sizes, processor pipeline depth, and others) and software, and their impact on performance, power consumption, and reliability. In this talk, I will present a methodology we propose in the RELY project to model and predict reliability at a high abstraction level, and I will present some results on an MPSoC case study.

14:00 – 14:30  

Session 4: European Projects Cluster

European Project Cluster on Mixed-Criticality Systems
Salvador Trujillo (IK4-IKERLAN, ES), Roman Obermaisser (University of Siegen, DE), Kim Gruettner (OFFIS – Institute for Information Technology, DE), Francisco J. Cazorla (Barcelona Supercomputing Center and IIIA-CSIC, ES), and Jon Perez (IK4-IKERLAN, ES)

Modern embedded applications already integrate a multitude of functionalities with potentially different criticality levels into a single system and this trend is expected to grow in the near future. Without appropriate preconditions, the integration of mixed-criticality subsystems can lead to a significant and potentially unacceptable increase of engineering and certification costs. There are several ongoing research initiatives studying mixed criticality integration in multicore processors. Key challenges are the combination of software virtualization and hardware segregation and the extension of partitioning mechanisms jointly addressing significant extra-functional requirements (e.g., time, energy and power budgets, adaptivity, reliability, safety, security, volume, weight, etc.) along with development and certification methodology. This paper provides a summary of the challenges to be addressed in the design and development of future mixed criticality systems and the way in which some current European Projects on the topic address those challenges.

14:30 – 15:00  

Coffee Break and Poster Session

15:00 – 16:00  

Session 5: System Design Technologies

Timing Analysis of a Heterogeneous Architecture with Massively Parallel Processor Arrays
Deepak Gangadharan, Alexandru Tanase, Frank Hannig, and Jürgen Teich, University of Erlangen-Nuremberg, DE

In this paper, we present analytical results from the timing analysis of a heterogeneous architecture with massively parallel processor arrays (MPPAs). Specifically, in this work the MPPA is a tightly coupled processor array (TCPA). In recent work, the TCPA has been shown to be timing-predictable, and symbolic loop scheduling has been used to compute predictable schedules for the execution of each application mapped on the TCPA and run in parallel. However, the timing predictability provided by the TCPA can only be ensured if the shared resources on the TCPA tile provide the required input data rates to the TCPA. Towards this, we formulate a condition that must be satisfied on the local shared bus for the data transfers from the local memory to the TCPA in order to achieve the required application quality and latency of output data. Further, we formulate another condition that must be satisfied by DMA data transfers from the memory tile to the TCPA tile during arbitration in the memory tile, so that the service levels provided by the NoC for the DMA transfers are maximally utilized.
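The paper's exact conditions are not reproduced in this abstract; purely as an illustration, and under assumed notation, a bus-bandwidth condition of this flavor could be written as:

    % Illustrative sketch only, with assumed symbols: r_i is the input
    % data rate application i requires from local memory, s_i the bus
    % share its arbitration slot guarantees, and B the bandwidth of the
    % local shared bus.
    r_i \le s_i \cdot B \quad \forall i, \qquad \sum_i s_i \le 1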

An Accurate Power Estimation Method for MPSoC Based on SystemC Virtual Prototyping
Khouloud Zine Elabidine, LIP6, FR

The paper presents a novel method called DPE (Design Power Estimation), which estimates a SoC's power consumption efficiently and quickly. Our power modeling consists in defining functional activities that best characterize each component of the considered platform. In this work we are not aiming at highly accurate estimations of power consumption; rather, we introduce a method that offers a global power characterization, which helps to perform design space exploration at an early stage of the design flow (SystemC virtual prototyping) and to find the best trade-off between power and performance.

Parallelization of Object Detection Algorithm through Hardware Threads for MPSoCs
David Watson, Ali Ahmadinia, Gordon Morison, and Tom Buggy, Glasgow Caledonian University, UK

Adapting software applications to multiprocessor systems-on-chip (MPSoCs) typically follows multi-threaded design flows and data dependence analysis to implement concurrency. However, to take advantage of the hardware customizations possible with reconfigurable MPSoCs, hardware threads (HWTs) can be used to increase application concurrency and throughput while complementing multi-threaded design flows. In this work, we show how applications can be analyzed and tailored to use HWTs to increase their concurrency. We show how task flow graphs (TFGs) can be transformed into a Kahn Process Network (KPN) that describes how software tasks interact with HWTs over FIFOs to maintain memory coherency at the software level. We evaluate our MPSoC designs in terms of performance increase, throughput, and energy efficiency with a data-intensive face detection algorithm and obtain performance increases of up to 3.6x compared to a software-only implementation, and throughput and energy efficiencies of up to 85.97 MB/s and 11.92 MB/W, respectively.
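As a conceptual sketch (not the authors' framework), the C fragment below captures the KPN idea: a software task exchanges data with a hardware thread only through a bounded FIFO, so coherency is confined to the FIFO boundary. A real HWT would sit behind this interface as an FPGA block, and a production version would need proper memory barriers.

    /* Conceptual single-producer/single-consumer FIFO between a software
     * task and a hardware thread (illustrative; no real synchronization). */
    #include <stdbool.h>

    #define FIFO_DEPTH 16

    typedef struct {
        int buf[FIFO_DEPTH];
        volatile unsigned head, tail;
    } fifo_t;

    bool fifo_push(fifo_t *f, int v) {        /* software side: feed the HWT */
        unsigned next = (f->head + 1) % FIFO_DEPTH;
        if (next == f->tail) return false;    /* full: HWT not ready yet */
        f->buf[f->head] = v;
        f->head = next;
        return true;
    }

    bool fifo_pop(fifo_t *f, int *v) {        /* result path back from the HWT */
        if (f->tail == f->head) return false; /* empty: nothing computed yet */
        *v = f->buf[f->tail];
        f->tail = (f->tail + 1) % FIFO_DEPTH;
        return true;
    }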

16:00 – 16:45

Panel Session:
What is still needed to have a reliable embedded system development ecosystem in place?

Moderator: Achim Rettberg, OFFIS, Germany

Speakers:
    Sven Karlsson, DTU, DK
    Tapani Ahonen, TUT, FI
    Kim Grüttner, OFFIS, DE
    Benoit Dupont de Dinechin, Kalray, FR
    Walter Stechele, TU Munich, DE

16:45 – 17:00 Closing Session
   

POSTER PRESENTATIONS

Adaptive Resource Control in Multi-core Systems
Alexei Iliasov, Ashur Rafiev, Alexander Romanovsky, Andrey Mokhov, Alex Yakovlev, and Fei Xia, Newcastle University, UK

Multi-core systems present a set of unique challenges and opportunities. In this paper we discuss the issues of power-proportional computing in a multi-core environment and argue that a cross-layer approach spanning from hardware to user-facing software is necessary to successfully address this problem.

Criticality-Aware Functionality Allocation for Distributed Multicore Real-Time Systems
Junhe Gan, Paul Pop, and Jan Madsen, Technical University of Denmark, DK

We are interested in the implementation of mixed-criticality hard real-time applications on distributed architectures composed of interconnected multicore processors, where each processing core is called a processing element (PE). The functionality of mixed-criticality hard real-time applications is captured in the early design stages using functional blocks of different Safety-Integrity Levels (SILs). Before the applications are implemented, the functional blocks have to be decomposed into software tasks with SILs. Then the software tasks have to be mapped and scheduled on the PEs of the architecture. We consider fixed-priority preemptive scheduling for tasks and non-preemptive scheduling for messages. We would like to determine the function-to-task decomposition, the types of PEs in the architecture, and the mapping of tasks to the PEs such that the total cost is minimized, the application is schedulable, and the safety and security constraints are satisfied. The total cost captures the development and certification costs and the unit cost of the architecture. We propose a Genetic Algorithm-based approach to solve this two-objective optimization problem and evaluate it using a real-life case study from the automotive industry.
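To make the optimization problem concrete, here is a hedged C sketch of what one GA candidate might look like; the genes, sizes, and cost terms are our own illustration, not the authors' exact model.

    /* Hypothetical GA candidate for the decomposition/mapping problem. */

    #define NUM_FUNCS  8   /* functional blocks with SILs */
    #define NUM_TASKS 16
    #define NUM_PES    4

    typedef struct {
        int decomposition[NUM_FUNCS]; /* gene: decomposition variant per block */
        int task_to_pe[NUM_TASKS];    /* gene: PE index for each task */
        int pe_type[NUM_PES];         /* gene: type of each PE */
    } chromosome_t;

    /* Supplied elsewhere: cost models and a schedulability test for
     * fixed-priority preemptive tasks and non-preemptive messages. */
    extern double dev_cert_cost(const chromosome_t *c);
    extern double unit_cost(const chromosome_t *c);
    extern int    is_schedulable(const chromosome_t *c);

    /* Lower is better; unschedulable candidates receive a large penalty
     * so the GA steers the population toward feasible solutions. */
    double total_cost(const chromosome_t *c) {
        double cost = dev_cert_cost(c) + unit_cost(c);
        return is_schedulable(c) ? cost : cost + 1.0e9;
    }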

Estimating Video Decoding Energies And Processing Times Utilizing Virtual Hardware
Sebastian Berschneider, Christian Herglotz, Marc Reichenbach, Dietmar Fey, and André Kaup, Friedrich-Alexander-University Erlangen-Nuremberg, DE

The market for embedded devices keeps growing. In particular cell phones and smartphones, which are essential tools for many people, are becoming ever more complex and nowadays serve as portable computers. An important problem for these devices is energy efficiency: the battery can be drained within a few hours, especially when a smartphone processes computationally intensive tasks like video decoding. Modern devices therefore tend to include power-efficient processors. But not only power-efficient hardware affects the overall power consumption; the design of algorithms with energy-efficient programming in mind is an equally important task. Usually, energy-efficient development is done on real hardware, where programs are executed and power consumption is measured. This process is costly and error-prone, and expensive measurement equipment is necessary. Therefore, in this work we present a design methodology that runs the application software on virtual hardware (a virtual CPU) that counts the instructions and memory accesses. By multiplying previously measured energy and time per instruction with these counts, energy and time estimations are possible without having to run the target application on real hardware. As a result, we present a methodology for writing embedded applications with immediate feedback about these non-functional properties.
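The arithmetic behind the estimation step is straightforward; a hedged C sketch follows, in which the instruction classes and the per-instruction energy and time constants are invented for illustration.

    /* Estimate energy and time from virtual-hardware event counts:
     * E = sum_i count_i * e_i and T = sum_i count_i * t_i. */
    #include <stdio.h>

    enum { ALU, LOAD, STORE, NUM_CLASSES };

    int main(void) {
        /* counts reported by the instruction-counting virtual CPU */
        unsigned long long count[NUM_CLASSES] = { 800000, 150000, 50000 };
        /* per-class energy (nJ) and time (ns), measured once beforehand;
         * the numbers here are invented, not measurements */
        double energy_nj[NUM_CLASSES] = { 0.5, 1.8, 1.6 };
        double time_ns[NUM_CLASSES]   = { 1.0, 3.0, 3.0 };

        double e = 0.0, t = 0.0;
        for (int i = 0; i < NUM_CLASSES; i++) {
            e += count[i] * energy_nj[i];
            t += count[i] * time_ns[i];
        }
        printf("estimated energy: %.2f uJ, time: %.2f ms\n",
               e / 1e3, t / 1e6);
        return 0;
    }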

Increased Reliability of Many-Core Platforms through Thermal Feedback Control
Matthias Becker, Kristian Sandström, Moris Behnam, and Thomas Nolte, MRTC / Mälardalen University, SE

In this paper we present a low-overhead thermal management approach to increase the reliability of many-core embedded real-time systems. Each core is controlled by a feedback controller. We adapt the utilization of the core in order to decrease the dynamic power consumption and thus the corresponding heat development. Sophisticated control mechanisms allow us to migrate load in advance, before critical temperature values are reached, so we can migrate in a safe way with a guarantee that all deadlines are met.
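The paper's controller design is not given in this abstract; as a hedged illustration of the idea, a simple proportional controller could scale a core's permitted utilization with its temperature error, with all thresholds and gains invented for the sketch.

    /* Simple per-core P-controller sketch: throttle utilization as the
     * core temperature rises (all constants are illustrative). */

    #define T_TARGET   70.0  /* degC: desired steady-state temperature */
    #define T_CRITICAL 85.0  /* degC: migrate load away before this point */
    #define KP          0.05 /* proportional gain, utilization per degC */

    /* Returns the allowed utilization in [0,1] for the next control
     * period; 0 signals that load should be migrated to another core. */
    double thermal_control(double temp_degc, double util_now) {
        if (temp_degc >= T_CRITICAL)
            return 0.0;                      /* trigger safe migration */
        double u = util_now - KP * (temp_degc - T_TARGET);
        if (u < 0.0) u = 0.0;
        if (u > 1.0) u = 1.0;
        return u;
    }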

Performance Analysis of a Computer Vision Application with the STHORM OpenCL SDK
Vítor Schwambach, Sébastien Cleyet-Merle, Alain Issard, STMicroelectronics, FR and Stéphane Mancini, TIMA lab, FR

Computer vision applications constitute one of the key drivers for embedded many-core architectures. To enable parallel application performance estimation and optimization early in the development flow, the development environment must provide the developer with simulation tools for fast and precise application-level performance analysis. In this work, we port a face detection application onto the STHORM many-core accelerator using the STHORM OpenCL SDK. We compare performance results obtained with the STHORM cycle-approximate simulator against a prototype implementation and show that there is a large mismatch. We identify the key contributors to this mismatch and propose that they be addressed in upcoming versions of the SDK to allow more precise simulation results for early design space exploration.

PSE - Performance Simulation Environment
Jussi Hanhirova and Vesa Hirvisalo, Aalto University, FI

We use a resource-reservation based simulation environment (PSE) as a research tool to experiment with co-modeling of HW/SW schedulers. Our focus is on heterogeneous many-core systems. Task-processing based systems use different load-balancing schemes to make efficient use of resources and to schedule work within real-time constraints. As parallel MPSoCs are constantly evolving, simulation is a viable tool for exploring different configurations.

Scaling Performance of FFT Computation on an Industrial Integrated GPU Co-processor: Experiments with Algorithm Adaptation.
Mohamed Amine Bergach and Serge Tissot, Kontron, FR, Michel Syska and Robert De Simone, Inria, FR

Recent Intel processors (Ivy Bridge, Haswell) contain an embedded on-chip GPU unit in addition to the main CPU. In this work we consider the issue of efficiently mapping Fast Fourier Transform (FFT) computation onto such coprocessor units. To achieve this we pursue three goals:

First, we want to study half-systematic ways to adjust the actual variant of the FFT algorithm, for a given size, to best fit the local memory capacity (the registers of a given GPU block) and perform computations without intermediate calls to distant memory;

Second, we want to study, by extensive experimentation, whether the remaining data transfers between memories (initial loads and final stores after each FFT computation) can be sustained by local interconnects at a speed matching the integrated GPU computations, or conversely if they have a negative impact on performance when computing FFTs on GPUs "at full blast";

Third, we want to record the energy consumption as observed in the previous experiments, and compare it to similar FFT implementations on the CPU side of the chip.

We report our work in this short paper and its companion poster, showing graphical results on a range of experiments. In broad terms, our findings are that GPUs can compute FFTs of a typical size faster than internal on-chip interconnects can provide them with data (by a factor of roughly 2), and that energy consumption is far smaller than on the CPU side.
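As a hedged sketch of the first goal, the C fragment below picks a radix split so that each sub-FFT's working set fits an assumed per-block register capacity; both the capacity figure and the greedy splitting rule are our own illustration, not the authors' method.

    /* Split an n-point FFT into stages whose working sets fit the local
     * register capacity (capacities and policy are illustrative). */
    #include <stdio.h>

    static void plan_fft(unsigned n, unsigned max_pts) {
        unsigned stage = 1;
        while (n > 1) {
            unsigned radix = (n < max_pts) ? n : max_pts;
            while (radix > 1 && n % radix != 0)
                radix--;                /* largest divisor that fits */
            if (radix == 1)
                radix = n;              /* prime remainder: one big stage */
            printf("stage %u: radix-%u\n", stage++, radix);
            n /= radix;
        }
    }

    int main(void) {
        /* e.g. a 1024-point FFT with room for 64 complex values in the
         * register file of one GPU block (assumed figure) */
        plan_fft(1024, 64);
        return 0;
    }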

Smart Scheduling of Streaming Applications via Timed Automata
Waheed Ahmad, Robert de Groote, Philip K.F. Hölzenspies, Mariëlle Stoelinga, and Jaco van de Pol, University of Twente, NL

Streaming applications such as video-in-video and multi-video conferencing impose high demands on system performance. On the one hand, they require high system throughput; on the other hand, usage of the available resources must be kept to a minimum in order to save energy. Synchronous dataflow (SDF) graphs are very popular computational models for analysing streaming applications. Recently, they have been widely used for the analysis of streaming applications on single processors as well as in multiprocessing contexts. Smart scheduling techniques are critical for system lifetime, so that the maximum throughput is obtained while running as few resources as possible.

Current maximum-throughput calculation methods for SDF graphs require an unbounded number of processors or static-order scheduling of tasks. Other methods involve the conversion of an SDF graph into an equivalent Homogeneous SDF (HSDF) graph. This approach results in a bigger graph; in the worst case, the converted HSDF graph can be exponentially larger.

This poster presents an alternative, novel approach to analysing SDF graphs on a given number of processors using a proven formalism for timed systems termed Timed Automata (TA). By definition, TA are automata in which the elapse of time is measured by clock variables. The conditions under which a transition can be taken are indicated by clock guards, and invariants show the conditions under which a system may stay in a certain state. Synchronous communication between timed automata is carried out by hand-shake synchronisation using input and output actions, denoted with an exclamation mark and a question mark respectively, e.g. fire! and fire?. TA strike a good balance between expressiveness and tractability and are supported by various verification tools, e.g. UPPAAL.

We translate the SDF graph of an application and a given architecture of computer processors into separate timed automata. The automata synchronise using the actions "req" and "fire". In this way, the timed automaton of the application SDF graph is mapped onto the timed automata of the architecture model. After that, we can analyse the performance using different measures of interest.

In particular, the main contributions of this poster are: (1) a compositional translation of SDF graphs into timed automata; (2) exploiting the capabilities of UPPAAL to search the whole state-space and to find the schedule that fits on the available processors and maximises the throughput; (3) finding the maximum throughput on homogeneous and heterogeneous platforms; (4) quantitative model-checking. We also demonstrate that deadlock freedom is preserved even if the number of processors varies.

Results show that in some cases the maximum throughput of an SDF graph remains the same even if the number of processors is reduced. Similarly, a trade-off between a given number of processors and the maximum throughput can be obtained efficiently. Moreover, the benefits of quantitative model-checking and verification of user-defined properties can be obtained with different contemporary model-checkers.

Future work includes energy-optimal synthesis and scheduling, translation of SDF graphs to Energy-Aware Automata, extension of SDF graphs with energy costs and stochastics, dynamic power management (DPM), and reduction techniques for energy models. In order to tackle state-space explosion, we also plan to apply multi-core LTL model checking.

System Level Design Framework for Many-core Architectures
Pablo Peñil, Luis Diaz, and Pablo Sanchez, University of Cantabria, ES

Embedded many-core architectures have seen constantly increasing shipment volumes in recent years, as they provide a solution for creating highly optimized complex systems. In order to deal with the complexity of these many-core architectures, users require new design methodologies that encompass system specification and performance analysis from the initial stages of the design process. Performance-analysis frameworks should include co-simulation of the software application and the many-core hardware platform in order to obtain estimations of the software execution time and of the performance of platform hardware resources. This paper presents a fully integrated host-compiled simulation framework that enables fast performance estimations for high-level system models. This framework could be integrated into a design exploration methodology that enables choosing the optimal specification and software parallelization, facilitating system implementation and minimizing designer effort.

 

Organisation

Mats Brorsson - professor of Computer Architecture at KTH, Sweden, and senior researcher at the Swedish Institute of Computer Science (SICS). His current research interests are in programming models, run-time systems, operating systems, and the architecture of parallel computer systems, in particular multi- and many-core systems. Prof. Brorsson has authored and co-authored over 50 scientific papers in international conferences and journals.

Tapani Ahonen is a part-time Senior Scientist at Technoconsult (TC), Denmark, and an Assistant Professor at Tampere University of Technology (TUT), Finland. His work is focused on proof-of-concept driven computer systems design with emphasis on many-core processing environments. Ahonen holds an MSc in Electrical Engineering and a PhD in Information Technology from TUT. He has an extensive international publication record including edited books and journals, book chapters and journal articles, invited talks at high-quality conferences, as well as full-length papers and paper abstracts in conference proceedings.

Sven Karlsson - associate professor at DTU Informatics, DTU, Denmark. His research interests are in programming models, compilers, architectures, operating systems and system software for parallel computers. He has published more than 30 papers in these fields.

Walter Stechele - associate professor at Technical University of Munich (TUM), Germany. His research interests include visual computing and robotic vision, with focus on Multi Processor System-on-Chip (MPSoC) architectures and design methodology, low power optimization, dynamic reconfiguration of FPGA devices, and applications in automotive and robotics.

Adam Morawiec - director at ECSI. He holds a PhD from TIMA Lab/INPG in Grenoble and works in the domain of specification and design languages, system design and synthesis. He is the author of several scientific publications and the editor of 4 books. He has also chaired scientific conferences (DASIP, S4D, ESLsyn).
