4.6 Managing Multi-Core and Flash Memory


Date: Tuesday 15 March 2016
Time: 17:00 - 18:30
Location / Room: Konferenz 4

Chair:
Akash Kumar, Technische Universität Dresden, DE

Co-Chair:
Olivier Sentieys, INRIA, FR

This session deals with methods to improve the management of multi- and many-core systems and flash memories. Various constraints and objectives are considered: real-time, process variation, fairness, power consumption and performance.

Time  Label  Presentation Title / Authors
17:00  4.6.1  DISTRIBUTED FAIR SCHEDULING FOR MANY-CORES
Speaker:
Anuj Pathania, Karlsruhe Institute of Technology (KIT), DE
Authors:
Anuj Pathania1, Vanchinathan Venkataramani2, Muhammad Shafique1, Tulika Mitra2 and Jörg Henkel1
1Karlsruhe Institute of Technology (KIT), DE; 2National University of Singapore, SG
Abstract
The transition of embedded processors from multi-cores to many-cores continues unabated. Many-cores execute tens of tasks in parallel, and in some contexts it is crucial that the processing cores are distributed fairly amongst the tasks. Traditional queue-based centralized fair schedulers designed for multi-cores incur excessive overhead on many-cores due to the enlarged optimization search space. Further, the processing requirements of executing tasks may vary across different phases of their execution, necessitating lightweight dynamic fair schedulers that regularly perform partial reallocation of the cores. We introduce a distributed dynamic fair scheduler that scales with the number of cores because it distributes the processing overhead of scheduling amongst all the cores. Based on observations of task executions on many-cores, we propose a solution to the fair scheduling problem that is optimal under certain constraints; the general problem is NP-hard.
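
The objective such a scheduler approximates can be illustrated with a small sketch (not the authors' algorithm; task names, demands and core counts are invented): a max-min fair partition of cores among tasks, where no task receives more cores than it demands and any surplus is shared as evenly as possible.

```python
# Illustrative max-min fair core allocation (made-up inputs, not the
# paper's distributed scheduler, which avoids this centralized loop).

def fair_allocation(n_cores, demands):
    """Distribute n_cores among tasks so that no task exceeds its
    demand and the remainder is split as evenly as possible."""
    alloc = {t: 0 for t in demands}
    remaining = n_cores
    unsatisfied = set(demands)
    while remaining > 0 and unsatisfied:
        share = max(1, remaining // len(unsatisfied))
        for t in sorted(unsatisfied):  # copy: safe to discard below
            if remaining == 0:
                break
            give = min(share, demands[t] - alloc[t], remaining)
            alloc[t] += give
            remaining -= give
            if alloc[t] == demands[t]:
                unsatisfied.discard(t)
    return alloc
```

On 8 cores with demands A=10, B=2, C=3, this yields A=3, B=2, C=3: B's small demand is met in full, and the rest is split evenly. The paper's contribution is to reach such allocations without a central queue, spreading the decision over the cores themselves.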

17:30  4.6.2  KEEP IT SLOW AND IN TIME: ONLINE DVFS WITH HARD REAL-TIME WORKLOADS
Speaker:
Kai Lampka, Uppsala University, SE
Authors:
Kai Lampka and Björn Forsberg, Uppsala University, SE
Abstract
To handle hot spots or power shortages, modern multicore processors are equipped with a supervisory dynamic thermal and power management (DTPM) system. When necessary, the DTPM system autonomously adapts the capacity of the cooling system or throttles the speed of core-local clocks via dynamic voltage and frequency scaling (DVFS). In contrast to best-effort scenarios, online DVFS with real-time workloads must also consider the completion times of computations. Whereas execution times can be bounded adequately with worst-case estimates, the arrival times of computation requests are potentially unknown. A deadline can easily be missed if workloads suddenly peak and past clock-speed assignments have built up a non-negligible backlog of computations. To overcome this problem, we introduce a history-aware online DVFS management scheme. It operates a core at higher speed levels only if the future workload could otherwise cause timing violations. We present an implementation of the scheme running on the gem5 hardware simulator.
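
The core idea can be sketched with a toy model (our own construction, not the paper's controller; the frequency levels and units are invented): pick the slowest available frequency at which the accumulated backlog plus the worst-case future demand still completes before the deadline.

```python
# Toy "keep it slow, but in time" speed selection. Demand is measured
# in processor cycles, time in seconds; the DVFS levels are invented.

FREQS_HZ = [0.5e9, 1.0e9, 1.5e9, 2.0e9]  # ascending: prefer slow

def pick_speed(backlog_cycles, wc_future_cycles, time_to_deadline_s):
    """Return the slowest frequency that still meets the deadline,
    or the fastest available one if even that is insufficient."""
    demand = backlog_cycles + wc_future_cycles
    for f in FREQS_HZ:
        if demand / f <= time_to_deadline_s:
            return f
    return FREQS_HZ[-1]
```

With no backlog and 0.4e9 cycles of worst-case demand in one second, the slowest level (0.5 GHz) suffices; add a 0.4e9-cycle backlog from earlier slow running and the model must step up to 1.0 GHz, which is exactly the history-awareness the abstract describes.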

18:00  4.6.3  EXPLOITING PROCESS VARIATION FOR RETENTION INDUCED REFRESH MINIMIZATION ON FLASH MEMORY
Speaker:
Yejia Di, Chongqing University, CN
Authors:
Yejia Di1, Liang Shi1, Kaijie Wu1 and Chun Jason Xue2
1Chongqing University, CN; 2City University of Hong Kong, HK
Abstract
Solid-state drives (SSDs) are becoming the default storage medium as the cost of NAND flash memory drops. However, the cost reduction, driven by density improvement and technology scaling, brings new challenges. One is the rapidly decreasing retention time: the duration for which data written to flash memory cells can be read reliably. To deal with decreasing retention time, refresh has been highly recommended; however, refresh seriously hurts performance and lifetime, especially near the end of a flash memory's life. The second challenge is process variation (PV). Significant PV has been observed in flash memory, introducing large variations in the endurance of flash blocks: high-endurance blocks can provide long retention times, while retention time is short for low-endurance blocks. Considering these two challenges, a novel refresh minimization scheme is proposed for lifetime and performance improvement. The main idea is to preferentially allocate high-endurance blocks to data with long retention-time requirements, so that refresh operations are minimized. Implementation and analysis show that the overhead of the proposed approach is negligible. Simulation results show that both lifetime and performance are significantly improved over the state-of-the-art scheme.
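
The allocation idea admits a simple greedy sketch (our own illustration with invented endurance scores and retention requirements, not the paper's implementation): sort blocks by endurance and pending writes by required retention, then pair the neediest data with the strongest blocks.

```python
# Illustrative retention-aware block allocation: high-endurance blocks,
# which retain data longest, go to the data that must be retained
# longest, so fewer refreshes are triggered. Values are invented.

def allocate(blocks, writes):
    """blocks: {block_id: endurance score}
    writes: {data_id: required retention (arbitrary units)}
    Returns a greedy pairing of long-retention data with
    high-endurance blocks (assumes len(writes) <= len(blocks))."""
    free = sorted(blocks, key=blocks.get, reverse=True)   # best block first
    order = sorted(writes, key=writes.get, reverse=True)  # neediest data first
    return {d: free[i] for i, d in enumerate(order)}
```

With blocks b0/b1/b2 of endurance 3/9/5 and data requiring retention 30 ("archive"), 7 ("warm"), and 1 ("hot"), the archive data lands on the strongest block b1 and the short-lived data on the weakest, b0.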

18:30  IP2-3, 253  WORKLOAD-AWARE POWER OPTIMIZATION STRATEGY FOR ASYMMETRIC MULTIPROCESSORS
Speaker:
Emanuele Del Sozzo, Politecnico di Milano, IT
Authors:
Emanuele Del Sozzo, Gianluca Durelli, Ettore Trainiti, Antonio Miele, Marco Domenico Santambrogio and Cristiana Bolchini, Politecnico di Milano, IT
Abstract
Asymmetric multi-core architectures, such as ARM big.LITTLE, are emerging as successful solutions for the embedded and mobile markets thanks to their ability to trade off performance and power consumption. However, neither the HMP scheduler integrated in commercial products nor previous research approaches fully exploit this potential. We propose a new runtime resource management policy for the big.LITTLE architecture, integrated in Linux, that optimizes power consumption while fulfilling the performance requirements specified for the running applications. Experimental results show an 11% improvement in performance and, at the same time, an 8% reduction in peak power consumption with respect to the current Linux HMP solution.
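
In the spirit of the abstract, a workload-aware placement decision can be caricatured as follows (a toy rule with invented metrics and thresholds, not the authors' policy): keep a thread on an energy-efficient LITTLE core unless its performance requirement exceeds what that cluster can deliver.

```python
# Toy cluster-selection rule for a big.LITTLE-style platform.
# "ips" = instructions per second; all figures are illustrative.

def choose_cluster(measured_ips, required_ips, little_capacity_ips):
    """Prefer the low-power LITTLE cluster whenever the thread's
    requirement is met both by its measured throughput and by the
    LITTLE cluster's capacity; otherwise escalate to a big core."""
    if required_ips <= measured_ips and required_ips <= little_capacity_ips:
        return "LITTLE"
    return "big"
```

A real policy would additionally weigh DVFS levels, co-running threads, and power models, which is where the paper's gains over the stock HMP scheduler come from.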

18:31  IP2-4, 18  (Best Paper Award Candidate)
THE SLOWDOWN OR RACE-TO-IDLE QUESTION: WORKLOAD-AWARE ENERGY OPTIMIZATION OF SMT MULTICORE PLATFORMS UNDER PROCESS VARIATION
Speaker:
Anup Das, University of Southampton, GB
Authors:
Anup Das, Geoff Merrett and Bashir Al-Hashimi, University of Southampton, GB
Abstract
The increasing use of high-performance applications on multicore platforms has driven up energy consumption, making it a primary design optimization objective. Two widely used approaches for reducing the energy consumption of multithreaded workloads are slowdown (using DVFS) and race-to-idle. In this paper, we first demonstrate that the most energy-efficient choice depends on (1) the workload (memory-bound, CPU-bound, etc.), (2) process variation and (3) support for simultaneous multithreading (SMT). We then propose an approach for mapping application threads on SMT multicore systems at runtime to minimize energy consumption. The proposed approach interfaces with the operating system and with hardware performance counters and timers to characterize application threads. This characterization captures the effect of process variation on execution time and identifies the break-even operating point, where one strategy (slowdown or race-to-idle) outperforms the other. Thread mapping is performed using these characterized data by iteratively collapsing application threads (SMT), followed by binary-programming-based thread mapping. Finally, performance slack is exploited at runtime to select between slowdown and race-to-idle, based on the break-even operating point calculated for each individual thread. This end-to-end approach is implemented as a runtime manager for the Linux operating system and is validated across a range of high-performance applications. Results demonstrate up to 13% energy reduction over state-of-the-art approaches, with an average 18% improvement over Linux.
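
The slowdown-versus-race-to-idle trade-off can be made concrete with a toy energy model (entirely our own numbers, not the paper's characterization): race-to-idle runs at full speed and then pays idle power for the rest of the period, while slowdown runs just fast enough to finish at the deadline.

```python
# Toy per-period energy model: static power plus a cubic dynamic term.
# All power figures (watts) and frequencies are illustrative.

F_MAX = 2.0e9  # assumed maximum core frequency

def p_dyn(f, p_peak=3.5, p_static=0.5):
    """Active power at frequency f: static floor + cubic dynamic term."""
    return p_static + p_peak * (f / F_MAX) ** 3

def energy_race_to_idle(cycles, period_s, p_idle=0.5):
    """Run at F_MAX, then idle for the remainder of the period."""
    busy = cycles / F_MAX
    return p_dyn(F_MAX) * busy + p_idle * (period_s - busy)

def energy_slowdown(cycles, period_s):
    """Run at the just-in-time frequency for the whole period."""
    f_min = cycles / period_s
    return p_dyn(f_min) * period_s
```

For 1e9 cycles in a 1-second period this model favors slowdown (0.9375 J vs 2.25 J); with a higher static/idle floor or a flatter power curve the comparison flips, which is exactly why the paper computes a per-thread break-even point rather than fixing one strategy.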

18:32  IP2-5, 165  TOWARDS GENERAL PURPOSE COMPUTATIONS ON LOW-END MOBILE GPUS
Speaker:
Leonidas Kosmidis, Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES
Authors:
Matina Maria Trompouki1 and Leonidas Kosmidis2
1Universitat Politècnica de Catalunya, ES; 2Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES
Abstract
GPUs traditionally offer high computational capability, frequently higher than their CPU counterparts. While vendors of high-end mobile GPUs have recently introduced general-purpose APIs, such as OpenCL, to leverage this computational power, the vast majority of mobile devices lack such support. Although their graphics APIs have similarities with desktop graphics APIs, they also have significant differences that prevent the use of well-known techniques for performing general-purpose computation over such interfaces. In this paper we show how these obstacles can be overcome in order to achieve general-purpose programmability of these devices. As a proof of concept, we implemented our proposal on a real embedded platform (Raspberry Pi) based on Broadcom's VideoCore IV GPU, obtaining a speedup of 7.2x over the CPU.

18:30  End of session