From mboxrd@z Thu Jan  1 00:00:00 1970
From: Daniel Thompson <daniel.thompson@linaro.org>
To: Thomas Gleixner, John Stultz
Cc: Daniel Thompson, linux-kernel@vger.kernel.org, patches@linaro.org,
	linaro-kernel@lists.linaro.org, Sumit Semwal, Stephen Boyd,
	Steven Rostedt
Subject: [PATCH v5 0/5] sched_clock: Optimize and avoid deadlock during read from NMI
Date: Mon, 2 Mar 2015 15:56:39 +0000
Message-Id: <1425311804-3392-1-git-send-email-daniel.thompson@linaro.org>
X-Mailer: git-send-email 2.1.0
In-Reply-To: <1421859236-19782-1-git-send-email-daniel.thompson@linaro.org>
References: <1421859236-19782-1-git-send-email-daniel.thompson@linaro.org>
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

This patchset optimizes the generic sched_clock implementation by
removing branches and significantly reducing the data cache profile. It
also makes it safe to call sched_clock() from NMI (or FIQ on ARM).

The data cache profile of sched_clock() in the original code is
somewhere between 2 and 3 (64-byte) cache lines, depending on the
alignment of struct clock_data. After patching, the cache profile for
the normal case should be a single cache line.

NMI safety was tested on i.MX6 with perf drowning the system in FIQs
and using the perf handler to check that sched_clock() returned
monotonic values. At the same time I forcefully reduced kt_wrap so
that update_sched_clock() was being called at >1000Hz.

Without the patches the above system is grossly unstable, surviving
only 9K, 115K and 25K perf event cycles during three separate runs.
With the patches applied I ran for over 9M perf event cycles before
getting bored.

Performance testing has primarily been performed using a simple tight
loop test (i.e. one that is unlikely to benefit from the cache profile
improvements). Summary results show a benefit on all CPUs, although the
magnitude varies significantly:

  Cortex A9 @ 792MHz     4.1% speedup
  Cortex A9 @ 1GHz       0.4% speedup (different SoC to above)
  Scorpion              13.6% speedup
  Krait                 35.1% speedup
  Cortex A53 @ 1GHz      1.6% speedup
  Cortex A57 @ 1GHz      5.0% speedup

Benchmarking was done by Stephen Boyd and myself; full data for the
above summaries can be found here:
https://docs.google.com/spreadsheets/d/1Zd2xN42U4oAVZcArqAYdAWgFI5oDFRysURCSYNmBpZA/edit?usp=sharing
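For reviewers who want the shape of the end result up front: the NMI
safety in patch 5 comes from keeping two banks of the read-only
clock_read_data and letting the low bit of the sequence counter steer
readers to whichever bank is not being updated. The read side then
becomes a lock-free retry loop, roughly like the sketch below (a
simplified outline rather than the exact code; names such as cd,
read_data, epoch_cyc and cyc_to_ns only approximate those used in the
patches):

	u64 cyc, res;
	unsigned long seq;
	struct clock_read_data *rd;

	do {
		/* the low bit of seq selects whichever bank is not
		 * currently being written by the updater */
		seq = raw_read_seqcount(&cd.seq);
		rd = cd.read_data + (seq & 1);

		/* compute ns from the selected, consistent snapshot */
		cyc = (rd->read_sched_clock() - rd->epoch_cyc) &
		      rd->sched_clock_mask;
		res = rd->epoch_ns + cyc_to_ns(cyc, rd->mult, rd->shift);
	} while (read_seqcount_retry(&cd.seq, seq));

	return res;

The updater writes the spare bank first, bumps the sequence count to
make it live, then brings the other bank up to date, so an NMI or FIQ
that interrupts an update mid-way always finds one consistent bank to
read.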
v5:
* Summarized benchmark results in the patchset cover letter and added
  some Reviewed-by:s.
* Rebased on 4.0-rc1.

v4:
* Optimized sched_clock() to be branchless by introducing a dummy
  function to provide clock values while the clock is suspended
  (Stephen Boyd).
* Improved commenting, including the kerneldoc comments (Stephen Boyd).
* Removed a redundant notrace from the update logic (Steven Rostedt).

v3:
* Optimized to minimise cache profile, including elimination of the
  suspended flag (Thomas Gleixner).
* Replaced the update_bank_begin/end with a single update function
  (Thomas Gleixner).
* Split into multiple patches to aid review.

v2:
* Extended the scope of the read lock in sched_clock() so we can bank
  all data consumed there (John Stultz)

Daniel Thompson (5):
  sched_clock: Match scope of read and write seqcounts
  sched_clock: Optimize cache line usage
  sched_clock: Remove suspend from clock_read_data
  sched_clock: Remove redundant notrace from update function
  sched_clock: Avoid deadlock during read from NMI

 kernel/time/sched_clock.c | 195 ++++++++++++++++++++++++++++++++--------------
 1 file changed, 138 insertions(+), 57 deletions(-)

--
2.1.0