From: Marco Elver <elver@google.com>
To: elver@google.com, Peter Zijlstra <peterz@infradead.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	Ingo Molnar <mingo@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>, linux-sh@vger.kernel.org,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	x86@kernel.org, linuxppc-dev@lists.ozlabs.org,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
	kasan-dev@googlegroups.com, Namhyung Kim <namhyung@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>, Jiri Olsa <jolsa@redhat.com>,
	Dmitry Vyukov <dvyukov@google.com>
Subject: [PATCH v4 00/14] perf/hw_breakpoint: Optimize for thousands of tasks
Date: Mon, 29 Aug 2022 14:47:05 +0200
Message-ID: <20220829124719.675715-1-elver@google.com>

The hw_breakpoint subsystem's code has seen little change in over 10
years. In that time, systems with >100s of CPUs have become common,
along with improvements to the perf subsystem: using breakpoints on
thousands of concurrent tasks should be a supported usecase.

The breakpoint constraints accounting algorithm is the major bottleneck
in doing so:

  1. toggle_bp_slot() and fetch_bp_busy_slots() are O(#cpus * #tasks):
     Both iterate through all CPUs and call task_bp_pinned(), which is
     O(#tasks).

  2. Everything is serialized on a global mutex, 'nr_bp_mutex'.

The series progresses with the simpler optimizations and finishes with
the more complex optimizations:

  1. We first optimize task_bp_pinned() to only take O(1) on average.

  2. Rework synchronization to allow concurrency when checking and
     updating breakpoint constraints for tasks.

  3. Eliminate the O(#cpus) loops in the CPU-independent case.

Along the way, smaller micro-optimizations and cleanups are done as they
seemed obvious when staring at the code (but likely insignificant).
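As a rough illustration of step 1 above (not code from this series): one
common way to make a per-task lookup O(1) on average is to key the
breakpoints by their owning task in a hash table, so that counting one
task's pinned slots walks a single bucket instead of every breakpoint of
every task. The sketch below is a userspace model only. All names in it
(struct bp, bp_insert(), task_pinned_slots(), BUCKET_BITS) are
hypothetical, and whether it matches the data structure actually used
should be checked against patch 04/14.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are the kernel's. */
#define BUCKET_BITS 10
#define NBUCKETS    (1u << BUCKET_BITS)

struct bp {
	const void *task;	/* owning task, used as the hash key */
	int weight;		/* number of slots this breakpoint occupies */
	struct bp *next;	/* chain of breakpoints in the same bucket */
};

static struct bp *buckets[NBUCKETS];

/* Multiplicative hash of the task pointer; keeps the top BUCKET_BITS bits. */
static size_t hash_task(const void *task)
{
	uint64_t h = (uint64_t)(uintptr_t)task * 0x9E3779B97F4A7C15ull;

	return (size_t)(h >> (64 - BUCKET_BITS));
}

/* Link a breakpoint at the head of its task's bucket: O(1). */
static void bp_insert(struct bp *bp)
{
	size_t i = hash_task(bp->task);

	bp->next = buckets[i];
	buckets[i] = bp;
}

/*
 * Sum the slots pinned by one task. Only the bucket that the task hashes
 * to is walked, so the cost does not grow with the total number of tasks
 * as long as the buckets stay short.
 */
static int task_pinned_slots(const void *task)
{
	int slots = 0;

	for (const struct bp *bp = buckets[hash_task(task)]; bp; bp = bp->next)
		if (bp->task == task)
			slots += bp->weight;
	return slots;
}
```

The comparison on bp->task inside the loop is what keeps the result
correct when two different tasks collide in one bucket; the hash only
narrows the search.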

The result is (on a system with 256 CPUs) that we go from:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
	[ ^ more aggressive benchmark parameters took too long ]
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 236.418 [sec]
 |
 |   123134.794271 usecs/op
 |  7880626.833333 usecs/op/cpu

... to the following with all optimizations:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.067 [sec]
 |
 |       35.292187 usecs/op
 |     2258.700000 usecs/op/cpu

On the used test system, that's an effective speedup of ~3490x per op.

Which is on par with the theoretical ideal performance through
optimizations in hw_breakpoint.c (constraints accounting disabled), and
only 12% slower than no breakpoints at all.

Changelog
---------

v4:
* Fix percpu_is_read_locked(): Due to spurious read_count increments in
  __percpu_down_read_trylock() if sem->block != 0, check that
  !sem->block (reported by Peter).
* Apply Reviewed/Acked-by.

v3: https://lkml.kernel.org/r/20220704150514.48816-1-elver@google.com
* Fix typos.
* Introduce hw_breakpoint_is_used() for the test.
* Add WARN_ON in bp_slots_histogram_add().
* Don't use raw_smp_processor_id() in test.
* Apply Acked-by/Reviewed-by given in v2 for mostly unchanged patches.

v2: https://lkml.kernel.org/r/20220628095833.2579903-1-elver@google.com
* Add KUnit test suite.
* Remove struct bp_busy_slots and simplify functions.
* Add "powerpc/hw_breakpoint: Avoid relying on caller synchronization".
* Add "locking/percpu-rwsem: Add percpu_is_write_locked() and
  percpu_is_read_locked()".
* Use percpu-rwsem instead of rwlock.
* Use task_struct::perf_event_mutex instead of sharded mutex.
* Drop v1 "perf/hw_breakpoint: Optimize task_bp_pinned() if
  CPU-independent".
* Add "perf/hw_breakpoint: Introduce bp_slots_histogram".
* Add "perf/hw_breakpoint: Optimize max_bp_pinned_slots() for
  CPU-independent task targets".
* Add "perf/hw_breakpoint: Optimize toggle_bp_slot() for
  CPU-independent task targets".
* Apply Acked-by/Reviewed-by given in v1 for unchanged patches.
==> Speedup of ~3490x (vs. ~3315x in v1).

v1: https://lore.kernel.org/all/20220609113046.780504-1-elver@google.com/

Marco Elver (14):
  perf/hw_breakpoint: Add KUnit test for constraints accounting
  perf/hw_breakpoint: Provide hw_breakpoint_is_used() and use in test
  perf/hw_breakpoint: Clean up headers
  perf/hw_breakpoint: Optimize list of per-task breakpoints
  perf/hw_breakpoint: Mark data __ro_after_init
  perf/hw_breakpoint: Optimize constant number of breakpoint slots
  perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
  perf/hw_breakpoint: Remove useless code related to flexible
    breakpoints
  powerpc/hw_breakpoint: Avoid relying on caller synchronization
  locking/percpu-rwsem: Add percpu_is_write_locked() and
    percpu_is_read_locked()
  perf/hw_breakpoint: Reduce contention with large number of tasks
  perf/hw_breakpoint: Introduce bp_slots_histogram
  perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent
    task targets
  perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task
    targets

 arch/powerpc/kernel/hw_breakpoint.c  |  53 ++-
 arch/sh/include/asm/hw_breakpoint.h  |   5 +-
 arch/x86/include/asm/hw_breakpoint.h |   5 +-
 include/linux/hw_breakpoint.h        |   4 +-
 include/linux/percpu-rwsem.h         |   6 +
 include/linux/perf_event.h           |   3 +-
 kernel/events/Makefile               |   1 +
 kernel/events/hw_breakpoint.c        | 638 ++++++++++++++++++++-------
 kernel/events/hw_breakpoint_test.c   | 333 ++++++++++++++
 kernel/locking/percpu-rwsem.c        |   6 +
 lib/Kconfig.debug                    |  10 +
 11 files changed, 885 insertions(+), 179 deletions(-)
 create mode 100644 kernel/events/hw_breakpoint_test.c

-- 
2.37.2.672.g94769d06f0-goog
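
As a sanity check on the figures quoted in the results above, the ~3490x
speedup can be recomputed from the two usecs/op values. The helper names
below are made up for illustration. The per_cpu() relation (usecs/op
multiplied by the -p 64 parallelism) is an observation that happens to
reproduce both usecs/op/cpu figures, not a statement about how perf
bench defines that metric.

```c
#include <assert.h>

/* Values copied from the two benchmark runs quoted in the cover letter. */
static const double before_usecs_per_op = 123134.794271;
static const double after_usecs_per_op  = 35.292187;
static const int    parallelism         = 64;	/* the -p 64 argument */

/* Per-op speedup: time before divided by time after. */
static double speedup(double before, double after)
{
	return before / after;
}

/*
 * Assumed relation between the two reported metrics: scaling usecs/op by
 * the parallelism reproduces the quoted usecs/op/cpu values.
 */
static double per_cpu(double usecs_per_op, int par)
{
	return usecs_per_op * par;
}
```

With these inputs, speedup(before_usecs_per_op, after_usecs_per_op)
comes out slightly above 3489, which the cover letter rounds to ~3490x.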