All of lore.kernel.org
 help / color / mirror / Atom feed
From: Marco Elver <elver@google.com>
To: elver@google.com, Peter Zijlstra <peterz@infradead.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@redhat.com>, Namhyung Kim <namhyung@kernel.org>,
	Dmitry Vyukov <dvyukov@google.com>,
	linux-perf-users@vger.kernel.org, x86@kernel.org,
	linux-sh@vger.kernel.org, kasan-dev@googlegroups.com,
	linux-kernel@vger.kernel.org
Subject: [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks
Date: Thu,  9 Jun 2022 13:30:38 +0200	[thread overview]
Message-ID: <20220609113046.780504-1-elver@google.com> (raw)

The hw_breakpoint subsystem's code has seen little change in over 10
years. In that time, systems with >100s of CPUs have become common,
along with improvements to the perf subsystem: using breakpoints on
thousands of concurrent tasks should be a supported usecase.

The breakpoint constraints accounting algorithm is the major bottleneck
in doing so:

  1. task_bp_pinned() has been O(#tasks), and called twice for each CPU.

  2. Everything is serialized on a global mutex, 'nr_bp_mutex'.

This series first optimizes task_bp_pinned() to only take O(1) on
average, and then reworks synchronization to allow concurrency when
checking and updating breakpoint constraints for tasks. Along the way,
smaller micro-optimizations and cleanups are done as they seemed obvious
when staring at the code (but likely insignificant).

The result is (on a system with 256 CPUs) that we go from:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
	 	[ ^ more aggressive benchmark parameters took too long ]
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 236.418 [sec]
 |
 |   123134.794271 usecs/op
 |  7880626.833333 usecs/op/cpu

... to -- with all optimizations:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.071 [sec]
 |
 |       37.134896 usecs/op
 |     2376.633333 usecs/op/cpu

On the used test system, that's an effective speedup of ~3315x per op.

Which is close to the theoretical ideal performance through
optimizations in hw_breakpoint.c -- for reference, constraints
accounting disabled:

 | perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.067 [sec]
 |
 |       35.286458 usecs/op
 |     2258.333333 usecs/op/cpu

At this point, the current implementation is only ~5% slower than the
theoretical ideal. However, given constraints accounting cannot
realistically be disabled, this is likely as far as we can push it.

Marco Elver (8):
  perf/hw_breakpoint: Optimize list of per-task breakpoints
  perf/hw_breakpoint: Mark data __ro_after_init
  perf/hw_breakpoint: Optimize constant number of breakpoint slots
  perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
  perf/hw_breakpoint: Remove useless code related to flexible
    breakpoints
  perf/hw_breakpoint: Reduce contention with large number of tasks
  perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent
  perf/hw_breakpoint: Clean up headers

 arch/sh/include/asm/hw_breakpoint.h  |   5 +-
 arch/x86/include/asm/hw_breakpoint.h |   5 +-
 include/linux/hw_breakpoint.h        |   1 -
 include/linux/perf_event.h           |   3 +-
 kernel/events/hw_breakpoint.c        | 374 +++++++++++++++++++--------
 5 files changed, 276 insertions(+), 112 deletions(-)

-- 
2.36.1.255.ge46751e96f-goog

             reply	other threads:[~2022-06-09 11:31 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-09 11:30 Marco Elver [this message]
2022-06-09 11:30 ` [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints Marco Elver
2022-06-09 12:30   ` Dmitry Vyukov
2022-06-09 12:53     ` Marco Elver
2022-06-09 13:05       ` Dmitry Vyukov
2022-06-09 14:29   ` Dmitry Vyukov
2022-06-09 14:55     ` Marco Elver
2022-06-09 16:53       ` Dmitry Vyukov
2022-06-09 18:37         ` Marco Elver
2022-06-10  9:04           ` Dmitry Vyukov
2022-06-10  9:36             ` Marco Elver
2022-06-09 11:30 ` [PATCH 2/8] perf/hw_breakpoint: Mark data __ro_after_init Marco Elver
2022-06-09 11:45   ` Dmitry Vyukov
2022-06-09 11:30 ` [PATCH 3/8] perf/hw_breakpoint: Optimize constant number of breakpoint slots Marco Elver
2022-06-09 11:55   ` Dmitry Vyukov
2022-06-09 11:30 ` [PATCH 4/8] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable Marco Elver
2022-06-09 12:03   ` Dmitry Vyukov
2022-06-09 12:08     ` Marco Elver
2022-06-09 12:23       ` Dmitry Vyukov
2022-06-09 13:25     ` Peter Zijlstra
2022-06-09 11:30 ` [PATCH 5/8] perf/hw_breakpoint: Remove useless code related to flexible breakpoints Marco Elver
2022-06-09 12:04   ` Dmitry Vyukov
2022-06-09 13:41     ` Dmitry Vyukov
2022-06-09 14:00       ` Marco Elver
2022-06-09 11:30 ` [PATCH 6/8] perf/hw_breakpoint: Reduce contention with large number of tasks Marco Elver
2022-06-09 13:03   ` Dmitry Vyukov
2022-06-09 13:29     ` Marco Elver
2022-06-09 11:30 ` [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent Marco Elver
2022-06-09 15:00   ` Dmitry Vyukov
2022-06-10  8:25     ` Marco Elver
2022-06-10  9:13       ` Dmitry Vyukov
2022-06-09 11:30 ` [PATCH 8/8] perf/hw_breakpoint: Clean up headers Marco Elver
2022-06-09 12:11   ` Dmitry Vyukov
2022-06-09 12:28 ` [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Dmitry Vyukov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220609113046.780504-1-elver@google.com \
    --to=elver@google.com \
    --cc=acme@kernel.org \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=dvyukov@google.com \
    --cc=frederic@kernel.org \
    --cc=jolsa@redhat.com \
    --cc=kasan-dev@googlegroups.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=linux-sh@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mingo@kernel.org \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.