All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dmitry Vyukov <dvyukov@google.com>
To: Marco Elver <elver@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	Ingo Molnar <mingo@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@redhat.com>, Namhyung Kim <namhyung@kernel.org>,
	linux-perf-users@vger.kernel.org, x86@kernel.org,
	linux-sh@vger.kernel.org, kasan-dev@googlegroups.com,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent
Date: Thu, 9 Jun 2022 17:00:22 +0200	[thread overview]
Message-ID: <CACT4Y+bGPLampPm7JHJeXeK_CwQ2_=3mRktPCh7T9r3y8r02hw@mail.gmail.com> (raw)
In-Reply-To: <20220609113046.780504-8-elver@google.com>

On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
>
> Running the perf benchmark with (note: more aggressive parameters vs.
> preceding changes, but same host with 256 CPUs):
>
>  | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
>  |      Total time: 1.953 [sec]
>  |
>  |       38.146289 usecs/op
>  |     4882.725000 usecs/op/cpu
>
>     16.29%  [kernel]       [k] rhashtable_jhash2
>     16.19%  [kernel]       [k] osq_lock
>     14.22%  [kernel]       [k] queued_spin_lock_slowpath
>      8.58%  [kernel]       [k] task_bp_pinned
>      8.30%  [kernel]       [k] mutex_spin_on_owner
>      4.03%  [kernel]       [k] smp_cfm_core_cond
>      2.97%  [kernel]       [k] toggle_bp_slot
>      2.94%  [kernel]       [k] bcmp
>
> We can see that a majority of the time is now spent hashing task
> pointers to index into task_bps_ht in task_bp_pinned().
>
> However, if task_bp_pinned()'s computation is independent of any CPU,
> i.e. always `iter->cpu < 0`, the result for each invocation will be
> identical. With increasing CPU-count, this problem worsens.
>
> Instead, identify if every call to task_bp_pinned() is CPU-independent,
> and cache the result. Use the cached result instead of a call to
> task_bp_pinned(), now __task_bp_pinned(), with task_bp_pinned() deciding
> if the cached result can be used.
>
> After this optimization:
>
>     21.96%  [kernel]       [k] queued_spin_lock_slowpath
>     16.39%  [kernel]       [k] osq_lock
>      9.82%  [kernel]       [k] toggle_bp_slot
>      9.81%  [kernel]       [k] find_next_bit
>      4.93%  [kernel]       [k] mutex_spin_on_owner
>      4.71%  [kernel]       [k] smp_cfm_core_cond
>      4.30%  [kernel]       [k] __reserve_bp_slot
>      2.65%  [kernel]       [k] cpumask_next
>
> Showing that the time spent hashing keys has become insignificant.
>
> With the given benchmark parameters, however, we see no statistically
> significant improvement in performance on the test system with 256 CPUs.
> This is very likely due to the benchmark parameters being too aggressive
> and contention elsewhere becoming dominant.
>
> Indeed, when using the less aggressive parameters from the preceding
> changes, we now observe:
>
>  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
>  |      Total time: 0.071 [sec]
>  |
>  |       37.134896 usecs/op
>  |     2376.633333 usecs/op/cpu
>
> Which is an improvement of 12% compared to without this optimization
> (baseline is 42 usecs/op). This is now only 5% slower than the
> theoretical ideal (constraints disabled), and 18% slower than no
> breakpoints at all.
>
> [ While we're here, swap task_bp_pinned()'s bp and cpu arguments to be
>   more consistent with other functions (which have bp first, before the
>   cpu argument). ]

There are 3 main cases:
1. Per-cpu bp.
2. Per-task and per-cpu bp.
3. Per-task bp (on all cpus)
right?

For case 1 we still seem to do lots of unnecessary work in
fetch_bp_busy_slots() by iterating over all CPUs. We are going to bump
only the CPU's cpu_pinned, so that's the only CPU we need to
fetch/check.

For case 2 we also do lots of unnecessary work, again we also need to
check only 1 CPU (don't need cached_tbp_pinned). Also don't need to do
atomic_dec/inc on all other CPUs (they dec/inc the same variable).

Case 3 is the only one when we need to check all CPUs and
cached_tbp_pinned may be useful.
But I wonder if we could instead add a per-task
has_per_cpu_breakpoints flag. Then if the flag is set, we check all
CPUs as we do now (don't need cached_tbp_pinned). And if it's not set,
then we could optimize the code even more by making it O(1) instead of
O(N). Namely, we add global tsk_pinned for tasks that don't have
per-cpu breakpoints, and we update only that tsk_pinned instead of
iterating over all CPUs.
I think this will require adding cpu_pinned as well (similar to
tsk_pinned but aggregated over all CPUs).
Then the fast path capacity check can become just:

if (bp->hw.target && !bp->hw.target->has_per_cpu_breakpoints && bp->cpu < 0) {
  if (max_cpu_bp_pinned(type) + task_bp_pinned(-1 /*cpu*/, bp, type) +
hw_breakpoint_weight(bp) > nr_slots[type])
    return -ENOSPC;
}

Does it make any sense?

  reply	other threads:[~2022-06-09 15:00 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-09 11:30 [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Marco Elver
2022-06-09 11:30 ` [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints Marco Elver
2022-06-09 12:30   ` Dmitry Vyukov
2022-06-09 12:53     ` Marco Elver
2022-06-09 13:05       ` Dmitry Vyukov
2022-06-09 14:29   ` Dmitry Vyukov
2022-06-09 14:55     ` Marco Elver
2022-06-09 16:53       ` Dmitry Vyukov
2022-06-09 18:37         ` Marco Elver
2022-06-10  9:04           ` Dmitry Vyukov
2022-06-10  9:36             ` Marco Elver
2022-06-09 11:30 ` [PATCH 2/8] perf/hw_breakpoint: Mark data __ro_after_init Marco Elver
2022-06-09 11:45   ` Dmitry Vyukov
2022-06-09 11:30 ` [PATCH 3/8] perf/hw_breakpoint: Optimize constant number of breakpoint slots Marco Elver
2022-06-09 11:55   ` Dmitry Vyukov
2022-06-09 11:30 ` [PATCH 4/8] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable Marco Elver
2022-06-09 12:03   ` Dmitry Vyukov
2022-06-09 12:08     ` Marco Elver
2022-06-09 12:23       ` Dmitry Vyukov
2022-06-09 13:25     ` Peter Zijlstra
2022-06-09 11:30 ` [PATCH 5/8] perf/hw_breakpoint: Remove useless code related to flexible breakpoints Marco Elver
2022-06-09 12:04   ` Dmitry Vyukov
2022-06-09 13:41     ` Dmitry Vyukov
2022-06-09 14:00       ` Marco Elver
2022-06-09 11:30 ` [PATCH 6/8] perf/hw_breakpoint: Reduce contention with large number of tasks Marco Elver
2022-06-09 13:03   ` Dmitry Vyukov
2022-06-09 13:29     ` Marco Elver
2022-06-09 11:30 ` [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent Marco Elver
2022-06-09 15:00   ` Dmitry Vyukov [this message]
2022-06-10  8:25     ` Marco Elver
2022-06-10  9:13       ` Dmitry Vyukov
2022-06-09 11:30 ` [PATCH 8/8] perf/hw_breakpoint: Clean up headers Marco Elver
2022-06-09 12:11   ` Dmitry Vyukov
2022-06-09 12:28 ` [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Dmitry Vyukov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CACT4Y+bGPLampPm7JHJeXeK_CwQ2_=3mRktPCh7T9r3y8r02hw@mail.gmail.com' \
    --to=dvyukov@google.com \
    --cc=acme@kernel.org \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=elver@google.com \
    --cc=frederic@kernel.org \
    --cc=jolsa@redhat.com \
    --cc=kasan-dev@googlegroups.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=linux-sh@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mingo@kernel.org \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.