From: Marco Elver <elver@google.com>
To: Dmitry Vyukov <dvyukov@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	Ingo Molnar <mingo@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@redhat.com>, Namhyung Kim <namhyung@kernel.org>,
	linux-perf-users@vger.kernel.org, x86@kernel.org,
	linux-sh@vger.kernel.org, kasan-dev@googlegroups.com,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent
Date: Fri, 10 Jun 2022 10:25:00 +0200	[thread overview]
Message-ID: <CANpmjNNwOOYxOXLixrUD25YoszYcy7SRwXMMfrj5zZvrETkp0g@mail.gmail.com> (raw)
In-Reply-To: <CACT4Y+bGPLampPm7JHJeXeK_CwQ2_=3mRktPCh7T9r3y8r02hw@mail.gmail.com>

On Thu, 9 Jun 2022 at 17:00, 'Dmitry Vyukov' via kasan-dev
<kasan-dev@googlegroups.com> wrote:
>
> On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
> >
> > Running the perf benchmark with (note: more aggressive parameters vs.
> > preceding changes, but same host with 256 CPUs):
> >
> >  | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> >  | # Running 'breakpoint/thread' benchmark:
> >  | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
> >  |      Total time: 1.953 [sec]
> >  |
> >  |       38.146289 usecs/op
> >  |     4882.725000 usecs/op/cpu
> >
> >     16.29%  [kernel]       [k] rhashtable_jhash2
> >     16.19%  [kernel]       [k] osq_lock
> >     14.22%  [kernel]       [k] queued_spin_lock_slowpath
> >      8.58%  [kernel]       [k] task_bp_pinned
> >      8.30%  [kernel]       [k] mutex_spin_on_owner
> >      4.03%  [kernel]       [k] smp_cfm_core_cond
> >      2.97%  [kernel]       [k] toggle_bp_slot
> >      2.94%  [kernel]       [k] bcmp
> >
> > We can see that a majority of the time is now spent hashing task
> > pointers to index into task_bps_ht in task_bp_pinned().
> >
> > However, if task_bp_pinned()'s computation is independent of any CPU,
> > i.e. always `iter->cpu < 0`, the result for each invocation will be
> > identical. With increasing CPU-count, this problem worsens.
> >
> > Instead, identify if every call to task_bp_pinned() is CPU-independent,
> > and cache the result. Use the cached result instead of a call to
> > task_bp_pinned(), now __task_bp_pinned(), with task_bp_pinned() deciding
> > if the cached result can be used.
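
For illustration, here is a rough user-space model of the caching idea
(the names, types and cache plumbing below are simplified stand-ins,
not the actual kernel code):

struct bp { int cpu; int weight; };

static struct bp *task_bps[16];	/* breakpoints owned by one task */
static int num_task_bps;

/* The expensive part: walk all of the task's breakpoints for one CPU. */
static int __task_bp_pinned(int cpu)
{
	int i, count = 0;

	for (i = 0; i < num_task_bps; i++) {
		if (task_bps[i]->cpu < 0 || task_bps[i]->cpu == cpu)
			count += task_bps[i]->weight;
	}
	return count;
}

/*
 * If every breakpoint of the task is CPU-independent (cpu == -1), the
 * sum above is the same for every CPU, so compute it once and reuse
 * the cached value on subsequent per-CPU invocations.
 */
static int task_bp_pinned(int cpu, int *cached_tbp_pinned)
{
	int i, ret, all_cpu_independent = 1;

	/* The caller initializes *cached_tbp_pinned to -1 ("no cached value"). */
	if (*cached_tbp_pinned >= 0)
		return *cached_tbp_pinned;

	ret = __task_bp_pinned(cpu);

	for (i = 0; i < num_task_bps; i++)
		all_cpu_independent &= task_bps[i]->cpu < 0;

	if (all_cpu_independent)
		*cached_tbp_pinned = ret;

	return ret;
}

In the kernel, the cached value is only valid for the duration of a
single slot-reservation check; the model above leaves that lifetime to
the caller.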
> >
> > After this optimization:
> >
> >     21.96%  [kernel]       [k] queued_spin_lock_slowpath
> >     16.39%  [kernel]       [k] osq_lock
> >      9.82%  [kernel]       [k] toggle_bp_slot
> >      9.81%  [kernel]       [k] find_next_bit
> >      4.93%  [kernel]       [k] mutex_spin_on_owner
> >      4.71%  [kernel]       [k] smp_cfm_core_cond
> >      4.30%  [kernel]       [k] __reserve_bp_slot
> >      2.65%  [kernel]       [k] cpumask_next
> >
> > Showing that the time spent hashing keys has become insignificant.
> >
> > With the given benchmark parameters, however, we see no statistically
> > significant improvement in performance on the test system with 256 CPUs.
> > This is very likely due to the benchmark parameters being too aggressive
> > and contention elsewhere becoming dominant.
> >
> > Indeed, when using the less aggressive parameters from the preceding
> > changes, we now observe:
> >
> >  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> >  | # Running 'breakpoint/thread' benchmark:
> >  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> >  |      Total time: 0.071 [sec]
> >  |
> >  |       37.134896 usecs/op
> >  |     2376.633333 usecs/op/cpu
> >
> > Which is an improvement of 12% compared to without this optimization
> > (baseline is 42 usecs/op). This is now only 5% slower than the
> > theoretical ideal (constraints disabled), and 18% slower than no
> > breakpoints at all.
> >
> > [ While we're here, swap task_bp_pinned()'s bp and cpu arguments to be
> >   more consistent with other functions (which have bp first, before the
> >   cpu argument). ]
>
> There are 3 main cases:
> 1. Per-cpu bp.

Yes, CPU-target breakpoint on just 1 CPU.

> 2. Per-task and per-cpu bp.

Task-target breakpoint but pinned to 1 CPU.

> 3. Per-task bp (on all cpus)

Task-target breakpoint that can run on all CPUs.

> right?
>
> For case 1 we still seem to do lots of unnecessary work in
> fetch_bp_busy_slots() by iterating over all CPUs. We are going to bump
> only the CPU's cpu_pinned, so that's the only CPU we need to
> fetch/check.

It'll just do 1 iteration, because cpumask_of_bp() will return a mask
with just the event's target CPU in it.
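
In other words, the per-CPU loop degenerates to a single iteration for
CPU-pinned breakpoints. A minimal sketch of that behavior (this is not
the kernel's cpumask API; for_each_relevant_cpu() and its parameters
are made up for illustration):

/* Visit either the one target CPU, or all CPUs for a CPU-independent bp. */
static void for_each_relevant_cpu(int bp_cpu, int nr_cpus, void (*fn)(int cpu))
{
	int first = bp_cpu >= 0 ? bp_cpu : 0;
	int last = bp_cpu >= 0 ? bp_cpu : nr_cpus - 1;
	int cpu;

	for (cpu = first; cpu <= last; cpu++)
		fn(cpu);	/* cases 1 and 2: exactly one iteration */
}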

> For case 2 we also do lots of unnecessary work, again we also need to
> check only 1 CPU (don't need cached_tbp_pinned). Also don't need to do
> atomic_dec/inc on all other CPUs (they dec/inc the same variable).

Same as above, just 1 iteration, because cpumask_of_bp() does the right
thing. cached_tbp_pinned may still be used if all existing task
breakpoints are CPU-independent (i.e. cpu == -1); granted, calling
task_bp_pinned() once or twice is probably irrelevant in this case.

> Case 3 is the only one when we need to check all CPUs and
> cached_tbp_pinned may be useful.
> But I wonder if we could instead add a per-task
> has_per_cpu_breakpoints flag. Then if the flag is set, we check all
> CPUs as we do now (don't need cached_tbp_pinned). And if it's not set,
> then we could optimize the code even more by making it O(1) instead of
> O(N).

> Namely, we add global tsk_pinned for tasks that don't have
> per-cpu breakpoints, and we update only that tsk_pinned instead of
> iterating over all CPUs.

That seems reasonable.

> I think this will require adding cpu_pinned as well (similar to
> tsk_pinned but aggregated over all CPUs).

> Then the fast path capacity check can become just:
>
> if (bp->hw.target && !bp->hw.target->has_per_cpu_breakpoints && bp->cpu < 0) {
> 	if (max_cpu_bp_pinned(type) + task_bp_pinned(-1 /*cpu*/, bp, type) +
> 	    hw_breakpoint_weight(bp) > nr_slots[type])
> 		return -ENOSPC;
> }
>
> Does it make any sense?

Yes, I think this might work. I'll see if I can implement it.
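
If it does, the update side would look roughly like the sketch below;
has_per_cpu_breakpoints, the task-wide counter and the per-CPU array
are all hypothetical names standing in for whatever the real data
structures end up being:

struct task_bp_state {
	int has_per_cpu_breakpoints;
	int tsk_pinned_all_cpus;	/* single counter, valid while the flag is 0 */
	int *tsk_pinned_per_cpu;	/* one entry per possible CPU */
};

static void toggle_bp_task_slot(struct task_bp_state *t, int bp_cpu,
				int nr_cpus, int weight, int enable)
{
	int cpu, delta = enable ? weight : -weight;

	if (bp_cpu < 0 && !t->has_per_cpu_breakpoints) {
		/* O(1): every CPU sees the same pinned count for this task. */
		t->tsk_pinned_all_cpus += delta;
		return;
	}

	/* Otherwise fall back to the existing O(nr_cpus) update. */
	if (bp_cpu >= 0) {
		t->tsk_pinned_per_cpu[bp_cpu] += delta;
	} else {
		for (cpu = 0; cpu < nr_cpus; cpu++)
			t->tsk_pinned_per_cpu[cpu] += delta;
	}
}

Your fast-path capacity check would then read this task-wide counter
(plus the aggregated cpu_pinned) instead of walking every CPU.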
