All of lore.kernel.org
 help / color / mirror / Atom feed
From: Marco Elver <elver@google.com>
To: Dmitry Vyukov <dvyukov@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>,
	linux-sh@vger.kernel.org, Peter Zijlstra <peterz@infradead.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	x86@kernel.org, linuxppc-dev@lists.ozlabs.org,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	kasan-dev@googlegroups.com, Namhyung Kim <namhyung@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Jiri Olsa <jolsa@redhat.com>, Ingo Molnar <mingo@kernel.org>
Subject: Re: [PATCH v2 13/13] perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task targets
Date: Tue, 28 Jun 2022 18:00:34 +0200	[thread overview]
Message-ID: <CANpmjNOh9gzzC7sOOOk1q7Ssj2dFxczj1bmufarYS2KupZQthg@mail.gmail.com> (raw)
In-Reply-To: <CACT4Y+aJZzkYHc+YJRApOLG-NYe8zXMaqxpQgQQFAy5WY97Ttg@mail.gmail.com>

On Tue, 28 Jun 2022 at 17:45, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Tue, 28 Jun 2022 at 11:59, Marco Elver <elver@google.com> wrote:
> >
> > We can still see that a majority of the time is spent hashing task pointers:
> >
> >     ...
> >     16.98%  [kernel]       [k] rhashtable_jhash2
> >     ...
> >
> > Doing the bookkeeping in toggle_bp_slots() is currently O(#cpus),
> > calling task_bp_pinned() for each CPU, even if task_bp_pinned() is
> > CPU-independent. The reason for this is to update the per-CPU
> > 'tsk_pinned' histogram.
> >
> > To optimize the CPU-independent case to O(1), keep a separate
> > CPU-independent 'tsk_pinned_all' histogram.
> >
> > The major source of complexity are transitions between "all
> > CPU-independent task breakpoints" and "mixed CPU-independent and
> > CPU-dependent task breakpoints". The code comments list all cases that
> > require handling.
> >
> > After this optimization:
> >
> >  | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> >  |      Total time: 1.758 [sec]
> >  |
> >  |       34.336621 usecs/op
> >  |     4395.087500 usecs/op/cpu
> >
> >     38.08%  [kernel]       [k] queued_spin_lock_slowpath
> >     10.81%  [kernel]       [k] smp_cfm_core_cond
> >      3.01%  [kernel]       [k] update_sg_lb_stats
> >      2.58%  [kernel]       [k] osq_lock
> >      2.57%  [kernel]       [k] llist_reverse_order
> >      1.45%  [kernel]       [k] find_next_bit
> >      1.21%  [kernel]       [k] flush_tlb_func_common
> >      1.01%  [kernel]       [k] arch_install_hw_breakpoint
> >
> > Showing that the time spent hashing keys has become insignificant.
> >
> > With the given benchmark parameters, that's an improvement of 12%
> > compared with the old O(#cpus) version.
> >
> > And finally, using the less aggressive parameters from the preceding
> > changes, we now observe:
> >
> >  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> >  |      Total time: 0.067 [sec]
> >  |
> >  |       35.292187 usecs/op
> >  |     2258.700000 usecs/op/cpu
> >
> > Which is an improvement of 12% compared to without the histogram
> > optimizations (baseline is 40 usecs/op). This is now on par with the
> > theoretical ideal (constraints disabled), and only 12% slower than no
> > breakpoints at all.
> >
> > Signed-off-by: Marco Elver <elver@google.com>
>
> Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
>
> I don't see any bugs. But the code is quite complex. Does it make
> sense to add some asserts to the histogram type? E.g. counters don't
> underflow, weight is not negative (e.g. accidentally added -1 returned
> from task_bp_pinned()). Not sure if it will be enough to catch all
> types of bugs, though.
> Could kunit tests check that histograms are all 0's at the end?
>
> I am not just about the current code (which may be correct), but also
> future modifications to this code.

I'll think of some more options.

bp_slots_histogram_max*() already has asserts (WARN about underflow;
some with KCSAN help).

The main thing I did to raise my own confidence in the code is inject
bugs and see if the KUnit test catches it. If it didn't I extended the
tests. I'll do that some more maybe.

WARNING: multiple messages have this Message-ID (diff)
From: Marco Elver <elver@google.com>
To: Dmitry Vyukov <dvyukov@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	Ingo Molnar <mingo@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@redhat.com>, Namhyung Kim <namhyung@kernel.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	linuxppc-dev@lists.ozlabs.org, linux-perf-users@vger.kernel.org,
	x86@kernel.org, linux-sh@vger.kernel.org,
	kasan-dev@googlegroups.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 13/13] perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task targets
Date: Tue, 28 Jun 2022 18:00:34 +0200	[thread overview]
Message-ID: <CANpmjNOh9gzzC7sOOOk1q7Ssj2dFxczj1bmufarYS2KupZQthg@mail.gmail.com> (raw)
In-Reply-To: <CACT4Y+aJZzkYHc+YJRApOLG-NYe8zXMaqxpQgQQFAy5WY97Ttg@mail.gmail.com>

On Tue, 28 Jun 2022 at 17:45, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Tue, 28 Jun 2022 at 11:59, Marco Elver <elver@google.com> wrote:
> >
> > We can still see that a majority of the time is spent hashing task pointers:
> >
> >     ...
> >     16.98%  [kernel]       [k] rhashtable_jhash2
> >     ...
> >
> > Doing the bookkeeping in toggle_bp_slots() is currently O(#cpus),
> > calling task_bp_pinned() for each CPU, even if task_bp_pinned() is
> > CPU-independent. The reason for this is to update the per-CPU
> > 'tsk_pinned' histogram.
> >
> > To optimize the CPU-independent case to O(1), keep a separate
> > CPU-independent 'tsk_pinned_all' histogram.
> >
> > The major source of complexity are transitions between "all
> > CPU-independent task breakpoints" and "mixed CPU-independent and
> > CPU-dependent task breakpoints". The code comments list all cases that
> > require handling.
> >
> > After this optimization:
> >
> >  | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> >  |      Total time: 1.758 [sec]
> >  |
> >  |       34.336621 usecs/op
> >  |     4395.087500 usecs/op/cpu
> >
> >     38.08%  [kernel]       [k] queued_spin_lock_slowpath
> >     10.81%  [kernel]       [k] smp_cfm_core_cond
> >      3.01%  [kernel]       [k] update_sg_lb_stats
> >      2.58%  [kernel]       [k] osq_lock
> >      2.57%  [kernel]       [k] llist_reverse_order
> >      1.45%  [kernel]       [k] find_next_bit
> >      1.21%  [kernel]       [k] flush_tlb_func_common
> >      1.01%  [kernel]       [k] arch_install_hw_breakpoint
> >
> > Showing that the time spent hashing keys has become insignificant.
> >
> > With the given benchmark parameters, that's an improvement of 12%
> > compared with the old O(#cpus) version.
> >
> > And finally, using the less aggressive parameters from the preceding
> > changes, we now observe:
> >
> >  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> >  |      Total time: 0.067 [sec]
> >  |
> >  |       35.292187 usecs/op
> >  |     2258.700000 usecs/op/cpu
> >
> > Which is an improvement of 12% compared to without the histogram
> > optimizations (baseline is 40 usecs/op). This is now on par with the
> > theoretical ideal (constraints disabled), and only 12% slower than no
> > breakpoints at all.
> >
> > Signed-off-by: Marco Elver <elver@google.com>
>
> Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
>
> I don't see any bugs. But the code is quite complex. Does it make
> sense to add some asserts to the histogram type? E.g. counters don't
> underflow, weight is not negative (e.g. accidentally added -1 returned
> from task_bp_pinned()). Not sure if it will be enough to catch all
> types of bugs, though.
> Could kunit tests check that histograms are all 0's at the end?
>
> I am not just about the current code (which may be correct), but also
> future modifications to this code.

I'll think of some more options.

bp_slots_histogram_max*() already has asserts (WARN about underflow;
some with KCSAN help).

The main thing I did to raise my own confidence in the code is inject
bugs and see if the KUnit test catches it. If it didn't I extended the
tests. I'll do that some more maybe.

  reply	other threads:[~2022-06-28 16:01 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-28  9:58 [PATCH v2 00/13] perf/hw_breakpoint: Optimize for thousands of tasks Marco Elver
2022-06-28  9:58 ` Marco Elver
2022-06-28  9:58 ` [PATCH v2 01/13] perf/hw_breakpoint: Add KUnit test for constraints accounting Marco Elver
2022-06-28  9:58   ` Marco Elver
2022-06-28 12:53   ` Dmitry Vyukov
2022-06-28 12:53     ` Dmitry Vyukov
2022-06-28 13:26     ` Marco Elver
2022-06-28 13:26       ` Marco Elver
2022-06-28  9:58 ` [PATCH v2 02/13] perf/hw_breakpoint: Clean up headers Marco Elver
2022-06-28  9:58   ` Marco Elver
2022-06-28  9:58 ` [PATCH v2 03/13] perf/hw_breakpoint: Optimize list of per-task breakpoints Marco Elver
2022-06-28  9:58   ` Marco Elver
2022-06-28 13:08   ` Dmitry Vyukov
2022-06-28 13:08     ` Dmitry Vyukov
2022-06-28 14:53     ` Marco Elver
2022-06-28 14:53       ` Marco Elver
2022-06-28 15:27       ` Dmitry Vyukov
2022-06-28 15:27         ` Dmitry Vyukov
2022-06-28  9:58 ` [PATCH v2 04/13] perf/hw_breakpoint: Mark data __ro_after_init Marco Elver
2022-06-28  9:58   ` Marco Elver
2022-06-28  9:58 ` [PATCH v2 05/13] perf/hw_breakpoint: Optimize constant number of breakpoint slots Marco Elver
2022-06-28  9:58   ` Marco Elver
2022-06-28  9:58 ` [PATCH v2 06/13] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable Marco Elver
2022-06-28  9:58   ` Marco Elver
2022-06-28 13:16   ` Dmitry Vyukov
2022-06-28 13:16     ` Dmitry Vyukov
2022-06-28  9:58 ` [PATCH v2 07/13] perf/hw_breakpoint: Remove useless code related to flexible breakpoints Marco Elver
2022-06-28  9:58   ` Marco Elver
2022-06-28 13:18   ` Dmitry Vyukov
2022-06-28 13:18     ` Dmitry Vyukov
2022-06-28  9:58 ` [PATCH v2 08/13] powerpc/hw_breakpoint: Avoid relying on caller synchronization Marco Elver
2022-06-28  9:58   ` Marco Elver
2022-06-28 13:21   ` Dmitry Vyukov
2022-06-28 13:21     ` Dmitry Vyukov
2022-07-01  8:54   ` Christophe Leroy
2022-07-01  8:54     ` Christophe Leroy
2022-07-01  9:41     ` Marco Elver
2022-07-01  9:41       ` Marco Elver
2022-07-01 10:15       ` Christophe Leroy
2022-07-01 10:15         ` Christophe Leroy
2022-06-28  9:58 ` [PATCH v2 09/13] locking/percpu-rwsem: Add percpu_is_write_locked() and percpu_is_read_locked() Marco Elver
2022-06-28  9:58   ` Marco Elver
2022-06-28 14:44   ` Dmitry Vyukov
2022-06-28 14:44     ` Dmitry Vyukov
2022-06-28  9:58 ` [PATCH v2 10/13] perf/hw_breakpoint: Reduce contention with large number of tasks Marco Elver
2022-06-28  9:58   ` Marco Elver
2022-06-28 14:45   ` Dmitry Vyukov
2022-06-28 14:45     ` Dmitry Vyukov
2022-06-28  9:58 ` [PATCH v2 11/13] perf/hw_breakpoint: Introduce bp_slots_histogram Marco Elver
2022-06-28  9:58   ` Marco Elver
2022-06-28 14:52   ` Dmitry Vyukov
2022-06-28 14:52     ` Dmitry Vyukov
2022-06-28  9:58 ` [PATCH v2 12/13] perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent task targets Marco Elver
2022-06-28  9:58   ` Marco Elver
2022-06-28 15:41   ` Dmitry Vyukov
2022-06-28 15:41     ` Dmitry Vyukov
2022-06-28  9:58 ` [PATCH v2 13/13] perf/hw_breakpoint: Optimize toggle_bp_slot() " Marco Elver
2022-06-28  9:58   ` Marco Elver
2022-06-28 10:54   ` Marco Elver
2022-06-28 10:54     ` Marco Elver
2022-06-28 15:45   ` Dmitry Vyukov
2022-06-28 15:45     ` Dmitry Vyukov
2022-06-28 16:00     ` Marco Elver [this message]
2022-06-28 16:00       ` Marco Elver

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CANpmjNOh9gzzC7sOOOk1q7Ssj2dFxczj1bmufarYS2KupZQthg@mail.gmail.com \
    --to=elver@google.com \
    --cc=acme@kernel.org \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=dvyukov@google.com \
    --cc=frederic@kernel.org \
    --cc=jolsa@redhat.com \
    --cc=kasan-dev@googlegroups.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=linux-sh@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mark.rutland@arm.com \
    --cc=mingo@kernel.org \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.