* [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks
@ 2022-06-09 11:30 Marco Elver
  2022-06-09 11:30 ` [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints Marco Elver
                   ` (8 more replies)
  0 siblings, 9 replies; 34+ messages in thread
From: Marco Elver @ 2022-06-09 11:30 UTC (permalink / raw)
  To: elver, Peter Zijlstra, Frederic Weisbecker, Ingo Molnar
  Cc: Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Dmitry Vyukov,
	linux-perf-users, x86, linux-sh, kasan-dev, linux-kernel

The hw_breakpoint subsystem's code has seen little change in over 10
years. In that time, systems with hundreds of CPUs have become common,
along with improvements to the perf subsystem: using breakpoints on
thousands of concurrent tasks should be a supported use case.

The breakpoint constraints accounting algorithm is the major bottleneck
in doing so:

  1. task_bp_pinned() is O(#tasks), and is called twice for each CPU
     (see the abridged walk below).

  2. Everything is serialized on a global mutex, 'nr_bp_mutex'.
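
For reference, the pre-series task_bp_pinned() hot path in
kernel/events/hw_breakpoint.c (abridged; it is the code removed in
patch 1/8) is a linear walk over a single global list shared by all
tasks:

	list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
		if (iter->hw.target == tsk &&
		    find_slot_idx(iter->attr.bp_type) == type &&
		    (iter->cpu < 0 || cpu == iter->cpu))
			count += hw_breakpoint_weight(iter);
	}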

This series first optimizes task_bp_pinned() to run in O(1) on
average, and then reworks synchronization to allow concurrency when
checking and updating breakpoint constraints for tasks. Along the way,
smaller micro-optimizations and cleanups are done where they seemed
obvious when staring at the code (though likely insignificant on their
own).

The result is (on a system with 256 CPUs) that we go from:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
	 	[ ^ more aggressive benchmark parameters took too long ]
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 236.418 [sec]
 |
 |   123134.794271 usecs/op
 |  7880626.833333 usecs/op/cpu

... to -- with all optimizations:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.071 [sec]
 |
 |       37.134896 usecs/op
 |     2376.633333 usecs/op/cpu

On the test system used, that's an effective speedup of ~3315x per op.

This is close to the theoretical ideal performance achievable through
optimizations in hw_breakpoint.c alone -- for reference, with
constraints accounting disabled:

 | perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.067 [sec]
 |
 |       35.286458 usecs/op
 |     2258.333333 usecs/op/cpu

At this point, the current implementation is only ~5% slower than the
theoretical ideal. However, given constraints accounting cannot
realistically be disabled, this is likely as far as we can push it.

Marco Elver (8):
  perf/hw_breakpoint: Optimize list of per-task breakpoints
  perf/hw_breakpoint: Mark data __ro_after_init
  perf/hw_breakpoint: Optimize constant number of breakpoint slots
  perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
  perf/hw_breakpoint: Remove useless code related to flexible
    breakpoints
  perf/hw_breakpoint: Reduce contention with large number of tasks
  perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent
  perf/hw_breakpoint: Clean up headers

 arch/sh/include/asm/hw_breakpoint.h  |   5 +-
 arch/x86/include/asm/hw_breakpoint.h |   5 +-
 include/linux/hw_breakpoint.h        |   1 -
 include/linux/perf_event.h           |   3 +-
 kernel/events/hw_breakpoint.c        | 374 +++++++++++++++++++--------
 5 files changed, 276 insertions(+), 112 deletions(-)

-- 
2.36.1.255.ge46751e96f-goog

* [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints
  2022-06-09 11:30 [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Marco Elver
@ 2022-06-09 11:30 ` Marco Elver
  2022-06-09 12:30   ` Dmitry Vyukov
  2022-06-09 14:29   ` Dmitry Vyukov
  2022-06-09 11:30 ` [PATCH 2/8] perf/hw_breakpoint: Mark data __ro_after_init Marco Elver
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 34+ messages in thread
From: Marco Elver @ 2022-06-09 11:30 UTC (permalink / raw)
  To: elver, Peter Zijlstra, Frederic Weisbecker, Ingo Molnar
  Cc: Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Dmitry Vyukov,
	linux-perf-users, x86, linux-sh, kasan-dev, linux-kernel

On a machine with 256 CPUs, running the recently added perf breakpoint
benchmark results in:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 236.418 [sec]
 |
 |   123134.794271 usecs/op
 |  7880626.833333 usecs/op/cpu

The benchmark tests inherited breakpoint perf events across many
threads.

Looking at a perf profile, we can see that the majority of the time is
spent in various hw_breakpoint.c functions, all of which execute within
the 'nr_bp_mutex' critical section; this in turn results in contention
on the mutex itself:

    37.27%  [kernel]       [k] osq_lock
    34.92%  [kernel]       [k] mutex_spin_on_owner
    12.15%  [kernel]       [k] toggle_bp_slot
    11.90%  [kernel]       [k] __reserve_bp_slot

The culprit here is task_bp_pinned(), which has a runtime complexity of
O(#tasks) due to storing all task breakpoints in the same list and
iterating through that list looking for a matching task. Clearly, this
does not scale to thousands of tasks.

While one option would be to embed a per-task breakpoint list directly
in task_struct, this would only further bloat task_struct with
infrequently used data.

Instead, make use of the "rhashtable" variant "rhltable", which stores
multiple items with the same key in a list. This results in an average
runtime complexity of O(1) for task_bp_pinned().
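
Concretely, the rhltable added below is keyed by the target task
pointer, so a single lookup yields the list of all breakpoints attached
to that task (abridged from the diff below; type/CPU filtering omitted
here):

	head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
	if (head)
		rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list)
			count += hw_breakpoint_weight(iter);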

With the optimization, the benchmark shows:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.208 [sec]
 |
 |      108.422396 usecs/op
 |     6939.033333 usecs/op/cpu

On this particular setup that's a speedup of ~1135x.

Signed-off-by: Marco Elver <elver@google.com>
---
 include/linux/perf_event.h    |  3 +-
 kernel/events/hw_breakpoint.c | 56 ++++++++++++++++++++++-------------
 2 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 01231f1d976c..e27360436dc6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -36,6 +36,7 @@ struct perf_guest_info_callbacks {
 };
 
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
+#include <linux/rhashtable-types.h>
 #include <asm/hw_breakpoint.h>
 #endif
 
@@ -178,7 +179,7 @@ struct hw_perf_event {
 			 * creation and event initalization.
 			 */
 			struct arch_hw_breakpoint	info;
-			struct list_head		bp_list;
+			struct rhlist_head		bp_list;
 		};
 #endif
 		struct { /* amd_iommu */
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index f32320ac02fd..25c94c6e918d 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -28,7 +28,7 @@
 #include <linux/sched.h>
 #include <linux/init.h>
 #include <linux/slab.h>
-#include <linux/list.h>
+#include <linux/rhashtable.h>
 #include <linux/cpu.h>
 #include <linux/smp.h>
 #include <linux/bug.h>
@@ -55,7 +55,13 @@ static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
 }
 
 /* Keep track of the breakpoints attached to tasks */
-static LIST_HEAD(bp_task_head);
+static struct rhltable task_bps_ht;
+static const struct rhashtable_params task_bps_ht_params = {
+	.head_offset = offsetof(struct hw_perf_event, bp_list),
+	.key_offset = offsetof(struct hw_perf_event, target),
+	.key_len = sizeof_field(struct hw_perf_event, target),
+	.automatic_shrinking = true,
+};
 
 static int constraints_initialized;
 
@@ -104,17 +110,23 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
  */
 static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
 {
-	struct task_struct *tsk = bp->hw.target;
+	struct rhlist_head *head, *pos;
 	struct perf_event *iter;
 	int count = 0;
 
-	list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
-		if (iter->hw.target == tsk &&
-		    find_slot_idx(iter->attr.bp_type) == type &&
+	rcu_read_lock();
+	head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
+	if (!head)
+		goto out;
+
+	rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
+		if (find_slot_idx(iter->attr.bp_type) == type &&
 		    (iter->cpu < 0 || cpu == iter->cpu))
 			count += hw_breakpoint_weight(iter);
 	}
 
+out:
+	rcu_read_unlock();
 	return count;
 }
 
@@ -187,7 +199,7 @@ static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
 /*
  * Add/remove the given breakpoint in our constraint table
  */
-static void
+static int
 toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
 	       int weight)
 {
@@ -200,7 +212,7 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
 	/* Pinned counter cpu profiling */
 	if (!bp->hw.target) {
 		get_bp_info(bp->cpu, type)->cpu_pinned += weight;
-		return;
+		return 0;
 	}
 
 	/* Pinned counter task profiling */
@@ -208,9 +220,9 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
 		toggle_bp_task_slot(bp, cpu, type, weight);
 
 	if (enable)
-		list_add_tail(&bp->hw.bp_list, &bp_task_head);
+		return rhltable_insert(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
 	else
-		list_del(&bp->hw.bp_list);
+		return rhltable_remove(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
 }
 
 __weak int arch_reserve_bp_slot(struct perf_event *bp)
@@ -308,9 +320,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
 	if (ret)
 		return ret;
 
-	toggle_bp_slot(bp, true, type, weight);
-
-	return 0;
+	return toggle_bp_slot(bp, true, type, weight);
 }
 
 int reserve_bp_slot(struct perf_event *bp)
@@ -335,7 +345,7 @@ static void __release_bp_slot(struct perf_event *bp, u64 bp_type)
 
 	type = find_slot_idx(bp_type);
 	weight = hw_breakpoint_weight(bp);
-	toggle_bp_slot(bp, false, type, weight);
+	WARN_ON(toggle_bp_slot(bp, false, type, weight));
 }
 
 void release_bp_slot(struct perf_event *bp)
@@ -679,7 +689,7 @@ static struct pmu perf_breakpoint = {
 int __init init_hw_breakpoint(void)
 {
 	int cpu, err_cpu;
-	int i;
+	int i, ret;
 
 	for (i = 0; i < TYPE_MAX; i++)
 		nr_slots[i] = hw_breakpoint_slots(i);
@@ -690,18 +700,24 @@ int __init init_hw_breakpoint(void)
 
 			info->tsk_pinned = kcalloc(nr_slots[i], sizeof(int),
 							GFP_KERNEL);
-			if (!info->tsk_pinned)
-				goto err_alloc;
+			if (!info->tsk_pinned) {
+				ret = -ENOMEM;
+				goto err;
+			}
 		}
 	}
 
+	ret = rhltable_init(&task_bps_ht, &task_bps_ht_params);
+	if (ret)
+		goto err;
+
 	constraints_initialized = 1;
 
 	perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT);
 
 	return register_die_notifier(&hw_breakpoint_exceptions_nb);
 
- err_alloc:
+err:
 	for_each_possible_cpu(err_cpu) {
 		for (i = 0; i < TYPE_MAX; i++)
 			kfree(get_bp_info(err_cpu, i)->tsk_pinned);
@@ -709,7 +725,5 @@ int __init init_hw_breakpoint(void)
 			break;
 	}
 
-	return -ENOMEM;
+	return ret;
 }
-
-
-- 
2.36.1.255.ge46751e96f-goog


* [PATCH 2/8] perf/hw_breakpoint: Mark data __ro_after_init
  2022-06-09 11:30 [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Marco Elver
  2022-06-09 11:30 ` [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints Marco Elver
@ 2022-06-09 11:30 ` Marco Elver
  2022-06-09 11:45   ` Dmitry Vyukov
  2022-06-09 11:30 ` [PATCH 3/8] perf/hw_breakpoint: Optimize constant number of breakpoint slots Marco Elver
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Marco Elver @ 2022-06-09 11:30 UTC (permalink / raw)
  To: elver, Peter Zijlstra, Frederic Weisbecker, Ingo Molnar
  Cc: Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Dmitry Vyukov,
	linux-perf-users, x86, linux-sh, kasan-dev, linux-kernel

Mark read-only data after initialization as __ro_after_init.

While we are here, turn 'constraints_initialized' into a bool.

Signed-off-by: Marco Elver <elver@google.com>
---
 kernel/events/hw_breakpoint.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index 25c94c6e918d..1f718745d569 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -47,7 +47,7 @@ struct bp_cpuinfo {
 };
 
 static DEFINE_PER_CPU(struct bp_cpuinfo, bp_cpuinfo[TYPE_MAX]);
-static int nr_slots[TYPE_MAX];
+static int nr_slots[TYPE_MAX] __ro_after_init;
 
 static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
 {
@@ -63,7 +63,7 @@ static const struct rhashtable_params task_bps_ht_params = {
 	.automatic_shrinking = true,
 };
 
-static int constraints_initialized;
+static bool constraints_initialized __ro_after_init;
 
 /* Gather the number of total pinned and un-pinned bp in a cpuset */
 struct bp_busy_slots {
@@ -711,7 +711,7 @@ int __init init_hw_breakpoint(void)
 	if (ret)
 		goto err;
 
-	constraints_initialized = 1;
+	constraints_initialized = true;
 
 	perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT);
 
-- 
2.36.1.255.ge46751e96f-goog


* [PATCH 3/8] perf/hw_breakpoint: Optimize constant number of breakpoint slots
  2022-06-09 11:30 [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Marco Elver
  2022-06-09 11:30 ` [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints Marco Elver
  2022-06-09 11:30 ` [PATCH 2/8] perf/hw_breakpoint: Mark data __ro_after_init Marco Elver
@ 2022-06-09 11:30 ` Marco Elver
  2022-06-09 11:55   ` Dmitry Vyukov
  2022-06-09 11:30 ` [PATCH 4/8] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable Marco Elver
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Marco Elver @ 2022-06-09 11:30 UTC (permalink / raw)
  To: elver, Peter Zijlstra, Frederic Weisbecker, Ingo Molnar
  Cc: Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Dmitry Vyukov,
	linux-perf-users, x86, linux-sh, kasan-dev, linux-kernel

Optimize internal hw_breakpoint state if the architecture's number of
breakpoint slots is constant. This avoids several kmalloc() calls and
potentially unnecessary failures if those allocations fail, and subtly
improves code generation and cache locality.

The protocol is that if an architecture defines hw_breakpoint_slots via
the preprocessor, it must be constant and the same for all types.

Signed-off-by: Marco Elver <elver@google.com>
---
 arch/sh/include/asm/hw_breakpoint.h  |  5 +-
 arch/x86/include/asm/hw_breakpoint.h |  5 +-
 kernel/events/hw_breakpoint.c        | 92 ++++++++++++++++++----------
 3 files changed, 62 insertions(+), 40 deletions(-)

diff --git a/arch/sh/include/asm/hw_breakpoint.h b/arch/sh/include/asm/hw_breakpoint.h
index 199d17b765f2..361a0f57bdeb 100644
--- a/arch/sh/include/asm/hw_breakpoint.h
+++ b/arch/sh/include/asm/hw_breakpoint.h
@@ -48,10 +48,7 @@ struct pmu;
 /* Maximum number of UBC channels */
 #define HBP_NUM		2
 
-static inline int hw_breakpoint_slots(int type)
-{
-	return HBP_NUM;
-}
+#define hw_breakpoint_slots(type) (HBP_NUM)
 
 /* arch/sh/kernel/hw_breakpoint.c */
 extern int arch_check_bp_in_kernelspace(struct arch_hw_breakpoint *hw);
diff --git a/arch/x86/include/asm/hw_breakpoint.h b/arch/x86/include/asm/hw_breakpoint.h
index a1f0e90d0818..0bc931cd0698 100644
--- a/arch/x86/include/asm/hw_breakpoint.h
+++ b/arch/x86/include/asm/hw_breakpoint.h
@@ -44,10 +44,7 @@ struct arch_hw_breakpoint {
 /* Total number of available HW breakpoint registers */
 #define HBP_NUM 4
 
-static inline int hw_breakpoint_slots(int type)
-{
-	return HBP_NUM;
-}
+#define hw_breakpoint_slots(type) (HBP_NUM)
 
 struct perf_event_attr;
 struct perf_event;
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index 1f718745d569..8e939723f27d 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -41,13 +41,16 @@ struct bp_cpuinfo {
 	/* Number of pinned cpu breakpoints in a cpu */
 	unsigned int	cpu_pinned;
 	/* tsk_pinned[n] is the number of tasks having n+1 breakpoints */
+#ifdef hw_breakpoint_slots
+	unsigned int	tsk_pinned[hw_breakpoint_slots(0)];
+#else
 	unsigned int	*tsk_pinned;
+#endif
 	/* Number of non-pinned cpu/task breakpoints in a cpu */
 	unsigned int	flexible; /* XXX: placeholder, see fetch_this_slot() */
 };
 
 static DEFINE_PER_CPU(struct bp_cpuinfo, bp_cpuinfo[TYPE_MAX]);
-static int nr_slots[TYPE_MAX] __ro_after_init;
 
 static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
 {
@@ -74,6 +77,54 @@ struct bp_busy_slots {
 /* Serialize accesses to the above constraints */
 static DEFINE_MUTEX(nr_bp_mutex);
 
+#ifdef hw_breakpoint_slots
+/*
+ * Number of breakpoint slots is constant, and the same for all types.
+ */
+static_assert(hw_breakpoint_slots(TYPE_INST) == hw_breakpoint_slots(TYPE_DATA));
+static inline int hw_breakpoint_slots_cached(int type)	{ return hw_breakpoint_slots(type); }
+static inline int init_breakpoint_slots(void)		{ return 0; }
+#else
+/*
+ * Dynamic number of breakpoint slots.
+ */
+static int __nr_bp_slots[TYPE_MAX] __ro_after_init;
+
+static inline int hw_breakpoint_slots_cached(int type)
+{
+	return __nr_bp_slots[type];
+}
+
+static __init int init_breakpoint_slots(void)
+{
+	int i, cpu, err_cpu;
+
+	for (i = 0; i < TYPE_MAX; i++)
+		__nr_bp_slots[i] = hw_breakpoint_slots(i);
+
+	for_each_possible_cpu(cpu) {
+		for (i = 0; i < TYPE_MAX; i++) {
+			struct bp_cpuinfo *info = get_bp_info(cpu, i);
+
+			info->tsk_pinned = kcalloc(__nr_bp_slots[i], sizeof(int), GFP_KERNEL);
+			if (!info->tsk_pinned)
+				goto err;
+		}
+	}
+
+	return 0;
+err:
+	for_each_possible_cpu(err_cpu) {
+		for (i = 0; i < TYPE_MAX; i++)
+			kfree(get_bp_info(err_cpu, i)->tsk_pinned);
+		if (err_cpu == cpu)
+			break;
+	}
+
+	return -ENOMEM;
+}
+#endif
+
 __weak int hw_breakpoint_weight(struct perf_event *bp)
 {
 	return 1;
@@ -96,7 +147,7 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
 	unsigned int *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
 	int i;
 
-	for (i = nr_slots[type] - 1; i >= 0; i--) {
+	for (i = hw_breakpoint_slots_cached(type) - 1; i >= 0; i--) {
 		if (tsk_pinned[i] > 0)
 			return i + 1;
 	}
@@ -313,7 +364,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
 	fetch_this_slot(&slots, weight);
 
 	/* Flexible counters need to keep at least one slot */
-	if (slots.pinned + (!!slots.flexible) > nr_slots[type])
+	if (slots.pinned + (!!slots.flexible) > hw_breakpoint_slots_cached(type))
 		return -ENOSPC;
 
 	ret = arch_reserve_bp_slot(bp);
@@ -688,42 +739,19 @@ static struct pmu perf_breakpoint = {
 
 int __init init_hw_breakpoint(void)
 {
-	int cpu, err_cpu;
-	int i, ret;
-
-	for (i = 0; i < TYPE_MAX; i++)
-		nr_slots[i] = hw_breakpoint_slots(i);
-
-	for_each_possible_cpu(cpu) {
-		for (i = 0; i < TYPE_MAX; i++) {
-			struct bp_cpuinfo *info = get_bp_info(cpu, i);
-
-			info->tsk_pinned = kcalloc(nr_slots[i], sizeof(int),
-							GFP_KERNEL);
-			if (!info->tsk_pinned) {
-				ret = -ENOMEM;
-				goto err;
-			}
-		}
-	}
+	int ret;
 
 	ret = rhltable_init(&task_bps_ht, &task_bps_ht_params);
 	if (ret)
-		goto err;
+		return ret;
+
+	ret = init_breakpoint_slots();
+	if (ret)
+		return ret;
 
 	constraints_initialized = true;
 
 	perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT);
 
 	return register_die_notifier(&hw_breakpoint_exceptions_nb);
-
-err:
-	for_each_possible_cpu(err_cpu) {
-		for (i = 0; i < TYPE_MAX; i++)
-			kfree(get_bp_info(err_cpu, i)->tsk_pinned);
-		if (err_cpu == cpu)
-			break;
-	}
-
-	return ret;
 }
-- 
2.36.1.255.ge46751e96f-goog


* [PATCH 4/8] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
  2022-06-09 11:30 [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Marco Elver
                   ` (2 preceding siblings ...)
  2022-06-09 11:30 ` [PATCH 3/8] perf/hw_breakpoint: Optimize constant number of breakpoint slots Marco Elver
@ 2022-06-09 11:30 ` Marco Elver
  2022-06-09 12:03   ` Dmitry Vyukov
  2022-06-09 11:30 ` [PATCH 5/8] perf/hw_breakpoint: Remove useless code related to flexible breakpoints Marco Elver
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Marco Elver @ 2022-06-09 11:30 UTC (permalink / raw)
  To: elver, Peter Zijlstra, Frederic Weisbecker, Ingo Molnar
  Cc: Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Dmitry Vyukov,
	linux-perf-users, x86, linux-sh, kasan-dev, linux-kernel

Because hw_breakpoint_weight() is a __weak function, the compiler will
always emit a call to it. This generates unnecessarily poor code
(register spills etc.); in fact it appears in profiles of
`perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512`:

    ...
    0.70%  [kernel]       [k] hw_breakpoint_weight
    ...

While this is only a small percentage, no architecture defines its own
hw_breakpoint_weight(), nor are there any users outside
hw_breakpoint.c, which makes its current __weak definition a poor
choice.

Change hw_breakpoint_weight()'s definition to follow a similar protocol
to hw_breakpoint_slots(), such that if <asm/hw_breakpoint.h> defines
hw_breakpoint_weight(), we'll use it instead.

The result is that it is inlined and no longer shows up in profiles.
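
For illustration, an architecture that wanted a non-default weight
could now provide its own definition in <asm/hw_breakpoint.h>. This is
a hypothetical sketch only -- no in-tree architecture currently
overrides it, and the value 2 is made up:

	/* Hypothetical override; not part of this series. */
	static inline int hw_breakpoint_weight(struct perf_event *bp)
	{
		return 2;	/* e.g. if one breakpoint consumes two slots */
	}
	#define hw_breakpoint_weight hw_breakpoint_weight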

Signed-off-by: Marco Elver <elver@google.com>
---
 include/linux/hw_breakpoint.h | 1 -
 kernel/events/hw_breakpoint.c | 4 +++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/hw_breakpoint.h b/include/linux/hw_breakpoint.h
index 78dd7035d1e5..9fa3547acd87 100644
--- a/include/linux/hw_breakpoint.h
+++ b/include/linux/hw_breakpoint.h
@@ -79,7 +79,6 @@ extern int dbg_reserve_bp_slot(struct perf_event *bp);
 extern int dbg_release_bp_slot(struct perf_event *bp);
 extern int reserve_bp_slot(struct perf_event *bp);
 extern void release_bp_slot(struct perf_event *bp);
-int hw_breakpoint_weight(struct perf_event *bp);
 int arch_reserve_bp_slot(struct perf_event *bp);
 void arch_release_bp_slot(struct perf_event *bp);
 void arch_unregister_hw_breakpoint(struct perf_event *bp);
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index 8e939723f27d..5f40c8dfa042 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -125,10 +125,12 @@ static __init int init_breakpoint_slots(void)
 }
 #endif
 
-__weak int hw_breakpoint_weight(struct perf_event *bp)
+#ifndef hw_breakpoint_weight
+static inline int hw_breakpoint_weight(struct perf_event *bp)
 {
 	return 1;
 }
+#endif
 
 static inline enum bp_type_idx find_slot_idx(u64 bp_type)
 {
-- 
2.36.1.255.ge46751e96f-goog


* [PATCH 5/8] perf/hw_breakpoint: Remove useless code related to flexible breakpoints
  2022-06-09 11:30 [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Marco Elver
                   ` (3 preceding siblings ...)
  2022-06-09 11:30 ` [PATCH 4/8] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable Marco Elver
@ 2022-06-09 11:30 ` Marco Elver
  2022-06-09 12:04   ` Dmitry Vyukov
  2022-06-09 11:30 ` [PATCH 6/8] perf/hw_breakpoint: Reduce contention with large number of tasks Marco Elver
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Marco Elver @ 2022-06-09 11:30 UTC (permalink / raw)
  To: elver, Peter Zijlstra, Frederic Weisbecker, Ingo Molnar
  Cc: Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Dmitry Vyukov,
	linux-perf-users, x86, linux-sh, kasan-dev, linux-kernel

Flexible breakpoints have never been implemented, with
bp_cpuinfo::flexible always being 0. Unfortunately, they still occupy 4
bytes in each bp_cpuinfo and bp_busy_slots, and the maximum flexible
count is still computed in fetch_bp_busy_slots().

This again causes suboptimal code generation, even though we always
know that `!!slots.flexible` will be 0.

Just get rid of the flexible "placeholder" and remove all real code
related to it. Make a note in the comment describing the constraints
algorithm, but keep the flexible cases in that description, so that if
flexible breakpoints need supporting in future, it should be trivial to
revive them (along with reverting this change).

Signed-off-by: Marco Elver <elver@google.com>
---
 kernel/events/hw_breakpoint.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index 5f40c8dfa042..afe0a6007e96 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -46,8 +46,6 @@ struct bp_cpuinfo {
 #else
 	unsigned int	*tsk_pinned;
 #endif
-	/* Number of non-pinned cpu/task breakpoints in a cpu */
-	unsigned int	flexible; /* XXX: placeholder, see fetch_this_slot() */
 };
 
 static DEFINE_PER_CPU(struct bp_cpuinfo, bp_cpuinfo[TYPE_MAX]);
@@ -71,7 +69,6 @@ static bool constraints_initialized __ro_after_init;
 /* Gather the number of total pinned and un-pinned bp in a cpuset */
 struct bp_busy_slots {
 	unsigned int pinned;
-	unsigned int flexible;
 };
 
 /* Serialize accesses to the above constraints */
@@ -213,10 +210,6 @@ fetch_bp_busy_slots(struct bp_busy_slots *slots, struct perf_event *bp,
 
 		if (nr > slots->pinned)
 			slots->pinned = nr;
-
-		nr = info->flexible;
-		if (nr > slots->flexible)
-			slots->flexible = nr;
 	}
 }
 
@@ -299,7 +292,8 @@ __weak void arch_unregister_hw_breakpoint(struct perf_event *bp)
 }
 
 /*
- * Constraints to check before allowing this new breakpoint counter:
+ * Constraints to check before allowing this new breakpoint counter. Note that
+ * flexible breakpoints are currently unsupported -- see fetch_this_slot().
  *
  *  == Non-pinned counter == (Considered as pinned for now)
  *
@@ -366,7 +360,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
 	fetch_this_slot(&slots, weight);
 
 	/* Flexible counters need to keep at least one slot */
-	if (slots.pinned + (!!slots.flexible) > hw_breakpoint_slots_cached(type))
+	if (slots.pinned > hw_breakpoint_slots_cached(type))
 		return -ENOSPC;
 
 	ret = arch_reserve_bp_slot(bp);
-- 
2.36.1.255.ge46751e96f-goog


* [PATCH 6/8] perf/hw_breakpoint: Reduce contention with large number of tasks
  2022-06-09 11:30 [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Marco Elver
                   ` (4 preceding siblings ...)
  2022-06-09 11:30 ` [PATCH 5/8] perf/hw_breakpoint: Remove useless code related to flexible breakpoints Marco Elver
@ 2022-06-09 11:30 ` Marco Elver
  2022-06-09 13:03   ` Dmitry Vyukov
  2022-06-09 11:30 ` [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent Marco Elver
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Marco Elver @ 2022-06-09 11:30 UTC (permalink / raw)
  To: elver, Peter Zijlstra, Frederic Weisbecker, Ingo Molnar
  Cc: Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Dmitry Vyukov,
	linux-perf-users, x86, linux-sh, kasan-dev, linux-kernel

While optimizing task_bp_pinned()'s runtime complexity to O(1) on
average helps reduce time spent in the critical section, we still
suffer due to serializing everything via 'nr_bp_mutex'. Indeed, a
profile shows that contention is now the biggest issue:

    95.93%  [kernel]       [k] osq_lock
     0.70%  [kernel]       [k] mutex_spin_on_owner
     0.22%  [kernel]       [k] smp_cfm_core_cond
     0.18%  [kernel]       [k] task_bp_pinned
     0.18%  [kernel]       [k] rhashtable_jhash2
     0.15%  [kernel]       [k] queued_spin_lock_slowpath

when running the breakpoint benchmark (on a system with 256 CPUs) with:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.207 [sec]
 |
 |      108.267188 usecs/op
 |     6929.100000 usecs/op/cpu

The main concern when synchronizing the breakpoint constraints data is
that a consistent snapshot of the per-CPU and per-task data is observed.

The access pattern is as follows:

 1. If the target is a task: the task's pinned breakpoints are counted,
    checked for space, and then appended to; only bp_cpuinfo::cpu_pinned
    is used to check for conflicts with CPU-only breakpoints;
    bp_cpuinfo::tsk_pinned are incremented/decremented, but otherwise
    unused.

 2. If the target is a CPU: bp_cpuinfo::cpu_pinned is counted, along
    with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is
    incremented. No per-task breakpoints are checked.

Since rhltable safely synchronizes insertions/deletions, we can allow
concurrency as follows:

 1. If the target is a task: independent tasks may update and check the
    constraints concurrently, but same-task target calls need to be
    serialized; since bp_cpuinfo::tsk_pinned is only updated, but not
    checked, these modifications can happen concurrently by switching
    tsk_pinned to atomic_t.

 2. If the target is a CPU: access to the per-CPU constraints needs to
    be serialized with other CPU-target and task-target callers (to
    stabilize the bp_cpuinfo::tsk_pinned snapshot).

We can allow the above concurrency by introducing a per-CPU constraints
data reader-writer lock (bp_cpuinfo_lock), and per-task mutexes
(task_sharded_mtx), acquired as follows (also sketched in code after
this list):

  1. If the target is a task: acquires its task_sharded_mtx, and
     acquires bp_cpuinfo_lock as a reader.

  2. If the target is a CPU: acquires bp_cpuinfo_lock as a writer.
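
In code, lock acquisition then looks roughly as follows (paraphrasing
the bp_constraints_lock() helper added by this patch):

	if (bp->hw.target) {
		/* Task target: tsk_pinned is only modified, not read. */
		mutex_lock(get_task_sharded_mtx(bp));
		read_lock(&bp_cpuinfo_lock);
	} else {
		/* CPU target: need a stable snapshot of tsk_pinned. */
		write_lock(&bp_cpuinfo_lock);
	}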

With these changes, contention with thousands of tasks is reduced to the
point where waiting on locking no longer dominates the profile:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.080 [sec]
 |
 |       42.048437 usecs/op
 |     2691.100000 usecs/op/cpu

    21.31%  [kernel]       [k] task_bp_pinned
    17.49%  [kernel]       [k] rhashtable_jhash2
     5.29%  [kernel]       [k] toggle_bp_slot
     4.45%  [kernel]       [k] mutex_spin_on_owner
     3.72%  [kernel]       [k] bcmp

On this particular setup that's a speedup of 2.5x.

We're also getting closer to the theoretical ideal performance
achievable through optimizations in hw_breakpoint.c alone -- for
reference, with constraints accounting disabled:

 | perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.067 [sec]
 |
 |       35.286458 usecs/op
 |     2258.333333 usecs/op/cpu

This means the current implementation is ~19% slower than the
theoretical ideal.

For reference, performance without any breakpoints:

 | $> perf bench -r 30 breakpoint thread -b 0 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 0 breakpoints and 64 parallelism
 |      Total time: 0.060 [sec]
 |
 |       31.365625 usecs/op
 |     2007.400000 usecs/op/cpu

The theoretical ideal is only ~12% slower than no breakpoints at all.
The current implementation is ~34% slower than no breakpoints at all.
(On a system with 256 CPUs.)

Signed-off-by: Marco Elver <elver@google.com>
---
 kernel/events/hw_breakpoint.c | 155 ++++++++++++++++++++++++++++------
 1 file changed, 128 insertions(+), 27 deletions(-)

diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index afe0a6007e96..08c9ed0626e4 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -17,6 +17,7 @@
  * This file contains the arch-independent routines.
  */
 
+#include <linux/atomic.h>
 #include <linux/irqflags.h>
 #include <linux/kallsyms.h>
 #include <linux/notifier.h>
@@ -24,8 +25,10 @@
 #include <linux/kdebug.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/mutex.h>
 #include <linux/percpu.h>
 #include <linux/sched.h>
+#include <linux/spinlock.h>
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/rhashtable.h>
@@ -42,9 +45,9 @@ struct bp_cpuinfo {
 	unsigned int	cpu_pinned;
 	/* tsk_pinned[n] is the number of tasks having n+1 breakpoints */
 #ifdef hw_breakpoint_slots
-	unsigned int	tsk_pinned[hw_breakpoint_slots(0)];
+	atomic_t	tsk_pinned[hw_breakpoint_slots(0)];
 #else
-	unsigned int	*tsk_pinned;
+	atomic_t	*tsk_pinned;
 #endif
 };
 
@@ -71,8 +74,81 @@ struct bp_busy_slots {
 	unsigned int pinned;
 };
 
-/* Serialize accesses to the above constraints */
-static DEFINE_MUTEX(nr_bp_mutex);
+/*
+ * Synchronizes accesses to the per-CPU constraints; users of data in bp_cpuinfo
+ * must acquire bp_cpuinfo_lock as writer to get a stable snapshot of all CPUs'
+ * constraints. Modifications without use may only acquire bp_cpuinfo_lock as a
+ * reader, but must otherwise ensure modifications are never lost.
+ */
+static DEFINE_RWLOCK(bp_cpuinfo_lock);
+
+/*
+ * Synchronizes accesses to the per-task breakpoint list in task_bps_ht. Since
+ * rhltable synchronizes concurrent insertions/deletions, independent tasks may
+ * insert/delete concurrently; therefore, a mutex per task would be sufficient.
+ *
+ * To avoid bloating task_struct with infrequently used data, use a sharded
+ * mutex that scales with number of CPUs.
+ */
+static DEFINE_PER_CPU(struct mutex, task_sharded_mtx);
+
+static struct mutex *get_task_sharded_mtx(struct perf_event *bp)
+{
+	int shard;
+
+	if (!bp->hw.target)
+		return NULL;
+
+	/*
+	 * Compute a valid shard index into per-CPU data.
+	 */
+	shard = task_pid_nr(bp->hw.target) % nr_cpu_ids;
+	shard = cpumask_next(shard - 1, cpu_possible_mask);
+	if (shard >= nr_cpu_ids)
+		shard = cpumask_first(cpu_possible_mask);
+
+	return per_cpu_ptr(&task_sharded_mtx, shard);
+}
+
+static struct mutex *bp_constraints_lock(struct perf_event *bp)
+{
+	struct mutex *mtx = get_task_sharded_mtx(bp);
+
+	if (mtx) {
+		mutex_lock(mtx);
+		read_lock(&bp_cpuinfo_lock);
+	} else {
+		write_lock(&bp_cpuinfo_lock);
+	}
+
+	return mtx;
+}
+
+static void bp_constraints_unlock(struct mutex *mtx)
+{
+	if (mtx) {
+		read_unlock(&bp_cpuinfo_lock);
+		mutex_unlock(mtx);
+	} else {
+		write_unlock(&bp_cpuinfo_lock);
+	}
+}
+
+static bool bp_constraints_is_locked(struct perf_event *bp)
+{
+	struct mutex *mtx = get_task_sharded_mtx(bp);
+
+	return (mtx ? mutex_is_locked(mtx) : false) ||
+	       rwlock_is_contended(&bp_cpuinfo_lock);
+}
+
+static inline void assert_bp_constraints_lock_held(struct perf_event *bp)
+{
+	lockdep_assert_held(&bp_cpuinfo_lock);
+	/* Don't call get_task_sharded_mtx() if lockdep is disabled. */
+	if (IS_ENABLED(CONFIG_LOCKDEP) && bp->hw.target)
+		lockdep_assert_held(get_task_sharded_mtx(bp));
+}
 
 #ifdef hw_breakpoint_slots
 /*
@@ -103,7 +179,7 @@ static __init int init_breakpoint_slots(void)
 		for (i = 0; i < TYPE_MAX; i++) {
 			struct bp_cpuinfo *info = get_bp_info(cpu, i);
 
-			info->tsk_pinned = kcalloc(__nr_bp_slots[i], sizeof(int), GFP_KERNEL);
+			info->tsk_pinned = kcalloc(__nr_bp_slots[i], sizeof(atomic_t), GFP_KERNEL);
 			if (!info->tsk_pinned)
 				goto err;
 		}
@@ -143,11 +219,19 @@ static inline enum bp_type_idx find_slot_idx(u64 bp_type)
  */
 static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
 {
-	unsigned int *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
+	atomic_t *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
 	int i;
 
+	/*
+	 * At this point we want to have acquired the bp_cpuinfo_lock as a
+	 * writer to ensure that there are no concurrent writers in
+	 * toggle_bp_task_slot() to tsk_pinned, and we get a stable snapshot.
+	 */
+	lockdep_assert_held_write(&bp_cpuinfo_lock);
+
 	for (i = hw_breakpoint_slots_cached(type) - 1; i >= 0; i--) {
-		if (tsk_pinned[i] > 0)
+		ASSERT_EXCLUSIVE_WRITER(tsk_pinned[i]); /* Catch unexpected writers. */
+		if (atomic_read(&tsk_pinned[i]) > 0)
 			return i + 1;
 	}
 
@@ -164,6 +248,11 @@ static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
 	struct perf_event *iter;
 	int count = 0;
 
+	/*
+	 * We need a stable snapshot of the per-task breakpoint list.
+	 */
+	assert_bp_constraints_lock_held(bp);
+
 	rcu_read_lock();
 	head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
 	if (!head)
@@ -230,16 +319,25 @@ fetch_this_slot(struct bp_busy_slots *slots, int weight)
 static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
 				enum bp_type_idx type, int weight)
 {
-	unsigned int *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
+	atomic_t *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
 	int old_idx, new_idx;
 
+	/*
+	 * If bp->hw.target, tsk_pinned is only modified, but not used
+	 * otherwise. We can permit concurrent updates as long as there are no
+	 * other uses: having acquired bp_cpuinfo_lock as a reader allows
+	 * concurrent updates here. Uses of tsk_pinned will require acquiring
+	 * bp_cpuinfo_lock as a writer to stabilize tsk_pinned's value.
+	 */
+	lockdep_assert_held_read(&bp_cpuinfo_lock);
+
 	old_idx = task_bp_pinned(cpu, bp, type) - 1;
 	new_idx = old_idx + weight;
 
 	if (old_idx >= 0)
-		tsk_pinned[old_idx]--;
+		atomic_dec(&tsk_pinned[old_idx]);
 	if (new_idx >= 0)
-		tsk_pinned[new_idx]++;
+		atomic_inc(&tsk_pinned[new_idx]);
 }
 
 /*
@@ -257,6 +355,7 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
 
 	/* Pinned counter cpu profiling */
 	if (!bp->hw.target) {
+		lockdep_assert_held_write(&bp_cpuinfo_lock);
 		get_bp_info(bp->cpu, type)->cpu_pinned += weight;
 		return 0;
 	}
@@ -265,6 +364,11 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
 	for_each_cpu(cpu, cpumask)
 		toggle_bp_task_slot(bp, cpu, type, weight);
 
+	/*
+	 * Readers want a stable snapshot of the per-task breakpoint list.
+	 */
+	assert_bp_constraints_lock_held(bp);
+
 	if (enable)
 		return rhltable_insert(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
 	else
@@ -372,14 +476,10 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
 
 int reserve_bp_slot(struct perf_event *bp)
 {
-	int ret;
-
-	mutex_lock(&nr_bp_mutex);
-
-	ret = __reserve_bp_slot(bp, bp->attr.bp_type);
-
-	mutex_unlock(&nr_bp_mutex);
+	struct mutex *mtx = bp_constraints_lock(bp);
+	int ret = __reserve_bp_slot(bp, bp->attr.bp_type);
 
+	bp_constraints_unlock(mtx);
 	return ret;
 }
 
@@ -397,12 +497,11 @@ static void __release_bp_slot(struct perf_event *bp, u64 bp_type)
 
 void release_bp_slot(struct perf_event *bp)
 {
-	mutex_lock(&nr_bp_mutex);
+	struct mutex *mtx = bp_constraints_lock(bp);
 
 	arch_unregister_hw_breakpoint(bp);
 	__release_bp_slot(bp, bp->attr.bp_type);
-
-	mutex_unlock(&nr_bp_mutex);
+	bp_constraints_unlock(mtx);
 }
 
 static int __modify_bp_slot(struct perf_event *bp, u64 old_type, u64 new_type)
@@ -429,11 +528,10 @@ static int __modify_bp_slot(struct perf_event *bp, u64 old_type, u64 new_type)
 
 static int modify_bp_slot(struct perf_event *bp, u64 old_type, u64 new_type)
 {
-	int ret;
+	struct mutex *mtx = bp_constraints_lock(bp);
+	int ret = __modify_bp_slot(bp, old_type, new_type);
 
-	mutex_lock(&nr_bp_mutex);
-	ret = __modify_bp_slot(bp, old_type, new_type);
-	mutex_unlock(&nr_bp_mutex);
+	bp_constraints_unlock(mtx);
 	return ret;
 }
 
@@ -444,7 +542,7 @@ static int modify_bp_slot(struct perf_event *bp, u64 old_type, u64 new_type)
  */
 int dbg_reserve_bp_slot(struct perf_event *bp)
 {
-	if (mutex_is_locked(&nr_bp_mutex))
+	if (bp_constraints_is_locked(bp))
 		return -1;
 
 	return __reserve_bp_slot(bp, bp->attr.bp_type);
@@ -452,7 +550,7 @@ int dbg_reserve_bp_slot(struct perf_event *bp)
 
 int dbg_release_bp_slot(struct perf_event *bp)
 {
-	if (mutex_is_locked(&nr_bp_mutex))
+	if (bp_constraints_is_locked(bp))
 		return -1;
 
 	__release_bp_slot(bp, bp->attr.bp_type);
@@ -735,7 +833,10 @@ static struct pmu perf_breakpoint = {
 
 int __init init_hw_breakpoint(void)
 {
-	int ret;
+	int cpu, ret;
+
+	for_each_possible_cpu(cpu)
+		mutex_init(&per_cpu(task_sharded_mtx, cpu));
 
 	ret = rhltable_init(&task_bps_ht, &task_bps_ht_params);
 	if (ret)
-- 
2.36.1.255.ge46751e96f-goog


* [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent
  2022-06-09 11:30 [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Marco Elver
                   ` (5 preceding siblings ...)
  2022-06-09 11:30 ` [PATCH 6/8] perf/hw_breakpoint: Reduce contention with large number of tasks Marco Elver
@ 2022-06-09 11:30 ` Marco Elver
  2022-06-09 15:00   ` Dmitry Vyukov
  2022-06-09 11:30 ` [PATCH 8/8] perf/hw_breakpoint: Clean up headers Marco Elver
  2022-06-09 12:28 ` [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Dmitry Vyukov
  8 siblings, 1 reply; 34+ messages in thread
From: Marco Elver @ 2022-06-09 11:30 UTC (permalink / raw)
  To: elver, Peter Zijlstra, Frederic Weisbecker, Ingo Molnar
  Cc: Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Dmitry Vyukov,
	linux-perf-users, x86, linux-sh, kasan-dev, linux-kernel

Running the perf benchmark with (note: more aggressive parameters vs.
preceding changes, but same host with 256 CPUs):

 | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
 |      Total time: 1.953 [sec]
 |
 |       38.146289 usecs/op
 |     4882.725000 usecs/op/cpu

    16.29%  [kernel]       [k] rhashtable_jhash2
    16.19%  [kernel]       [k] osq_lock
    14.22%  [kernel]       [k] queued_spin_lock_slowpath
     8.58%  [kernel]       [k] task_bp_pinned
     8.30%  [kernel]       [k] mutex_spin_on_owner
     4.03%  [kernel]       [k] smp_cfm_core_cond
     2.97%  [kernel]       [k] toggle_bp_slot
     2.94%  [kernel]       [k] bcmp

We can see that a majority of the time is now spent hashing task
pointers to index into task_bps_ht in task_bp_pinned().

However, if task_bp_pinned()'s computation is independent of any CPU,
i.e. always `iter->cpu < 0`, the result of each per-CPU invocation will
be identical. With increasing CPU count, this redundant work worsens.
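
The per-CPU repetition comes from the callers: fetch_bp_busy_slots()
and toggle_bp_slot() invoke task_bp_pinned() once per CPU in the
breakpoint's cpumask, i.e. (abridged):

	for_each_cpu(cpu, cpumask)
		nr += task_bp_pinned(cpu, bp, type);

If all list entries have iter->cpu < 0, every one of those calls
recomputes the same count, including the rhashtable lookup.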

Instead, identify whether every call to task_bp_pinned() is
CPU-independent, and cache the result. Use the cached result instead of
repeated calls to task_bp_pinned(), now renamed __task_bp_pinned(), with
the new task_bp_pinned() wrapper deciding if the cached result can be
used.

After this optimization:

    21.96%  [kernel]       [k] queued_spin_lock_slowpath
    16.39%  [kernel]       [k] osq_lock
     9.82%  [kernel]       [k] toggle_bp_slot
     9.81%  [kernel]       [k] find_next_bit
     4.93%  [kernel]       [k] mutex_spin_on_owner
     4.71%  [kernel]       [k] smp_cfm_core_cond
     4.30%  [kernel]       [k] __reserve_bp_slot
     2.65%  [kernel]       [k] cpumask_next

This shows that the time spent hashing keys has become insignificant.

With the given benchmark parameters, however, we see no statistically
significant improvement in performance on the test system with 256 CPUs.
This is very likely due to the benchmark parameters being too aggressive
and contention elsewhere becoming dominant.

Indeed, when using the less aggressive parameters from the preceding
changes, we now observe:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.071 [sec]
 |
 |       37.134896 usecs/op
 |     2376.633333 usecs/op/cpu

This is an improvement of 12% compared to without this optimization
(baseline is 42 usecs/op). It is now only 5% slower than the
theoretical ideal (constraints accounting disabled), and 18% slower
than no breakpoints at all.

[ While we're here, swap task_bp_pinned()'s bp and cpu arguments to be
  more consistent with other functions (which have bp first, before the
  cpu argument). ]

Signed-off-by: Marco Elver <elver@google.com>
---
 kernel/events/hw_breakpoint.c | 71 +++++++++++++++++++++++++----------
 1 file changed, 52 insertions(+), 19 deletions(-)

diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index 08c9ed0626e4..3b33a4075104 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -242,11 +242,22 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
  * Count the number of breakpoints of the same type and same task.
  * The given event must be not on the list.
  */
-static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
+struct task_bp_pinned {
+	/*
+	 * If @cpu_independent is true, we can avoid calling __task_bp_pinned()
+	 * for each CPU, since @count will be the same for each invocation.
+	 */
+	bool cpu_independent;
+	int count;
+	struct perf_event *bp;
+	enum bp_type_idx type;
+};
+static struct task_bp_pinned
+__task_bp_pinned(struct perf_event *bp, int cpu, enum bp_type_idx type)
 {
+	struct task_bp_pinned ret = {true, 0, bp, type};
 	struct rhlist_head *head, *pos;
 	struct perf_event *iter;
-	int count = 0;
 
 	/*
 	 * We need a stable snapshot of the per-task breakpoint list.
@@ -259,14 +270,33 @@ static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
 		goto out;
 
 	rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
-		if (find_slot_idx(iter->attr.bp_type) == type &&
-		    (iter->cpu < 0 || cpu == iter->cpu))
-			count += hw_breakpoint_weight(iter);
+		if (find_slot_idx(iter->attr.bp_type) == type) {
+			if (iter->cpu >= 0) {
+				ret.cpu_independent = false;
+				if (cpu != iter->cpu)
+					continue;
+			}
+			ret.count += hw_breakpoint_weight(iter);
+		}
 	}
 
 out:
 	rcu_read_unlock();
-	return count;
+	return ret;
+}
+
+static int
+task_bp_pinned(struct perf_event *bp, int cpu, enum bp_type_idx type,
+	       struct task_bp_pinned *cached_tbp_pinned)
+{
+	if (cached_tbp_pinned->cpu_independent) {
+		assert_bp_constraints_lock_held(bp);
+		if (!WARN_ON(cached_tbp_pinned->bp != bp || cached_tbp_pinned->type != type))
+			return cached_tbp_pinned->count;
+	}
+
+	*cached_tbp_pinned = __task_bp_pinned(bp, cpu, type);
+	return cached_tbp_pinned->count;
 }
 
 static const struct cpumask *cpumask_of_bp(struct perf_event *bp)
@@ -281,8 +311,8 @@ static const struct cpumask *cpumask_of_bp(struct perf_event *bp)
  * a given cpu (cpu > -1) or in all of them (cpu = -1).
  */
 static void
-fetch_bp_busy_slots(struct bp_busy_slots *slots, struct perf_event *bp,
-		    enum bp_type_idx type)
+fetch_bp_busy_slots(struct bp_busy_slots *slots, struct perf_event *bp, enum bp_type_idx type,
+		    struct task_bp_pinned *cached_tbp_pinned)
 {
 	const struct cpumask *cpumask = cpumask_of_bp(bp);
 	int cpu;
@@ -295,7 +325,7 @@ fetch_bp_busy_slots(struct bp_busy_slots *slots, struct perf_event *bp,
 		if (!bp->hw.target)
 			nr += max_task_bp_pinned(cpu, type);
 		else
-			nr += task_bp_pinned(cpu, bp, type);
+			nr += task_bp_pinned(bp, cpu, type, cached_tbp_pinned);
 
 		if (nr > slots->pinned)
 			slots->pinned = nr;
@@ -314,10 +344,11 @@ fetch_this_slot(struct bp_busy_slots *slots, int weight)
 }
 
 /*
- * Add a pinned breakpoint for the given task in our constraint table
+ * Add a pinned breakpoint for the given task in our constraint table.
  */
-static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
-				enum bp_type_idx type, int weight)
+static void
+toggle_bp_task_slot(struct perf_event *bp, int cpu, enum bp_type_idx type, int weight,
+		    struct task_bp_pinned *cached_tbp_pinned)
 {
 	atomic_t *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
 	int old_idx, new_idx;
@@ -331,7 +362,7 @@ static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
 	 */
 	lockdep_assert_held_read(&bp_cpuinfo_lock);
 
-	old_idx = task_bp_pinned(cpu, bp, type) - 1;
+	old_idx = task_bp_pinned(bp, cpu, type, cached_tbp_pinned) - 1;
 	new_idx = old_idx + weight;
 
 	if (old_idx >= 0)
@@ -341,11 +372,11 @@ static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
 }
 
 /*
- * Add/remove the given breakpoint in our constraint table
+ * Add/remove the given breakpoint in our constraint table.
  */
 static int
 toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
-	       int weight)
+	       int weight, struct task_bp_pinned *cached_tbp_pinned)
 {
 	const struct cpumask *cpumask = cpumask_of_bp(bp);
 	int cpu;
@@ -362,7 +393,7 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
 
 	/* Pinned counter task profiling */
 	for_each_cpu(cpu, cpumask)
-		toggle_bp_task_slot(bp, cpu, type, weight);
+		toggle_bp_task_slot(bp, cpu, type, weight, cached_tbp_pinned);
 
 	/*
 	 * Readers want a stable snapshot of the per-task breakpoint list.
@@ -439,6 +470,7 @@ __weak void arch_unregister_hw_breakpoint(struct perf_event *bp)
  */
 static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
 {
+	struct task_bp_pinned cached_tbp_pinned = {};
 	struct bp_busy_slots slots = {0};
 	enum bp_type_idx type;
 	int weight;
@@ -456,7 +488,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
 	type = find_slot_idx(bp_type);
 	weight = hw_breakpoint_weight(bp);
 
-	fetch_bp_busy_slots(&slots, bp, type);
+	fetch_bp_busy_slots(&slots, bp, type, &cached_tbp_pinned);
 	/*
 	 * Simulate the addition of this breakpoint to the constraints
 	 * and see the result.
@@ -471,7 +503,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
 	if (ret)
 		return ret;
 
-	return toggle_bp_slot(bp, true, type, weight);
+	return toggle_bp_slot(bp, true, type, weight, &cached_tbp_pinned);
 }
 
 int reserve_bp_slot(struct perf_event *bp)
@@ -485,6 +517,7 @@ int reserve_bp_slot(struct perf_event *bp)
 
 static void __release_bp_slot(struct perf_event *bp, u64 bp_type)
 {
+	struct task_bp_pinned cached_tbp_pinned = {};
 	enum bp_type_idx type;
 	int weight;
 
@@ -492,7 +525,7 @@ static void __release_bp_slot(struct perf_event *bp, u64 bp_type)
 
 	type = find_slot_idx(bp_type);
 	weight = hw_breakpoint_weight(bp);
-	WARN_ON(toggle_bp_slot(bp, false, type, weight));
+	WARN_ON(toggle_bp_slot(bp, false, type, weight, &cached_tbp_pinned));
 }
 
 void release_bp_slot(struct perf_event *bp)
-- 
2.36.1.255.ge46751e96f-goog


* [PATCH 8/8] perf/hw_breakpoint: Clean up headers
  2022-06-09 11:30 [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Marco Elver
                   ` (6 preceding siblings ...)
  2022-06-09 11:30 ` [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent Marco Elver
@ 2022-06-09 11:30 ` Marco Elver
  2022-06-09 12:11   ` Dmitry Vyukov
  2022-06-09 12:28 ` [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Dmitry Vyukov
  8 siblings, 1 reply; 34+ messages in thread
From: Marco Elver @ 2022-06-09 11:30 UTC (permalink / raw)
  To: elver, Peter Zijlstra, Frederic Weisbecker, Ingo Molnar
  Cc: Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Dmitry Vyukov,
	linux-perf-users, x86, linux-sh, kasan-dev, linux-kernel

Clean up headers:

 - Remove unused <linux/kallsyms.h>

 - Remove unused <linux/kprobes.h>

 - Remove unused <linux/module.h>

 - Remove unused <linux/smp.h>

 - Add <linux/export.h> for EXPORT_SYMBOL_GPL().

 - Sort alphabetically.

 - Move <linux/hw_breakpoint.h> to top to test it compiles on its own.

Signed-off-by: Marco Elver <elver@google.com>
---
 kernel/events/hw_breakpoint.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index 3b33a4075104..e9aa7f2c031a 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -17,26 +17,24 @@
  * This file contains the arch-independent routines.
  */
 
+#include <linux/hw_breakpoint.h>
+
 #include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/cpu.h>
+#include <linux/export.h>
+#include <linux/init.h>
 #include <linux/irqflags.h>
-#include <linux/kallsyms.h>
-#include <linux/notifier.h>
-#include <linux/kprobes.h>
 #include <linux/kdebug.h>
 #include <linux/kernel.h>
-#include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/notifier.h>
 #include <linux/percpu.h>
+#include <linux/rhashtable.h>
 #include <linux/sched.h>
-#include <linux/spinlock.h>
-#include <linux/init.h>
 #include <linux/slab.h>
-#include <linux/rhashtable.h>
-#include <linux/cpu.h>
-#include <linux/smp.h>
-#include <linux/bug.h>
+#include <linux/spinlock.h>
 
-#include <linux/hw_breakpoint.h>
 /*
  * Constraints data
  */
-- 
2.36.1.255.ge46751e96f-goog


* Re: [PATCH 2/8] perf/hw_breakpoint: Mark data __ro_after_init
  2022-06-09 11:30 ` [PATCH 2/8] perf/hw_breakpoint: Mark data __ro_after_init Marco Elver
@ 2022-06-09 11:45   ` Dmitry Vyukov
  0 siblings, 0 replies; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-09 11:45 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
>
> Mark read-only data after initialization as __ro_after_init.
>
> While we are here, turn 'constraints_initialized' into a bool.
>
> Signed-off-by: Marco Elver <elver@google.com>

Reviewed-by: Dmitry Vyukov <dvyukov@google.com>

> ---
>  kernel/events/hw_breakpoint.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index 25c94c6e918d..1f718745d569 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -47,7 +47,7 @@ struct bp_cpuinfo {
>  };
>
>  static DEFINE_PER_CPU(struct bp_cpuinfo, bp_cpuinfo[TYPE_MAX]);
> -static int nr_slots[TYPE_MAX];
> +static int nr_slots[TYPE_MAX] __ro_after_init;
>
>  static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
>  {
> @@ -63,7 +63,7 @@ static const struct rhashtable_params task_bps_ht_params = {
>         .automatic_shrinking = true,
>  };
>
> -static int constraints_initialized;
> +static bool constraints_initialized __ro_after_init;
>
>  /* Gather the number of total pinned and un-pinned bp in a cpuset */
>  struct bp_busy_slots {
> @@ -711,7 +711,7 @@ int __init init_hw_breakpoint(void)
>         if (ret)
>                 goto err;
>
> -       constraints_initialized = 1;
> +       constraints_initialized = true;
>
>         perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT);
>
> --
> 2.36.1.255.ge46751e96f-goog
>

* Re: [PATCH 3/8] perf/hw_breakpoint: Optimize constant number of breakpoint slots
  2022-06-09 11:30 ` [PATCH 3/8] perf/hw_breakpoint: Optimize constant number of breakpoint slots Marco Elver
@ 2022-06-09 11:55   ` Dmitry Vyukov
  0 siblings, 0 replies; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-09 11:55 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
>
> Optimize internal hw_breakpoint state if the architecture's number of
> breakpoint slots is constant. This avoids several kmalloc() calls and
> potentially unnecessary failures if the allocations fail, as well as
> subtly improves code generation and cache locality.
>
> The protocol is that if an architecture defines hw_breakpoint_slots via
> the preprocessor, it must be constant and the same for all types.
>
> Signed-off-by: Marco Elver <elver@google.com>

Acked-by: Dmitry Vyukov <dvyukov@google.com>

> ---
>  arch/sh/include/asm/hw_breakpoint.h  |  5 +-
>  arch/x86/include/asm/hw_breakpoint.h |  5 +-
>  kernel/events/hw_breakpoint.c        | 92 ++++++++++++++++++----------
>  3 files changed, 62 insertions(+), 40 deletions(-)
>
> diff --git a/arch/sh/include/asm/hw_breakpoint.h b/arch/sh/include/asm/hw_breakpoint.h
> index 199d17b765f2..361a0f57bdeb 100644
> --- a/arch/sh/include/asm/hw_breakpoint.h
> +++ b/arch/sh/include/asm/hw_breakpoint.h
> @@ -48,10 +48,7 @@ struct pmu;
>  /* Maximum number of UBC channels */
>  #define HBP_NUM                2
>
> -static inline int hw_breakpoint_slots(int type)
> -{
> -       return HBP_NUM;
> -}
> +#define hw_breakpoint_slots(type) (HBP_NUM)
>
>  /* arch/sh/kernel/hw_breakpoint.c */
>  extern int arch_check_bp_in_kernelspace(struct arch_hw_breakpoint *hw);
> diff --git a/arch/x86/include/asm/hw_breakpoint.h b/arch/x86/include/asm/hw_breakpoint.h
> index a1f0e90d0818..0bc931cd0698 100644
> --- a/arch/x86/include/asm/hw_breakpoint.h
> +++ b/arch/x86/include/asm/hw_breakpoint.h
> @@ -44,10 +44,7 @@ struct arch_hw_breakpoint {
>  /* Total number of available HW breakpoint registers */
>  #define HBP_NUM 4
>
> -static inline int hw_breakpoint_slots(int type)
> -{
> -       return HBP_NUM;
> -}
> +#define hw_breakpoint_slots(type) (HBP_NUM)
>
>  struct perf_event_attr;
>  struct perf_event;
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index 1f718745d569..8e939723f27d 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -41,13 +41,16 @@ struct bp_cpuinfo {
>         /* Number of pinned cpu breakpoints in a cpu */
>         unsigned int    cpu_pinned;
>         /* tsk_pinned[n] is the number of tasks having n+1 breakpoints */
> +#ifdef hw_breakpoint_slots
> +       unsigned int    tsk_pinned[hw_breakpoint_slots(0)];
> +#else
>         unsigned int    *tsk_pinned;
> +#endif
>         /* Number of non-pinned cpu/task breakpoints in a cpu */
>         unsigned int    flexible; /* XXX: placeholder, see fetch_this_slot() */
>  };
>
>  static DEFINE_PER_CPU(struct bp_cpuinfo, bp_cpuinfo[TYPE_MAX]);
> -static int nr_slots[TYPE_MAX] __ro_after_init;
>
>  static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
>  {
> @@ -74,6 +77,54 @@ struct bp_busy_slots {
>  /* Serialize accesses to the above constraints */
>  static DEFINE_MUTEX(nr_bp_mutex);
>
> +#ifdef hw_breakpoint_slots
> +/*
> + * Number of breakpoint slots is constant, and the same for all types.
> + */
> +static_assert(hw_breakpoint_slots(TYPE_INST) == hw_breakpoint_slots(TYPE_DATA));
> +static inline int hw_breakpoint_slots_cached(int type) { return hw_breakpoint_slots(type); }
> +static inline int init_breakpoint_slots(void)          { return 0; }
> +#else
> +/*
> + * Dynamic number of breakpoint slots.
> + */
> +static int __nr_bp_slots[TYPE_MAX] __ro_after_init;
> +
> +static inline int hw_breakpoint_slots_cached(int type)
> +{
> +       return __nr_bp_slots[type];
> +}
> +
> +static __init int init_breakpoint_slots(void)
> +{
> +       int i, cpu, err_cpu;
> +
> +       for (i = 0; i < TYPE_MAX; i++)
> +               __nr_bp_slots[i] = hw_breakpoint_slots(i);
> +
> +       for_each_possible_cpu(cpu) {
> +               for (i = 0; i < TYPE_MAX; i++) {
> +                       struct bp_cpuinfo *info = get_bp_info(cpu, i);
> +
> +                       info->tsk_pinned = kcalloc(__nr_bp_slots[i], sizeof(int), GFP_KERNEL);
> +                       if (!info->tsk_pinned)
> +                               goto err;
> +               }
> +       }
> +
> +       return 0;
> +err:
> +       for_each_possible_cpu(err_cpu) {
> +               for (i = 0; i < TYPE_MAX; i++)
> +                       kfree(get_bp_info(err_cpu, i)->tsk_pinned);
> +               if (err_cpu == cpu)
> +                       break;
> +       }
> +
> +       return -ENOMEM;
> +}
> +#endif
> +
>  __weak int hw_breakpoint_weight(struct perf_event *bp)
>  {
>         return 1;
> @@ -96,7 +147,7 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
>         unsigned int *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
>         int i;
>
> -       for (i = nr_slots[type] - 1; i >= 0; i--) {
> +       for (i = hw_breakpoint_slots_cached(type) - 1; i >= 0; i--) {
>                 if (tsk_pinned[i] > 0)
>                         return i + 1;
>         }
> @@ -313,7 +364,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
>         fetch_this_slot(&slots, weight);
>
>         /* Flexible counters need to keep at least one slot */
> -       if (slots.pinned + (!!slots.flexible) > nr_slots[type])
> +       if (slots.pinned + (!!slots.flexible) > hw_breakpoint_slots_cached(type))
>                 return -ENOSPC;
>
>         ret = arch_reserve_bp_slot(bp);
> @@ -688,42 +739,19 @@ static struct pmu perf_breakpoint = {
>
>  int __init init_hw_breakpoint(void)
>  {
> -       int cpu, err_cpu;
> -       int i, ret;
> -
> -       for (i = 0; i < TYPE_MAX; i++)
> -               nr_slots[i] = hw_breakpoint_slots(i);
> -
> -       for_each_possible_cpu(cpu) {
> -               for (i = 0; i < TYPE_MAX; i++) {
> -                       struct bp_cpuinfo *info = get_bp_info(cpu, i);
> -
> -                       info->tsk_pinned = kcalloc(nr_slots[i], sizeof(int),
> -                                                       GFP_KERNEL);
> -                       if (!info->tsk_pinned) {
> -                               ret = -ENOMEM;
> -                               goto err;
> -                       }
> -               }
> -       }
> +       int ret;
>
>         ret = rhltable_init(&task_bps_ht, &task_bps_ht_params);
>         if (ret)
> -               goto err;
> +               return ret;
> +
> +       ret = init_breakpoint_slots();
> +       if (ret)
> +               return ret;
>
>         constraints_initialized = true;
>
>         perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT);
>
>         return register_die_notifier(&hw_breakpoint_exceptions_nb);
> -
> -err:
> -       for_each_possible_cpu(err_cpu) {
> -               for (i = 0; i < TYPE_MAX; i++)
> -                       kfree(get_bp_info(err_cpu, i)->tsk_pinned);
> -               if (err_cpu == cpu)
> -                       break;
> -       }
> -
> -       return ret;
>  }
> --
> 2.36.1.255.ge46751e96f-goog
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/8] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
  2022-06-09 11:30 ` [PATCH 4/8] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable Marco Elver
@ 2022-06-09 12:03   ` Dmitry Vyukov
  2022-06-09 12:08     ` Marco Elver
  2022-06-09 13:25     ` Peter Zijlstra
  0 siblings, 2 replies; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-09 12:03 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
>
> Due to being a __weak function, hw_breakpoint_weight() will cause the
> compiler to always emit a call to it. This generates unnecessarily bad
> code (register spills etc.) for no good reason; in fact it appears in
> profiles of `perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512`:
>
>     ...
>     0.70%  [kernel]       [k] hw_breakpoint_weight
>     ...
>
> While a small percentage, no architecture defines its own
> hw_breakpoint_weight() nor are there users outside hw_breakpoint.c,
> which makes the fact it is currently __weak a poor choice.
>
> Change hw_breakpoint_weight()'s definition to follow a similar protocol
> to hw_breakpoint_slots(), such that if <asm/hw_breakpoint.h> defines
> hw_breakpoint_weight(), we'll use it instead.
>
> The result is that it is inlined and no longer shows up in profiles.
>
> Signed-off-by: Marco Elver <elver@google.com>
> ---
>  include/linux/hw_breakpoint.h | 1 -
>  kernel/events/hw_breakpoint.c | 4 +++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/hw_breakpoint.h b/include/linux/hw_breakpoint.h
> index 78dd7035d1e5..9fa3547acd87 100644
> --- a/include/linux/hw_breakpoint.h
> +++ b/include/linux/hw_breakpoint.h
> @@ -79,7 +79,6 @@ extern int dbg_reserve_bp_slot(struct perf_event *bp);
>  extern int dbg_release_bp_slot(struct perf_event *bp);
>  extern int reserve_bp_slot(struct perf_event *bp);
>  extern void release_bp_slot(struct perf_event *bp);
> -int hw_breakpoint_weight(struct perf_event *bp);
>  int arch_reserve_bp_slot(struct perf_event *bp);
>  void arch_release_bp_slot(struct perf_event *bp);
>  void arch_unregister_hw_breakpoint(struct perf_event *bp);
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index 8e939723f27d..5f40c8dfa042 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -125,10 +125,12 @@ static __init int init_breakpoint_slots(void)
>  }
>  #endif
>
> -__weak int hw_breakpoint_weight(struct perf_event *bp)

Humm... this was added in 2010 and never actually used to return
anything other than 1 since then (?). Looks like over-design. Maybe we
drop "#ifndef" and add a comment instead?

> +#ifndef hw_breakpoint_weight
> +static inline int hw_breakpoint_weight(struct perf_event *bp)
>  {
>         return 1;
>  }
> +#endif
>
>  static inline enum bp_type_idx find_slot_idx(u64 bp_type)
>  {
> --
> 2.36.1.255.ge46751e96f-goog
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/8] perf/hw_breakpoint: Remove useless code related to flexible breakpoints
  2022-06-09 11:30 ` [PATCH 5/8] perf/hw_breakpoint: Remove useless code related to flexible breakpoints Marco Elver
@ 2022-06-09 12:04   ` Dmitry Vyukov
  2022-06-09 13:41     ` Dmitry Vyukov
  0 siblings, 1 reply; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-09 12:04 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
>
> Flexible breakpoints have never been implemented, with
> bp_cpuinfo::flexible always being 0. Unfortunately, they still occupy 4
> bytes in each bp_cpuinfo and bp_busy_slots, as well as computing the max
> flexible count in fetch_bp_busy_slots().
>
> This again causes suboptimal code generation, when we always know that
> `!!slots.flexible` will be 0.
>
> Just get rid of the flexible "placeholder" and remove all real code
> related to it. Make a note in the comment related to the constraints
> algorithm but don't remove them from the algorithm, so that if in future
> flexible breakpoints need supporting, it should be trivial to revive
> them (along with reverting this change).
>
> Signed-off-by: Marco Elver <elver@google.com>

Was added in 2009.

Acked-by: Dmitry Vyukov <dvyukov@google.com>

> ---
>  kernel/events/hw_breakpoint.c | 12 +++---------
>  1 file changed, 3 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index 5f40c8dfa042..afe0a6007e96 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -46,8 +46,6 @@ struct bp_cpuinfo {
>  #else
>         unsigned int    *tsk_pinned;
>  #endif
> -       /* Number of non-pinned cpu/task breakpoints in a cpu */
> -       unsigned int    flexible; /* XXX: placeholder, see fetch_this_slot() */
>  };
>
>  static DEFINE_PER_CPU(struct bp_cpuinfo, bp_cpuinfo[TYPE_MAX]);
> @@ -71,7 +69,6 @@ static bool constraints_initialized __ro_after_init;
>  /* Gather the number of total pinned and un-pinned bp in a cpuset */
>  struct bp_busy_slots {
>         unsigned int pinned;
> -       unsigned int flexible;
>  };
>
>  /* Serialize accesses to the above constraints */
> @@ -213,10 +210,6 @@ fetch_bp_busy_slots(struct bp_busy_slots *slots, struct perf_event *bp,
>
>                 if (nr > slots->pinned)
>                         slots->pinned = nr;
> -
> -               nr = info->flexible;
> -               if (nr > slots->flexible)
> -                       slots->flexible = nr;
>         }
>  }
>
> @@ -299,7 +292,8 @@ __weak void arch_unregister_hw_breakpoint(struct perf_event *bp)
>  }
>
>  /*
> - * Constraints to check before allowing this new breakpoint counter:
> + * Constraints to check before allowing this new breakpoint counter. Note that
> + * flexible breakpoints are currently unsupported -- see fetch_this_slot().
>   *
>   *  == Non-pinned counter == (Considered as pinned for now)
>   *
> @@ -366,7 +360,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
>         fetch_this_slot(&slots, weight);
>
>         /* Flexible counters need to keep at least one slot */
> -       if (slots.pinned + (!!slots.flexible) > hw_breakpoint_slots_cached(type))
> +       if (slots.pinned > hw_breakpoint_slots_cached(type))
>                 return -ENOSPC;
>
>         ret = arch_reserve_bp_slot(bp);
> --
> 2.36.1.255.ge46751e96f-goog
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/8] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
  2022-06-09 12:03   ` Dmitry Vyukov
@ 2022-06-09 12:08     ` Marco Elver
  2022-06-09 12:23       ` Dmitry Vyukov
  2022-06-09 13:25     ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: Marco Elver @ 2022-06-09 12:08 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 14:03, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
> >
> > Due to being a __weak function, hw_breakpoint_weight() will cause the
> > compiler to always emit a call to it. This generates unnecessarily bad
> > code (register spills etc.) for no good reason; in fact it appears in
> > profiles of `perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512`:
> >
> >     ...
> >     0.70%  [kernel]       [k] hw_breakpoint_weight
> >     ...
> >
> > While a small percentage, no architecture defines its own
> > hw_breakpoint_weight() nor are there users outside hw_breakpoint.c,
> > which makes the fact it is currently __weak a poor choice.
> >
> > Change hw_breakpoint_weight()'s definition to follow a similar protocol
> > to hw_breakpoint_slots(), such that if <asm/hw_breakpoint.h> defines
> > hw_breakpoint_weight(), we'll use it instead.
> >
> > The result is that it is inlined and no longer shows up in profiles.
> >
> > Signed-off-by: Marco Elver <elver@google.com>
> > ---
> >  include/linux/hw_breakpoint.h | 1 -
> >  kernel/events/hw_breakpoint.c | 4 +++-
> >  2 files changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/hw_breakpoint.h b/include/linux/hw_breakpoint.h
> > index 78dd7035d1e5..9fa3547acd87 100644
> > --- a/include/linux/hw_breakpoint.h
> > +++ b/include/linux/hw_breakpoint.h
> > @@ -79,7 +79,6 @@ extern int dbg_reserve_bp_slot(struct perf_event *bp);
> >  extern int dbg_release_bp_slot(struct perf_event *bp);
> >  extern int reserve_bp_slot(struct perf_event *bp);
> >  extern void release_bp_slot(struct perf_event *bp);
> > -int hw_breakpoint_weight(struct perf_event *bp);
> >  int arch_reserve_bp_slot(struct perf_event *bp);
> >  void arch_release_bp_slot(struct perf_event *bp);
> >  void arch_unregister_hw_breakpoint(struct perf_event *bp);
> > diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> > index 8e939723f27d..5f40c8dfa042 100644
> > --- a/kernel/events/hw_breakpoint.c
> > +++ b/kernel/events/hw_breakpoint.c
> > @@ -125,10 +125,12 @@ static __init int init_breakpoint_slots(void)
> >  }
> >  #endif
> >
> > -__weak int hw_breakpoint_weight(struct perf_event *bp)
>
> Humm... this was added in 2010 and never actually used to return
> anything other than 1 since then (?). Looks like over-design. Maybe we
> drop "#ifndef" and add a comment instead?

Then there's little reason for the function either and we can just
directly increment/decrement 1 everywhere. If we drop the ability for
an arch to override, I feel that'd be cleaner.

Either way, codegen won't change though.

Preferences?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 8/8] perf/hw_breakpoint: Clean up headers
  2022-06-09 11:30 ` [PATCH 8/8] perf/hw_breakpoint: Clean up headers Marco Elver
@ 2022-06-09 12:11   ` Dmitry Vyukov
  0 siblings, 0 replies; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-09 12:11 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
>
> Clean up headers:
>
>  - Remove unused <linux/kallsyms.h>
>
>  - Remove unused <linux/kprobes.h>
>
>  - Remove unused <linux/module.h>
>
>  - Remove unused <linux/smp.h>
>
>  - Add <linux/export.h> for EXPORT_SYMBOL_GPL().
>
>  - Sort alphabetically.
>
>  - Move <linux/hw_breakpoint.h> to top to test it compiles on its own.
>
> Signed-off-by: Marco Elver <elver@google.com>

Acked-by: Dmitry Vyukov <dvyukov@google.com>

> ---
>  kernel/events/hw_breakpoint.c | 20 +++++++++-----------
>  1 file changed, 9 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index 3b33a4075104..e9aa7f2c031a 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -17,26 +17,24 @@
>   * This file contains the arch-independent routines.
>   */
>
> +#include <linux/hw_breakpoint.h>
> +
>  #include <linux/atomic.h>
> +#include <linux/bug.h>
> +#include <linux/cpu.h>
> +#include <linux/export.h>
> +#include <linux/init.h>
>  #include <linux/irqflags.h>
> -#include <linux/kallsyms.h>
> -#include <linux/notifier.h>
> -#include <linux/kprobes.h>
>  #include <linux/kdebug.h>
>  #include <linux/kernel.h>
> -#include <linux/module.h>
>  #include <linux/mutex.h>
> +#include <linux/notifier.h>
>  #include <linux/percpu.h>
> +#include <linux/rhashtable.h>
>  #include <linux/sched.h>
> -#include <linux/spinlock.h>
> -#include <linux/init.h>
>  #include <linux/slab.h>
> -#include <linux/rhashtable.h>
> -#include <linux/cpu.h>
> -#include <linux/smp.h>
> -#include <linux/bug.h>
> +#include <linux/spinlock.h>
>
> -#include <linux/hw_breakpoint.h>
>  /*
>   * Constraints data
>   */
> --
> 2.36.1.255.ge46751e96f-goog
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/8] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
  2022-06-09 12:08     ` Marco Elver
@ 2022-06-09 12:23       ` Dmitry Vyukov
  0 siblings, 0 replies; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-09 12:23 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 14:08, Marco Elver <elver@google.com> wrote:
> > > Due to being a __weak function, hw_breakpoint_weight() will cause the
> > > compiler to always emit a call to it. This generates unnecessarily bad
> > > code (register spills etc.) for no good reason; in fact it appears in
> > > profiles of `perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512`:
> > >
> > >     ...
> > >     0.70%  [kernel]       [k] hw_breakpoint_weight
> > >     ...
> > >
> > > While a small percentage, no architecture defines its own
> > > hw_breakpoint_weight() nor are there users outside hw_breakpoint.c,
> > > which makes the fact it is currently __weak a poor choice.
> > >
> > > Change hw_breakpoint_weight()'s definition to follow a similar protocol
> > > to hw_breakpoint_slots(), such that if <asm/hw_breakpoint.h> defines
> > > hw_breakpoint_weight(), we'll use it instead.
> > >
> > > The result is that it is inlined and no longer shows up in profiles.
> > >
> > > Signed-off-by: Marco Elver <elver@google.com>
> > > ---
> > >  include/linux/hw_breakpoint.h | 1 -
> > >  kernel/events/hw_breakpoint.c | 4 +++-
> > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/hw_breakpoint.h b/include/linux/hw_breakpoint.h
> > > index 78dd7035d1e5..9fa3547acd87 100644
> > > --- a/include/linux/hw_breakpoint.h
> > > +++ b/include/linux/hw_breakpoint.h
> > > @@ -79,7 +79,6 @@ extern int dbg_reserve_bp_slot(struct perf_event *bp);
> > >  extern int dbg_release_bp_slot(struct perf_event *bp);
> > >  extern int reserve_bp_slot(struct perf_event *bp);
> > >  extern void release_bp_slot(struct perf_event *bp);
> > > -int hw_breakpoint_weight(struct perf_event *bp);
> > >  int arch_reserve_bp_slot(struct perf_event *bp);
> > >  void arch_release_bp_slot(struct perf_event *bp);
> > >  void arch_unregister_hw_breakpoint(struct perf_event *bp);
> > > diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> > > index 8e939723f27d..5f40c8dfa042 100644
> > > --- a/kernel/events/hw_breakpoint.c
> > > +++ b/kernel/events/hw_breakpoint.c
> > > @@ -125,10 +125,12 @@ static __init int init_breakpoint_slots(void)
> > >  }
> > >  #endif
> > >
> > > -__weak int hw_breakpoint_weight(struct perf_event *bp)
> >
> > Humm... this was added in 2010 and never actually used to return
> > anything other than 1 since then (?). Looks like over-design. Maybe we
> > drop "#ifndef" and add a comment instead?
>
> Then there's little reason for the function either and we can just
> directly increment/decrement 1 everywhere. If we drop the ability for
> an arch to override, I feel that'd be cleaner.
>
> Either way, codegen won't change though.
>
> Preferences?

I don't have strong preferences either way.
Can also be:
#define HW_BREAKPOINT_WEIGHT 1
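For illustration (purely a sketch, not part of this series), that could
keep the inline helper so the call sites stay untouched:

	#define HW_BREAKPOINT_WEIGHT 1

	/* Always 1; no architecture has ever overridden the weight. */
	static inline int hw_breakpoint_weight(struct perf_event *bp)
	{
		return HW_BREAKPOINT_WEIGHT;
	}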

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks
  2022-06-09 11:30 [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Marco Elver
                   ` (7 preceding siblings ...)
  2022-06-09 11:30 ` [PATCH 8/8] perf/hw_breakpoint: Clean up headers Marco Elver
@ 2022-06-09 12:28 ` Dmitry Vyukov
  8 siblings, 0 replies; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-09 12:28 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 13:30, Marco Elver <elver@google.com> wrote:
>
> The hw_breakpoint subsystem's code has seen little change in over 10
> years. In that time, systems with >100s of CPUs have become common,
> along with improvements to the perf subsystem: using breakpoints on
> thousands of concurrent tasks should be a supported usecase.
>
> The breakpoint constraints accounting algorithm is the major bottleneck
> in doing so:
>
>   1. task_bp_pinned() has been O(#tasks), and called twice for each CPU.
>
>   2. Everything is serialized on a global mutex, 'nr_bp_mutex'.
>
> This series first optimizes task_bp_pinned() to only take O(1) on
> average, and then reworks synchronization to allow concurrency when
> checking and updating breakpoint constraints for tasks. Along the way,
> smaller micro-optimizations and cleanups are done as they seemed obvious
> when staring at the code (but likely insignificant).
>
> The result is (on a system with 256 CPUs) that we go from:
>
>  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
>                 [ ^ more aggressive benchmark parameters took too long ]
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
>  |      Total time: 236.418 [sec]
>  |
>  |   123134.794271 usecs/op
>  |  7880626.833333 usecs/op/cpu
>
> ... to -- with all optimizations:
>
>  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
>  |      Total time: 0.071 [sec]
>  |
>  |       37.134896 usecs/op
>  |     2376.633333 usecs/op/cpu
>
> On the used test system, that's an effective speedup of ~3315x per op.

Awesome!

> Which is close to the theoretical ideal performance through
> optimizations in hw_breakpoint.c -- for reference, constraints
> accounting disabled:
>
>  | perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
>  |      Total time: 0.067 [sec]
>  |
>  |       35.286458 usecs/op
>  |     2258.333333 usecs/op/cpu
>
> At this point, the current implementation is only ~5% slower than the
> theoretical ideal. However, given constraints accounting cannot
> realistically be disabled, this is likely as far as we can push it.
>
> Marco Elver (8):
>   perf/hw_breakpoint: Optimize list of per-task breakpoints
>   perf/hw_breakpoint: Mark data __ro_after_init
>   perf/hw_breakpoint: Optimize constant number of breakpoint slots
>   perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
>   perf/hw_breakpoint: Remove useless code related to flexible
>     breakpoints
>   perf/hw_breakpoint: Reduce contention with large number of tasks
>   perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent
>   perf/hw_breakpoint: Clean up headers
>
>  arch/sh/include/asm/hw_breakpoint.h  |   5 +-
>  arch/x86/include/asm/hw_breakpoint.h |   5 +-
>  include/linux/hw_breakpoint.h        |   1 -
>  include/linux/perf_event.h           |   3 +-
>  kernel/events/hw_breakpoint.c        | 374 +++++++++++++++++++--------
>  5 files changed, 276 insertions(+), 112 deletions(-)
>
> --
> 2.36.1.255.ge46751e96f-goog

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints
  2022-06-09 11:30 ` [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints Marco Elver
@ 2022-06-09 12:30   ` Dmitry Vyukov
  2022-06-09 12:53     ` Marco Elver
  2022-06-09 14:29   ` Dmitry Vyukov
  1 sibling, 1 reply; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-09 12:30 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
>
> On a machine with 256 CPUs, running the recently added perf breakpoint
> benchmark results in:
>
>  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
>  |      Total time: 236.418 [sec]
>  |
>  |   123134.794271 usecs/op
>  |  7880626.833333 usecs/op/cpu
>
> The benchmark tests inherited breakpoint perf events across many
> threads.
>
> Looking at a perf profile, we can see that the majority of the time is
> spent in various hw_breakpoint.c functions, which execute within the
> 'nr_bp_mutex' critical sections which then results in contention on that
> mutex as well:
>
>     37.27%  [kernel]       [k] osq_lock
>     34.92%  [kernel]       [k] mutex_spin_on_owner
>     12.15%  [kernel]       [k] toggle_bp_slot
>     11.90%  [kernel]       [k] __reserve_bp_slot
>
> The culprit here is task_bp_pinned(), which has a runtime complexity of
> O(#tasks) due to storing all task breakpoints in the same list and
> iterating through that list looking for a matching task. Clearly, this
> does not scale to thousands of tasks.
>
> While one option would be to make task_struct a breakpoint list node,
> this would only further bloat task_struct for infrequently used data.
>
> Instead, make use of the "rhashtable" variant "rhltable" which stores
> multiple items with the same key in a list. This results in average
> runtime complexity of O(1) for task_bp_pinned().
>
> With the optimization, the benchmark shows:
>
>  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
>  |      Total time: 0.208 [sec]
>  |
>  |      108.422396 usecs/op
>  |     6939.033333 usecs/op/cpu
>
> On this particular setup that's a speedup of ~1135x.
>
> Signed-off-by: Marco Elver <elver@google.com>
> ---
>  include/linux/perf_event.h    |  3 +-
>  kernel/events/hw_breakpoint.c | 56 ++++++++++++++++++++++-------------
>  2 files changed, 37 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 01231f1d976c..e27360436dc6 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -36,6 +36,7 @@ struct perf_guest_info_callbacks {
>  };
>
>  #ifdef CONFIG_HAVE_HW_BREAKPOINT
> +#include <linux/rhashtable-types.h>
>  #include <asm/hw_breakpoint.h>
>  #endif
>
> @@ -178,7 +179,7 @@ struct hw_perf_event {
>                          * creation and event initalization.
>                          */
>                         struct arch_hw_breakpoint       info;
> -                       struct list_head                bp_list;
> +                       struct rhlist_head              bp_list;
>                 };
>  #endif
>                 struct { /* amd_iommu */
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index f32320ac02fd..25c94c6e918d 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -28,7 +28,7 @@
>  #include <linux/sched.h>
>  #include <linux/init.h>
>  #include <linux/slab.h>
> -#include <linux/list.h>
> +#include <linux/rhashtable.h>
>  #include <linux/cpu.h>
>  #include <linux/smp.h>
>  #include <linux/bug.h>
> @@ -55,7 +55,13 @@ static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
>  }
>
>  /* Keep track of the breakpoints attached to tasks */
> -static LIST_HEAD(bp_task_head);
> +static struct rhltable task_bps_ht;
> +static const struct rhashtable_params task_bps_ht_params = {
> +       .head_offset = offsetof(struct hw_perf_event, bp_list),
> +       .key_offset = offsetof(struct hw_perf_event, target),
> +       .key_len = sizeof_field(struct hw_perf_event, target),
> +       .automatic_shrinking = true,
> +};
>
>  static int constraints_initialized;
>
> @@ -104,17 +110,23 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
>   */
>  static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
>  {
> -       struct task_struct *tsk = bp->hw.target;
> +       struct rhlist_head *head, *pos;
>         struct perf_event *iter;
>         int count = 0;
>
> -       list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
> -               if (iter->hw.target == tsk &&
> -                   find_slot_idx(iter->attr.bp_type) == type &&
> +       rcu_read_lock();

Why do we need rcu_read_lock() here?
The patch does not change anything with respect to locking, so all
accesses to the container should still be protected by nr_bp_mutex.
Similarly for the rcu variant of for_each below.

Otherwise the change looks good to me.



> +       head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
> +       if (!head)
> +               goto out;
> +
> +       rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
> +               if (find_slot_idx(iter->attr.bp_type) == type &&
>                     (iter->cpu < 0 || cpu == iter->cpu))
>                         count += hw_breakpoint_weight(iter);
>         }
>
> +out:
> +       rcu_read_unlock();
>         return count;
>  }
>
> @@ -187,7 +199,7 @@ static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
>  /*
>   * Add/remove the given breakpoint in our constraint table
>   */
> -static void
> +static int
>  toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
>                int weight)
>  {
> @@ -200,7 +212,7 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
>         /* Pinned counter cpu profiling */
>         if (!bp->hw.target) {
>                 get_bp_info(bp->cpu, type)->cpu_pinned += weight;
> -               return;
> +               return 0;
>         }
>
>         /* Pinned counter task profiling */
> @@ -208,9 +220,9 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
>                 toggle_bp_task_slot(bp, cpu, type, weight);
>
>         if (enable)
> -               list_add_tail(&bp->hw.bp_list, &bp_task_head);
> +               return rhltable_insert(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
>         else
> -               list_del(&bp->hw.bp_list);
> +               return rhltable_remove(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
>  }
>
>  __weak int arch_reserve_bp_slot(struct perf_event *bp)
> @@ -308,9 +320,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
>         if (ret)
>                 return ret;
>
> -       toggle_bp_slot(bp, true, type, weight);
> -
> -       return 0;
> +       return toggle_bp_slot(bp, true, type, weight);
>  }
>
>  int reserve_bp_slot(struct perf_event *bp)
> @@ -335,7 +345,7 @@ static void __release_bp_slot(struct perf_event *bp, u64 bp_type)
>
>         type = find_slot_idx(bp_type);
>         weight = hw_breakpoint_weight(bp);
> -       toggle_bp_slot(bp, false, type, weight);
> +       WARN_ON(toggle_bp_slot(bp, false, type, weight));
>  }
>
>  void release_bp_slot(struct perf_event *bp)
> @@ -679,7 +689,7 @@ static struct pmu perf_breakpoint = {
>  int __init init_hw_breakpoint(void)
>  {
>         int cpu, err_cpu;
> -       int i;
> +       int i, ret;
>
>         for (i = 0; i < TYPE_MAX; i++)
>                 nr_slots[i] = hw_breakpoint_slots(i);
> @@ -690,18 +700,24 @@ int __init init_hw_breakpoint(void)
>
>                         info->tsk_pinned = kcalloc(nr_slots[i], sizeof(int),
>                                                         GFP_KERNEL);
> -                       if (!info->tsk_pinned)
> -                               goto err_alloc;
> +                       if (!info->tsk_pinned) {
> +                               ret = -ENOMEM;
> +                               goto err;
> +                       }
>                 }
>         }
>
> +       ret = rhltable_init(&task_bps_ht, &task_bps_ht_params);
> +       if (ret)
> +               goto err;
> +
>         constraints_initialized = 1;
>
>         perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT);
>
>         return register_die_notifier(&hw_breakpoint_exceptions_nb);
>
> - err_alloc:
> +err:
>         for_each_possible_cpu(err_cpu) {
>                 for (i = 0; i < TYPE_MAX; i++)
>                         kfree(get_bp_info(err_cpu, i)->tsk_pinned);
> @@ -709,7 +725,5 @@ int __init init_hw_breakpoint(void)
>                         break;
>         }
>
> -       return -ENOMEM;
> +       return ret;
>  }
> -
> -
> --
> 2.36.1.255.ge46751e96f-goog
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints
  2022-06-09 12:30   ` Dmitry Vyukov
@ 2022-06-09 12:53     ` Marco Elver
  2022-06-09 13:05       ` Dmitry Vyukov
  0 siblings, 1 reply; 34+ messages in thread
From: Marco Elver @ 2022-06-09 12:53 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, Jun 09, 2022 at 02:30PM +0200, Dmitry Vyukov wrote:
[...]
> > +       rcu_read_lock();
> 
> Why do we need rcu_read_lock() here?
> The patch does not change anything with respect to locking, so all
> accesses to the container should still be protected by nr_bp_mutex.
> Similarly for the rcu variant of for_each below.
[...]
> > +       head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
> > +       if (!head)
> > +               goto out;
> > +
> > +       rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {

It's part of rhashtable's interface requirements:

	/**
	 * rhltable_lookup - search hash list table
	 * @hlt:	hash table
	 * @key:	the pointer to the key
	 * @params:	hash table parameters
	 *
	 * Computes the hash value for the key and traverses the bucket chain looking
	 * for a entry with an identical key.  All matching entries are returned
	 * in a list.
	 *
	 * This must only be called under the RCU read lock.
	 *
	 * Returns the list of entries that match the given key.
	 */

Beyond that, even though there don't appear to be any concurrent
rhashtable modifications yet, they'll be allowed in patch 6/8.
Furthermore, rhashtable actually does concurrent background compaction
since I selected 'automatic_shrinking = true' (so we don't leak tons of
memory after starting and killing those 1000s of tasks) -- there's a
call_rcu() in lib/rhashtable.c that looks like where the RCU protection
comes in. This compaction runs from deferred work in rht_deferred_worker().
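As a rough sketch (condensed from the hunk above; the function name is
just a placeholder), the pattern the interface requires is:

	static int count_for_target(struct task_struct *target)
	{
		struct rhlist_head *head, *pos;
		struct perf_event *iter;
		int count = 0;

		/* Lookup and iteration must stay inside the RCU read-side
		 * critical section: the deferred resize/shrink worker frees
		 * old bucket tables via call_rcu(). */
		rcu_read_lock();
		head = rhltable_lookup(&task_bps_ht, &target, task_bps_ht_params);
		if (head) {
			rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list)
				count += hw_breakpoint_weight(iter);
		}
		rcu_read_unlock();

		return count;
	}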

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/8] perf/hw_breakpoint: Reduce contention with large number of tasks
  2022-06-09 11:30 ` [PATCH 6/8] perf/hw_breakpoint: Reduce contention with large number of tasks Marco Elver
@ 2022-06-09 13:03   ` Dmitry Vyukov
  2022-06-09 13:29     ` Marco Elver
  0 siblings, 1 reply; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-09 13:03 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

."On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
>
> While optimizing task_bp_pinned()'s runtime complexity to O(1) on
> average helps reduce time spent in the critical section, we still suffer
> due to serializing everything via 'nr_bp_mutex'. Indeed, a profile shows
> that now contention is the biggest issue:
>
>     95.93%  [kernel]       [k] osq_lock
>      0.70%  [kernel]       [k] mutex_spin_on_owner
>      0.22%  [kernel]       [k] smp_cfm_core_cond
>      0.18%  [kernel]       [k] task_bp_pinned
>      0.18%  [kernel]       [k] rhashtable_jhash2
>      0.15%  [kernel]       [k] queued_spin_lock_slowpath
>
> when running the breakpoint benchmark with (system with 256 CPUs):
>
>  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
>  |      Total time: 0.207 [sec]
>  |
>  |      108.267188 usecs/op
>  |     6929.100000 usecs/op/cpu
>
> The main concern for synchronizing the breakpoint constraints data is
> that a consistent snapshot of the per-CPU and per-task data is observed.
>
> The access pattern is as follows:
>
>  1. If the target is a task: the task's pinned breakpoints are counted,
>     checked for space, and then appended to; only bp_cpuinfo::cpu_pinned
>     is used to check for conflicts with CPU-only breakpoints;
>     bp_cpuinfo::tsk_pinned are incremented/decremented, but otherwise
>     unused.
>
>  2. If the target is a CPU: bp_cpuinfo::cpu_pinned are counted, along
>     with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is
>     incremented. No per-task breakpoints are checked.
>
> Since rhltable safely synchronizes insertions/deletions, we can allow
> concurrency as follows:
>
>  1. If the target is a task: independent tasks may update and check the
>     constraints concurrently, but same-task target calls need to be
>     serialized; since bp_cpuinfo::tsk_pinned is only updated, but not
>     checked, these modifications can happen concurrently by switching
>     tsk_pinned to atomic_t.
>
>  2. If the target is a CPU: access to the per-CPU constraints needs to
>     be serialized with other CPU-target and task-target callers (to
>     stabilize the bp_cpuinfo::tsk_pinned snapshot).
>
> We can allow the above concurrency by introducing a per-CPU constraints
> data reader-writer lock (bp_cpuinfo_lock), and per-task mutexes
> (task_sharded_mtx):
>
>   1. If the target is a task: acquires its task_sharded_mtx, and
>      acquires bp_cpuinfo_lock as a reader.
>
>   2. If the target is a CPU: acquires bp_cpuinfo_lock as a writer.
>
> With these changes, contention with thousands of tasks is reduced to the
> point where waiting on locking no longer dominates the profile:
>
>  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
>  |      Total time: 0.080 [sec]
>  |
>  |       42.048437 usecs/op
>  |     2691.100000 usecs/op/cpu
>
>     21.31%  [kernel]       [k] task_bp_pinned
>     17.49%  [kernel]       [k] rhashtable_jhash2
>      5.29%  [kernel]       [k] toggle_bp_slot
>      4.45%  [kernel]       [k] mutex_spin_on_owner
>      3.72%  [kernel]       [k] bcmp
>
> On this particular setup that's a speedup of 2.5x.
>
> We're also getting closer to the theoretical ideal performance through
> optimizations in hw_breakpoint.c -- constraints accounting disabled:
>
>  | perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
>  |      Total time: 0.067 [sec]
>  |
>  |       35.286458 usecs/op
>  |     2258.333333 usecs/op/cpu
>
> Which means the current implementation is ~19% slower than the
> theoretical ideal.
>
> For reference, performance without any breakpoints:
>
>  | $> bench -r 30 breakpoint thread -b 0 -p 64 -t 64
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 30 threads with 0 breakpoints and 64 parallelism
>  |      Total time: 0.060 [sec]
>  |
>  |       31.365625 usecs/op
>  |     2007.400000 usecs/op/cpu
>
> The theoretical ideal is only ~12% slower than no breakpoints at all.
> The current implementation is ~34% slower than no breakpoints at all.
> (On a system with 256 CPUs.)
>
> Signed-off-by: Marco Elver <elver@google.com>
> ---
>  kernel/events/hw_breakpoint.c | 155 ++++++++++++++++++++++++++++------
>  1 file changed, 128 insertions(+), 27 deletions(-)
>
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index afe0a6007e96..08c9ed0626e4 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -17,6 +17,7 @@
>   * This file contains the arch-independent routines.
>   */
>
> +#include <linux/atomic.h>
>  #include <linux/irqflags.h>
>  #include <linux/kallsyms.h>
>  #include <linux/notifier.h>
> @@ -24,8 +25,10 @@
>  #include <linux/kdebug.h>
>  #include <linux/kernel.h>
>  #include <linux/module.h>
> +#include <linux/mutex.h>
>  #include <linux/percpu.h>
>  #include <linux/sched.h>
> +#include <linux/spinlock.h>
>  #include <linux/init.h>
>  #include <linux/slab.h>
>  #include <linux/rhashtable.h>
> @@ -42,9 +45,9 @@ struct bp_cpuinfo {
>         unsigned int    cpu_pinned;
>         /* tsk_pinned[n] is the number of tasks having n+1 breakpoints */
>  #ifdef hw_breakpoint_slots
> -       unsigned int    tsk_pinned[hw_breakpoint_slots(0)];
> +       atomic_t        tsk_pinned[hw_breakpoint_slots(0)];
>  #else
> -       unsigned int    *tsk_pinned;
> +       atomic_t        *tsk_pinned;
>  #endif
>  };
>
> @@ -71,8 +74,81 @@ struct bp_busy_slots {
>         unsigned int pinned;
>  };
>
> -/* Serialize accesses to the above constraints */
> -static DEFINE_MUTEX(nr_bp_mutex);
> +/*
> + * Synchronizes accesses to the per-CPU constraints; users of data in bp_cpuinfo
> + * must acquire bp_cpuinfo_lock as writer to get a stable snapshot of all CPUs'
> + * constraints. Modifications without use may only acquire bp_cpuinfo_lock as a
> + * reader, but must otherwise ensure modifications are never lost.
> + */

I can't understand this comment.
Modifications need to acquire in read mode, while only users must
acquire in write mode. Shouldn't it be the other way around? What is
"Modifications without use"?


> +static DEFINE_RWLOCK(bp_cpuinfo_lock);
> +
> +/*
> + * Synchronizes accesses to the per-task breakpoint list in task_bps_ht. Since
> + * rhltable synchronizes concurrent insertions/deletions, independent tasks may
> + * insert/delete concurrently; therefore, a mutex per task would be sufficient.
> + *
> + * To avoid bloating task_struct with infrequently used data, use a sharded
> + * mutex that scales with number of CPUs.
> + */
> +static DEFINE_PER_CPU(struct mutex, task_sharded_mtx);
> +
> +static struct mutex *get_task_sharded_mtx(struct perf_event *bp)
> +{
> +       int shard;
> +
> +       if (!bp->hw.target)
> +               return NULL;
> +
> +       /*
> +        * Compute a valid shard index into per-CPU data.
> +        */
> +       shard = task_pid_nr(bp->hw.target) % nr_cpu_ids;
> +       shard = cpumask_next(shard - 1, cpu_possible_mask);
> +       if (shard >= nr_cpu_ids)
> +               shard = cpumask_first(cpu_possible_mask);
> +
> +       return per_cpu_ptr(&task_sharded_mtx, shard);
> +}
> +
> +static struct mutex *bp_constraints_lock(struct perf_event *bp)
> +{
> +       struct mutex *mtx = get_task_sharded_mtx(bp);
> +
> +       if (mtx) {
> +               mutex_lock(mtx);
> +               read_lock(&bp_cpuinfo_lock);

Is the NR_CPUS == 1 case still important to optimize? I guess with small
VMs it may be important again.
If so, we could just always write-lock bp_cpuinfo_lock if NR_CPUS == 1.

> +       } else {
> +               write_lock(&bp_cpuinfo_lock);
> +       }
> +
> +       return mtx;
> +}
> +
> +static void bp_constraints_unlock(struct mutex *mtx)
> +{
> +       if (mtx) {
> +               read_unlock(&bp_cpuinfo_lock);
> +               mutex_unlock(mtx);
> +       } else {
> +               write_unlock(&bp_cpuinfo_lock);
> +       }
> +}
> +
> +static bool bp_constraints_is_locked(struct perf_event *bp)
> +{
> +       struct mutex *mtx = get_task_sharded_mtx(bp);
> +
> +       return (mtx ? mutex_is_locked(mtx) : false) ||
> +              rwlock_is_contended(&bp_cpuinfo_lock);
> +}
> +
> +static inline void assert_bp_constraints_lock_held(struct perf_event *bp)
> +{
> +       lockdep_assert_held(&bp_cpuinfo_lock);
> +       /* Don't call get_task_sharded_mtx() if lockdep is disabled. */
> +       if (IS_ENABLED(CONFIG_LOCKDEP) && bp->hw.target)
> +               lockdep_assert_held(get_task_sharded_mtx(bp));
> +}
>
>  #ifdef hw_breakpoint_slots
>  /*
> @@ -103,7 +179,7 @@ static __init int init_breakpoint_slots(void)
>                 for (i = 0; i < TYPE_MAX; i++) {
>                         struct bp_cpuinfo *info = get_bp_info(cpu, i);
>
> -                       info->tsk_pinned = kcalloc(__nr_bp_slots[i], sizeof(int), GFP_KERNEL);
> +                       info->tsk_pinned = kcalloc(__nr_bp_slots[i], sizeof(atomic_t), GFP_KERNEL);
>                         if (!info->tsk_pinned)
>                                 goto err;
>                 }
> @@ -143,11 +219,19 @@ static inline enum bp_type_idx find_slot_idx(u64 bp_type)
>   */
>  static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
>  {
> -       unsigned int *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
> +       atomic_t *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
>         int i;
>
> +       /*
> +        * At this point we want to have acquired the bp_cpuinfo_lock as a
> +        * writer to ensure that there are no concurrent writers in
> +        * toggle_bp_task_slot() to tsk_pinned, and we get a stable snapshot.
> +        */
> +       lockdep_assert_held_write(&bp_cpuinfo_lock);
> +
>         for (i = hw_breakpoint_slots_cached(type) - 1; i >= 0; i--) {
> -               if (tsk_pinned[i] > 0)
> +               ASSERT_EXCLUSIVE_WRITER(tsk_pinned[i]); /* Catch unexpected writers. */
> +               if (atomic_read(&tsk_pinned[i]) > 0)
>                         return i + 1;
>         }
>
> @@ -164,6 +248,11 @@ static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
>         struct perf_event *iter;
>         int count = 0;
>
> +       /*
> +        * We need a stable snapshot of the per-task breakpoint list.
> +        */
> +       assert_bp_constraints_lock_held(bp);
> +
>         rcu_read_lock();
>         head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
>         if (!head)
> @@ -230,16 +319,25 @@ fetch_this_slot(struct bp_busy_slots *slots, int weight)
>  static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
>                                 enum bp_type_idx type, int weight)
>  {
> -       unsigned int *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
> +       atomic_t *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
>         int old_idx, new_idx;
>
> +       /*
> +        * If bp->hw.target, tsk_pinned is only modified, but not used
> +        * otherwise. We can permit concurrent updates as long as there are no
> +        * other uses: having acquired bp_cpuinfo_lock as a reader allows
> +        * concurrent updates here. Uses of tsk_pinned will require acquiring
> +        * bp_cpuinfo_lock as a writer to stabilize tsk_pinned's value.
> +        */
> +       lockdep_assert_held_read(&bp_cpuinfo_lock);
> +
>         old_idx = task_bp_pinned(cpu, bp, type) - 1;
>         new_idx = old_idx + weight;
>
>         if (old_idx >= 0)
> -               tsk_pinned[old_idx]--;
> +               atomic_dec(&tsk_pinned[old_idx]);
>         if (new_idx >= 0)
> -               tsk_pinned[new_idx]++;
> +               atomic_inc(&tsk_pinned[new_idx]);
>  }
>
>  /*
> @@ -257,6 +355,7 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
>
>         /* Pinned counter cpu profiling */
>         if (!bp->hw.target) {
> +               lockdep_assert_held_write(&bp_cpuinfo_lock);
>                 get_bp_info(bp->cpu, type)->cpu_pinned += weight;
>                 return 0;
>         }
> @@ -265,6 +364,11 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
>         for_each_cpu(cpu, cpumask)
>                 toggle_bp_task_slot(bp, cpu, type, weight);
>
> +       /*
> +        * Readers want a stable snapshot of the per-task breakpoint list.
> +        */
> +       assert_bp_constraints_lock_held(bp);
> +
>         if (enable)
>                 return rhltable_insert(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
>         else
> @@ -372,14 +476,10 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
>
>  int reserve_bp_slot(struct perf_event *bp)
>  {
> -       int ret;
> -
> -       mutex_lock(&nr_bp_mutex);
> -
> -       ret = __reserve_bp_slot(bp, bp->attr.bp_type);
> -
> -       mutex_unlock(&nr_bp_mutex);
> +       struct mutex *mtx = bp_constraints_lock(bp);
> +       int ret = __reserve_bp_slot(bp, bp->attr.bp_type);
>
> +       bp_constraints_unlock(mtx);
>         return ret;
>  }
>
> @@ -397,12 +497,11 @@ static void __release_bp_slot(struct perf_event *bp, u64 bp_type)
>
>  void release_bp_slot(struct perf_event *bp)
>  {
> -       mutex_lock(&nr_bp_mutex);
> +       struct mutex *mtx = bp_constraints_lock(bp);
>
>         arch_unregister_hw_breakpoint(bp);

If I understand this correctly, this can weaken protection for
arch_unregister_hw_breakpoint() and __modify_bp_slot(). Previously
they were globally serialized, but now several calls can run in
parallel. Is it OK?

>         __release_bp_slot(bp, bp->attr.bp_type);
> -
> -       mutex_unlock(&nr_bp_mutex);
> +       bp_constraints_unlock(mtx);
>  }
>
>  static int __modify_bp_slot(struct perf_event *bp, u64 old_type, u64 new_type)
> @@ -429,11 +528,10 @@ static int __modify_bp_slot(struct perf_event *bp, u64 old_type, u64 new_type)
>
>  static int modify_bp_slot(struct perf_event *bp, u64 old_type, u64 new_type)
>  {
> -       int ret;
> +       struct mutex *mtx = bp_constraints_lock(bp);
> +       int ret = __modify_bp_slot(bp, old_type, new_type);
>
> -       mutex_lock(&nr_bp_mutex);
> -       ret = __modify_bp_slot(bp, old_type, new_type);
> -       mutex_unlock(&nr_bp_mutex);
> +       bp_constraints_unlock(mtx);
>         return ret;
>  }
>
> @@ -444,7 +542,7 @@ static int modify_bp_slot(struct perf_event *bp, u64 old_type, u64 new_type)
>   */
>  int dbg_reserve_bp_slot(struct perf_event *bp)
>  {
> -       if (mutex_is_locked(&nr_bp_mutex))
> +       if (bp_constraints_is_locked(bp))
>                 return -1;
>
>         return __reserve_bp_slot(bp, bp->attr.bp_type);
> @@ -452,7 +550,7 @@ int dbg_reserve_bp_slot(struct perf_event *bp)
>
>  int dbg_release_bp_slot(struct perf_event *bp)
>  {
> -       if (mutex_is_locked(&nr_bp_mutex))
> +       if (bp_constraints_is_locked(bp))
>                 return -1;
>
>         __release_bp_slot(bp, bp->attr.bp_type);
> @@ -735,7 +833,10 @@ static struct pmu perf_breakpoint = {
>
>  int __init init_hw_breakpoint(void)
>  {
> -       int ret;
> +       int cpu, ret;
> +
> +       for_each_possible_cpu(cpu)
> +               mutex_init(&per_cpu(task_sharded_mtx, cpu));
>
>         ret = rhltable_init(&task_bps_ht, &task_bps_ht_params);
>         if (ret)
> --
> 2.36.1.255.ge46751e96f-goog
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints
  2022-06-09 12:53     ` Marco Elver
@ 2022-06-09 13:05       ` Dmitry Vyukov
  0 siblings, 0 replies; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-09 13:05 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 14:53, Marco Elver <elver@google.com> wrote:
>
> On Thu, Jun 09, 2022 at 02:30PM +0200, Dmitry Vyukov wrote:
> [...]
> > > +       rcu_read_lock();
> >
> > Why do we need rcu_read_lock() here?
> > The patch does not change anything with respect to locking, so all
> > accesses to the container should still be protected by nr_bp_mutex.
> > Similarly for the rcu variant of for_each below.
> [...]
> > > +       head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
> > > +       if (!head)
> > > +               goto out;
> > > +
> > > +       rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
>
> It's part of rhashtable's interface requirements:
>
>         /**
>          * rhltable_lookup - search hash list table
>          * @hlt:        hash table
>          * @key:        the pointer to the key
>          * @params:     hash table parameters
>          *
>          * Computes the hash value for the key and traverses the bucket chain looking
>          * for a entry with an identical key.  All matching entries are returned
>          * in a list.
>          *
>          * This must only be called under the RCU read lock.
>          *
>          * Returns the list of entries that match the given key.
>          */
>
> Beyond that, even though there don't appear to be any concurrent
> rhashtable modifications yet, they'll be allowed in patch 6/8.
> Furthermore, rhashtable actually does concurrent background compaction
> since I selected 'automatic_shrinking = true' (so we don't leak tons of
> memory after starting and killing those 1000s of tasks) -- there's a
> call_rcu() in lib/rhashtable.c that looks like where the RCU protection
> comes in. This compaction runs from deferred work in rht_deferred_worker().


I see. Thanks.

Reviewed-by: Dmitry Vyukov <dvyukov@google.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/8] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
  2022-06-09 12:03   ` Dmitry Vyukov
  2022-06-09 12:08     ` Marco Elver
@ 2022-06-09 13:25     ` Peter Zijlstra
  1 sibling, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2022-06-09 13:25 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Marco Elver, Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, linux-perf-users, x86, linux-sh,
	kasan-dev, linux-kernel

On Thu, Jun 09, 2022 at 02:03:12PM +0200, Dmitry Vyukov wrote:

> > -__weak int hw_breakpoint_weight(struct perf_event *bp)
> 
> Humm... this was added in 2010 and never actually used to return
> anything other than 1 since then (?). Looks like over-design. Maybe we
> drop "#ifndef" and add a comment instead?

Frederic, do you have any recollection what this was supposed to do?


* Re: [PATCH 6/8] perf/hw_breakpoint: Reduce contention with large number of tasks
  2022-06-09 13:03   ` Dmitry Vyukov
@ 2022-06-09 13:29     ` Marco Elver
  0 siblings, 0 replies; 34+ messages in thread
From: Marco Elver @ 2022-06-09 13:29 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, Jun 09, 2022 at 03:03PM +0200, Dmitry Vyukov wrote:
[...]
> > -/* Serialize accesses to the above constraints */
> > -static DEFINE_MUTEX(nr_bp_mutex);
> > +/*
> > + * Synchronizes accesses to the per-CPU constraints; users of data in bp_cpuinfo
> > + * must acquire bp_cpuinfo_lock as writer to get a stable snapshot of all CPUs'
> > + * constraints. Modifications without use may only acquire bp_cpuinfo_lock as a
> > + * reader, but must otherwise ensure modifications are never lost.
> > + */
> 
> I can't understand this comment.
> Modifications need to acquire in read mode, while only users must
> acquire in write mode. Shouldn't it be the other way around? What is
> "Modifications without use"?

Right, maybe this comment needs tweaking.

The main rules are -- the obvious ones:

	 - plain reads are ok with just a read-lock (target is task,
	   reading 'cpu_pinned');

	 - plain writes need a write-lock (target is CPU, writing
	   'cpu_pinned');

the not so obvious one:

	- "modification without use" are the increment/decrement of
	  tsk_pinned done if the target is a task; in this case, we can
	  happily allow concurrent _atomic_ increments/decrements from
	  different tasks as long as there is no "use" i.e. read the
	  value and check it to make a decision if there is space or not
	  (this is only done by CPU targets).

So the main idea is that the rwlock, when held as a reader, permits these
"modifications without use" to happen concurrently from task targets, but
blocks a CPU target wishing to get a stable snapshot until it can acquire
the rwlock as a writer.

The modifications done by task targets are done on atomic variables, so
we never lose any increments/decrements, but while these modifications
are going on, the global view of tsk_pinned may be inconsistent.
However, we know that once a CPU target acquires the rwlock as a writer,
there are no more "readers" -- or rather, no task targets that can update
tsk_pinned concurrently -- and therefore tsk_pinned must be stable once we
hold the rwlock as a writer.
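
In code, the intended pattern is roughly the following (just a sketch;
'info' and 'idx' stand in for the real per-CPU accounting in the patch):

        /*
         * Task target ("modification without use"): tsk_pinned entries are
         * atomic, so concurrent updaters only need the lock as readers.
         */
        read_lock(&bp_cpuinfo_lock);
        atomic_inc(&info->tsk_pinned[idx]);   /* never lost, transiently inconsistent */
        read_unlock(&bp_cpuinfo_lock);

        /*
         * CPU target (a "use"): needs a stable snapshot of all CPUs'
         * constraints, so it excludes all task-target updaters by taking
         * the lock as a writer.
         */
        write_lock(&bp_cpuinfo_lock);
        /* ... read cpu_pinned/tsk_pinned across CPUs, check for a free slot ... */
        write_unlock(&bp_cpuinfo_lock);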

I'll have to think some more how to best update the comment...

> > +static DEFINE_RWLOCK(bp_cpuinfo_lock);
> > +
> > +/*
> > + * Synchronizes accesses to the per-task breakpoint list in task_bps_ht. Since
> > + * rhltable synchronizes concurrent insertions/deletions, independent tasks may
> > + * insert/delete concurrently; therefore, a mutex per task would be sufficient.
> > + *
> > + * To avoid bloating task_struct with infrequently used data, use a sharded
> > + * mutex that scales with number of CPUs.
> > + */
> > +static DEFINE_PER_CPU(struct mutex, task_sharded_mtx);
> > +
> > +static struct mutex *get_task_sharded_mtx(struct perf_event *bp)
> > +{
> > +       int shard;
> > +
> > +       if (!bp->hw.target)
> > +               return NULL;
> > +
> > +       /*
> > +        * Compute a valid shard index into per-CPU data.
> > +        */
> > +       shard = task_pid_nr(bp->hw.target) % nr_cpu_ids;
> > +       shard = cpumask_next(shard - 1, cpu_possible_mask);
> > +       if (shard >= nr_cpu_ids)
> > +               shard = cpumask_first(cpu_possible_mask);
> > +
> > +       return per_cpu_ptr(&task_sharded_mtx, shard);
> > +}
> > +
> > +static struct mutex *bp_constraints_lock(struct perf_event *bp)
> > +{
> > +       struct mutex *mtx = get_task_sharded_mtx(bp);
> > +
> > +       if (mtx) {
> > +               mutex_lock(mtx);
> > +               read_lock(&bp_cpuinfo_lock);
> 
> Is the NR_CPUS == 1 case still important to optimize? I guess with small
> VMs it may be important again.
> If so, we could just write-lock bp_cpuinfo_lock always if NR_CPUS == 1.

Not sure, I guess it's easy to add the check for NR_CPUS==1.
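
Something like this in bp_constraints_lock(), perhaps (sketch only, on top
of the patch as posted, with the matching change in the unlock path):

        if (mtx) {
                mutex_lock(mtx);
                /*
                 * NR_CPUS is a compile-time constant: on an NR_CPUS == 1
                 * build there is no reader concurrency worth exploiting,
                 * so always take the lock as a writer there.
                 */
                if (NR_CPUS > 1)
                        read_lock(&bp_cpuinfo_lock);
                else
                        write_lock(&bp_cpuinfo_lock);
        }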

[...]
> > @@ -397,12 +497,11 @@ static void __release_bp_slot(struct perf_event *bp, u64 bp_type)
> >
> >  void release_bp_slot(struct perf_event *bp)
> >  {
> > -       mutex_lock(&nr_bp_mutex);
> > +       struct mutex *mtx = bp_constraints_lock(bp);
> >
> >         arch_unregister_hw_breakpoint(bp);
> 
> If I understand this correctly, this can weaken protection for
> arch_unregister_hw_breakpoint() and __modify_bp_slot(). Previously
> they were globally serialized, but now several calls can run in
> parallel. Is it OK?

__modify_bp_slot() just calls __release_bp_slot() and
__reserve_bp_slot(), which are related to constraints accounting and are
entirely internal to hw_breakpoint.

Only ppc overrides some of these arch_ functions. In arch/powerpc,
arch_unregister_hw_breakpoint() looks like it only accesses
bp->ctx->task, so that looks ok; however, it looks like
arch_release_bp_slot() might want its own lock because it mutates a
list, but that lock wants to live in powerpc code.


* Re: [PATCH 5/8] perf/hw_breakpoint: Remove useless code related to flexible breakpoints
  2022-06-09 12:04   ` Dmitry Vyukov
@ 2022-06-09 13:41     ` Dmitry Vyukov
  2022-06-09 14:00       ` Marco Elver
  0 siblings, 1 reply; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-09 13:41 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 14:04, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
> >
> > Flexible breakpoints have never been implemented, with
> > bp_cpuinfo::flexible always being 0. Unfortunately, they still occupy 4
> > bytes in each bp_cpuinfo and bp_busy_slots, as well as computing the max
> > flexible count in fetch_bp_busy_slots().
> >
> > This again causes suboptimal code generation, when we always know that
> > `!!slots.flexible` will be 0.
> >
> > Just get rid of the flexible "placeholder" and remove all real code
> > related to it. Make a note in the comment related to the constraints
> > algorithm but don't remove them from the algorithm, so that if in future
> > flexible breakpoints need supporting, it should be trivial to revive
> > them (along with reverting this change).
> >
> > Signed-off-by: Marco Elver <elver@google.com>
>
> Was added in 2009.
>
> Acked-by: Dmitry Vyukov <dvyukov@google.com>
>
> > ---
> >  kernel/events/hw_breakpoint.c | 12 +++---------
> >  1 file changed, 3 insertions(+), 9 deletions(-)
> >
> > diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> > index 5f40c8dfa042..afe0a6007e96 100644
> > --- a/kernel/events/hw_breakpoint.c
> > +++ b/kernel/events/hw_breakpoint.c
> > @@ -46,8 +46,6 @@ struct bp_cpuinfo {
> >  #else
> >         unsigned int    *tsk_pinned;
> >  #endif
> > -       /* Number of non-pinned cpu/task breakpoints in a cpu */
> > -       unsigned int    flexible; /* XXX: placeholder, see fetch_this_slot() */
> >  };
> >
> >  static DEFINE_PER_CPU(struct bp_cpuinfo, bp_cpuinfo[TYPE_MAX]);
> > @@ -71,7 +69,6 @@ static bool constraints_initialized __ro_after_init;
> >  /* Gather the number of total pinned and un-pinned bp in a cpuset */
> >  struct bp_busy_slots {

Do we also want to remove this struct altogether? Now it becomes just
an int counter.

> >         unsigned int pinned;
> > -       unsigned int flexible;
> >  };
> >
> >  /* Serialize accesses to the above constraints */
> > @@ -213,10 +210,6 @@ fetch_bp_busy_slots(struct bp_busy_slots *slots, struct perf_event *bp,
> >
> >                 if (nr > slots->pinned)
> >                         slots->pinned = nr;
> > -
> > -               nr = info->flexible;
> > -               if (nr > slots->flexible)
> > -                       slots->flexible = nr;
> >         }
> >  }
> >
> > @@ -299,7 +292,8 @@ __weak void arch_unregister_hw_breakpoint(struct perf_event *bp)
> >  }
> >
> >  /*
> > - * Constraints to check before allowing this new breakpoint counter:
> > + * Constraints to check before allowing this new breakpoint counter. Note that
> > + * flexible breakpoints are currently unsupported -- see fetch_this_slot().
> >   *
> >   *  == Non-pinned counter == (Considered as pinned for now)
> >   *
> > @@ -366,7 +360,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
> >         fetch_this_slot(&slots, weight);
> >
> >         /* Flexible counters need to keep at least one slot */
> > -       if (slots.pinned + (!!slots.flexible) > hw_breakpoint_slots_cached(type))
> > +       if (slots.pinned > hw_breakpoint_slots_cached(type))
> >                 return -ENOSPC;
> >
> >         ret = arch_reserve_bp_slot(bp);
> > --
> > 2.36.1.255.ge46751e96f-goog
> >


* Re: [PATCH 5/8] perf/hw_breakpoint: Remove useless code related to flexible breakpoints
  2022-06-09 13:41     ` Dmitry Vyukov
@ 2022-06-09 14:00       ` Marco Elver
  0 siblings, 0 replies; 34+ messages in thread
From: Marco Elver @ 2022-06-09 14:00 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 15:41, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Thu, 9 Jun 2022 at 14:04, Dmitry Vyukov <dvyukov@google.com> wrote:
> >
> > On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
> > >
> > > Flexible breakpoints have never been implemented, with
> > > bp_cpuinfo::flexible always being 0. Unfortunately, they still occupy 4
> > > bytes in each bp_cpuinfo and bp_busy_slots, as well as computing the max
> > > flexible count in fetch_bp_busy_slots().
> > >
> > > This again causes suboptimal code generation, when we always know that
> > > `!!slots.flexible` will be 0.
> > >
> > > Just get rid of the flexible "placeholder" and remove all real code
> > > related to it. Make a note in the comment related to the constraints
> > > algorithm but don't remove them from the algorithm, so that if in future
> > > flexible breakpoints need supporting, it should be trivial to revive
> > > them (along with reverting this change).
> > >
> > > Signed-off-by: Marco Elver <elver@google.com>
> >
> > Was added in 2009.
> >
> > Acked-by: Dmitry Vyukov <dvyukov@google.com>
> >
> > > ---
> > >  kernel/events/hw_breakpoint.c | 12 +++---------
> > >  1 file changed, 3 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> > > index 5f40c8dfa042..afe0a6007e96 100644
> > > --- a/kernel/events/hw_breakpoint.c
> > > +++ b/kernel/events/hw_breakpoint.c
> > > @@ -46,8 +46,6 @@ struct bp_cpuinfo {
> > >  #else
> > >         unsigned int    *tsk_pinned;
> > >  #endif
> > > -       /* Number of non-pinned cpu/task breakpoints in a cpu */
> > > -       unsigned int    flexible; /* XXX: placeholder, see fetch_this_slot() */
> > >  };
> > >
> > >  static DEFINE_PER_CPU(struct bp_cpuinfo, bp_cpuinfo[TYPE_MAX]);
> > > @@ -71,7 +69,6 @@ static bool constraints_initialized __ro_after_init;
> > >  /* Gather the number of total pinned and un-pinned bp in a cpuset */
> > >  struct bp_busy_slots {
>
> Do we also want to remove this struct altogether? Now it becomes just
> an int counter.

Yes, that actually can simplify a bunch of things: fetch_bp_busy_slots()
can just return an int, and fetch_this_slot() can then be removed (it'll
be even cleaner if we remove the overridable weight).

I'll simplify unless I hear objections.
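
Roughly along these lines (a sketch only; max_bp_pinned_slots() is a
made-up name):

/*
 * Return the max number of pinned breakpoints on the CPUs this breakpoint
 * can run on; replaces struct bp_busy_slots and fetch_bp_busy_slots().
 */
static int max_bp_pinned_slots(struct perf_event *bp, enum bp_type_idx type)
{
        const struct cpumask *cpumask = cpumask_of_bp(bp);
        int pinned_slots = 0;
        int cpu;

        for_each_cpu(cpu, cpumask) {
                struct bp_cpuinfo *info = get_bp_info(cpu, type);
                int nr;

                nr = info->cpu_pinned;
                if (!bp->hw.target)
                        nr += max_task_bp_pinned(cpu, type);
                else
                        nr += task_bp_pinned(cpu, bp, type);

                pinned_slots = max(nr, pinned_slots);
        }

        return pinned_slots;
}

__reserve_bp_slot() would then just compare
max_bp_pinned_slots(bp, type) + hw_breakpoint_weight(bp) against
hw_breakpoint_slots_cached(type).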


* Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints
  2022-06-09 11:30 ` [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints Marco Elver
  2022-06-09 12:30   ` Dmitry Vyukov
@ 2022-06-09 14:29   ` Dmitry Vyukov
  2022-06-09 14:55     ` Marco Elver
  1 sibling, 1 reply; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-09 14:29 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
>
> On a machine with 256 CPUs, running the recently added perf breakpoint
> benchmark results in:
>
>  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
>  |      Total time: 236.418 [sec]
>  |
>  |   123134.794271 usecs/op
>  |  7880626.833333 usecs/op/cpu
>
> The benchmark tests inherited breakpoint perf events across many
> threads.
>
> Looking at a perf profile, we can see that the majority of the time is
> spent in various hw_breakpoint.c functions, which execute within the
> 'nr_bp_mutex' critical sections which then results in contention on that
> mutex as well:
>
>     37.27%  [kernel]       [k] osq_lock
>     34.92%  [kernel]       [k] mutex_spin_on_owner
>     12.15%  [kernel]       [k] toggle_bp_slot
>     11.90%  [kernel]       [k] __reserve_bp_slot
>
> The culprit here is task_bp_pinned(), which has a runtime complexity of
> O(#tasks) due to storing all task breakpoints in the same list and
> iterating through that list looking for a matching task. Clearly, this
> does not scale to thousands of tasks.
>
> While one option would be to make task_struct a breakpoint list node,
> this would only further bloat task_struct for infrequently used data.

task_struct already has:

#ifdef CONFIG_PERF_EVENTS
  struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
  struct mutex perf_event_mutex;
  struct list_head perf_event_list;
#endif

Wonder if it's possible to use perf_event_mutex instead of the task_sharded_mtx?
And possibly perf_event_list instead of task_bps_ht? It will contain
other perf_event types, so we will need to test type as well, but on
the positive side, we don't need any management of the separate
container.




> Instead, make use of the "rhashtable" variant "rhltable" which stores
> multiple items with the same key in a list. This results in average
> runtime complexity of O(1) for task_bp_pinned().
>
> With the optimization, the benchmark shows:
>
>  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
>  |      Total time: 0.208 [sec]
>  |
>  |      108.422396 usecs/op
>  |     6939.033333 usecs/op/cpu
>
> On this particular setup that's a speedup of ~1135x.
>
> Signed-off-by: Marco Elver <elver@google.com>
> ---
>  include/linux/perf_event.h    |  3 +-
>  kernel/events/hw_breakpoint.c | 56 ++++++++++++++++++++++-------------
>  2 files changed, 37 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 01231f1d976c..e27360436dc6 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -36,6 +36,7 @@ struct perf_guest_info_callbacks {
>  };
>
>  #ifdef CONFIG_HAVE_HW_BREAKPOINT
> +#include <linux/rhashtable-types.h>
>  #include <asm/hw_breakpoint.h>
>  #endif
>
> @@ -178,7 +179,7 @@ struct hw_perf_event {
>                          * creation and event initalization.
>                          */
>                         struct arch_hw_breakpoint       info;
> -                       struct list_head                bp_list;
> +                       struct rhlist_head              bp_list;
>                 };
>  #endif
>                 struct { /* amd_iommu */
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index f32320ac02fd..25c94c6e918d 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -28,7 +28,7 @@
>  #include <linux/sched.h>
>  #include <linux/init.h>
>  #include <linux/slab.h>
> -#include <linux/list.h>
> +#include <linux/rhashtable.h>
>  #include <linux/cpu.h>
>  #include <linux/smp.h>
>  #include <linux/bug.h>
> @@ -55,7 +55,13 @@ static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
>  }
>
>  /* Keep track of the breakpoints attached to tasks */
> -static LIST_HEAD(bp_task_head);
> +static struct rhltable task_bps_ht;
> +static const struct rhashtable_params task_bps_ht_params = {
> +       .head_offset = offsetof(struct hw_perf_event, bp_list),
> +       .key_offset = offsetof(struct hw_perf_event, target),
> +       .key_len = sizeof_field(struct hw_perf_event, target),
> +       .automatic_shrinking = true,
> +};
>
>  static int constraints_initialized;
>
> @@ -104,17 +110,23 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
>   */
>  static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
>  {
> -       struct task_struct *tsk = bp->hw.target;
> +       struct rhlist_head *head, *pos;
>         struct perf_event *iter;
>         int count = 0;
>
> -       list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
> -               if (iter->hw.target == tsk &&
> -                   find_slot_idx(iter->attr.bp_type) == type &&
> +       rcu_read_lock();
> +       head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
> +       if (!head)
> +               goto out;
> +
> +       rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
> +               if (find_slot_idx(iter->attr.bp_type) == type &&
>                     (iter->cpu < 0 || cpu == iter->cpu))
>                         count += hw_breakpoint_weight(iter);
>         }
>
> +out:
> +       rcu_read_unlock();
>         return count;
>  }
>
> @@ -187,7 +199,7 @@ static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
>  /*
>   * Add/remove the given breakpoint in our constraint table
>   */
> -static void
> +static int
>  toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
>                int weight)
>  {
> @@ -200,7 +212,7 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
>         /* Pinned counter cpu profiling */
>         if (!bp->hw.target) {
>                 get_bp_info(bp->cpu, type)->cpu_pinned += weight;
> -               return;
> +               return 0;
>         }
>
>         /* Pinned counter task profiling */
> @@ -208,9 +220,9 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
>                 toggle_bp_task_slot(bp, cpu, type, weight);
>
>         if (enable)
> -               list_add_tail(&bp->hw.bp_list, &bp_task_head);
> +               return rhltable_insert(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
>         else
> -               list_del(&bp->hw.bp_list);
> +               return rhltable_remove(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
>  }
>
>  __weak int arch_reserve_bp_slot(struct perf_event *bp)
> @@ -308,9 +320,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
>         if (ret)
>                 return ret;
>
> -       toggle_bp_slot(bp, true, type, weight);
> -
> -       return 0;
> +       return toggle_bp_slot(bp, true, type, weight);
>  }
>
>  int reserve_bp_slot(struct perf_event *bp)
> @@ -335,7 +345,7 @@ static void __release_bp_slot(struct perf_event *bp, u64 bp_type)
>
>         type = find_slot_idx(bp_type);
>         weight = hw_breakpoint_weight(bp);
> -       toggle_bp_slot(bp, false, type, weight);
> +       WARN_ON(toggle_bp_slot(bp, false, type, weight));
>  }
>
>  void release_bp_slot(struct perf_event *bp)
> @@ -679,7 +689,7 @@ static struct pmu perf_breakpoint = {
>  int __init init_hw_breakpoint(void)
>  {
>         int cpu, err_cpu;
> -       int i;
> +       int i, ret;
>
>         for (i = 0; i < TYPE_MAX; i++)
>                 nr_slots[i] = hw_breakpoint_slots(i);
> @@ -690,18 +700,24 @@ int __init init_hw_breakpoint(void)
>
>                         info->tsk_pinned = kcalloc(nr_slots[i], sizeof(int),
>                                                         GFP_KERNEL);
> -                       if (!info->tsk_pinned)
> -                               goto err_alloc;
> +                       if (!info->tsk_pinned) {
> +                               ret = -ENOMEM;
> +                               goto err;
> +                       }
>                 }
>         }
>
> +       ret = rhltable_init(&task_bps_ht, &task_bps_ht_params);
> +       if (ret)
> +               goto err;
> +
>         constraints_initialized = 1;
>
>         perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT);
>
>         return register_die_notifier(&hw_breakpoint_exceptions_nb);
>
> - err_alloc:
> +err:
>         for_each_possible_cpu(err_cpu) {
>                 for (i = 0; i < TYPE_MAX; i++)
>                         kfree(get_bp_info(err_cpu, i)->tsk_pinned);
> @@ -709,7 +725,5 @@ int __init init_hw_breakpoint(void)
>                         break;
>         }
>
> -       return -ENOMEM;
> +       return ret;
>  }
> -
> -
> --
> 2.36.1.255.ge46751e96f-goog
>


* Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints
  2022-06-09 14:29   ` Dmitry Vyukov
@ 2022-06-09 14:55     ` Marco Elver
  2022-06-09 16:53       ` Dmitry Vyukov
  0 siblings, 1 reply; 34+ messages in thread
From: Marco Elver @ 2022-06-09 14:55 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 16:29, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
> >
> > On a machine with 256 CPUs, running the recently added perf breakpoint
> > benchmark results in:
> >
> >  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> >  | # Running 'breakpoint/thread' benchmark:
> >  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> >  |      Total time: 236.418 [sec]
> >  |
> >  |   123134.794271 usecs/op
> >  |  7880626.833333 usecs/op/cpu
> >
> > The benchmark tests inherited breakpoint perf events across many
> > threads.
> >
> > Looking at a perf profile, we can see that the majority of the time is
> > spent in various hw_breakpoint.c functions, which execute within the
> > 'nr_bp_mutex' critical sections which then results in contention on that
> > mutex as well:
> >
> >     37.27%  [kernel]       [k] osq_lock
> >     34.92%  [kernel]       [k] mutex_spin_on_owner
> >     12.15%  [kernel]       [k] toggle_bp_slot
> >     11.90%  [kernel]       [k] __reserve_bp_slot
> >
> > The culprit here is task_bp_pinned(), which has a runtime complexity of
> > O(#tasks) due to storing all task breakpoints in the same list and
> > iterating through that list looking for a matching task. Clearly, this
> > does not scale to thousands of tasks.
> >
> > While one option would be to make task_struct a breakpoint list node,
> > this would only further bloat task_struct for infrequently used data.
>
> task_struct already has:
>
> #ifdef CONFIG_PERF_EVENTS
>   struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
>   struct mutex perf_event_mutex;
>   struct list_head perf_event_list;
> #endif
>
> Wonder if it's possible to use perf_event_mutex instead of the task_sharded_mtx?
> And possibly perf_event_list instead of task_bps_ht? It will contain
> other perf_event types, so we will need to test type as well, but on
> the positive side, we don't need any management of the separate
> container.

Hmm, yes, I looked at that but then decided against messing the
perf/core internals. The main issue I have with using perf_event_mutex
is that we might interfere with perf/core's locking rules as well as
interfere with other concurrent perf event additions. Using
perf_event_list is very likely a no-go because it requires reworking
perf/core as well.

I can already hear Peter shouting, but maybe I'm wrong. :-)


* Re: [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent
  2022-06-09 11:30 ` [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent Marco Elver
@ 2022-06-09 15:00   ` Dmitry Vyukov
  2022-06-10  8:25     ` Marco Elver
  0 siblings, 1 reply; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-09 15:00 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
>
> Running the perf benchmark with (note: more aggressive parameters vs.
> preceding changes, but same host with 256 CPUs):
>
>  | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
>  |      Total time: 1.953 [sec]
>  |
>  |       38.146289 usecs/op
>  |     4882.725000 usecs/op/cpu
>
>     16.29%  [kernel]       [k] rhashtable_jhash2
>     16.19%  [kernel]       [k] osq_lock
>     14.22%  [kernel]       [k] queued_spin_lock_slowpath
>      8.58%  [kernel]       [k] task_bp_pinned
>      8.30%  [kernel]       [k] mutex_spin_on_owner
>      4.03%  [kernel]       [k] smp_cfm_core_cond
>      2.97%  [kernel]       [k] toggle_bp_slot
>      2.94%  [kernel]       [k] bcmp
>
> We can see that a majority of the time is now spent hashing task
> pointers to index into task_bps_ht in task_bp_pinned().
>
> However, if task_bp_pinned()'s computation is independent of any CPU,
> i.e. always `iter->cpu < 0`, the result for each invocation will be
> identical. With increasing CPU-count, this problem worsens.
>
> Instead, identify if every call to task_bp_pinned() is CPU-independent,
> and cache the result. Use the cached result instead of a call to
> task_bp_pinned(), now __task_bp_pinned(), with task_bp_pinned() deciding
> if the cached result can be used.
>
> After this optimization:
>
>     21.96%  [kernel]       [k] queued_spin_lock_slowpath
>     16.39%  [kernel]       [k] osq_lock
>      9.82%  [kernel]       [k] toggle_bp_slot
>      9.81%  [kernel]       [k] find_next_bit
>      4.93%  [kernel]       [k] mutex_spin_on_owner
>      4.71%  [kernel]       [k] smp_cfm_core_cond
>      4.30%  [kernel]       [k] __reserve_bp_slot
>      2.65%  [kernel]       [k] cpumask_next
>
> Showing that the time spent hashing keys has become insignificant.
>
> With the given benchmark parameters, however, we see no statistically
> significant improvement in performance on the test system with 256 CPUs.
> This is very likely due to the benchmark parameters being too aggressive
> and contention elsewhere becoming dominant.
>
> Indeed, when using the less aggressive parameters from the preceding
> changes, we now observe:
>
>  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
>  | # Running 'breakpoint/thread' benchmark:
>  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
>  |      Total time: 0.071 [sec]
>  |
>  |       37.134896 usecs/op
>  |     2376.633333 usecs/op/cpu
>
> Which is an improvement of 12% compared to without this optimization
> (baseline is 42 usecs/op). This is now only 5% slower than the
> theoretical ideal (constraints disabled), and 18% slower than no
> breakpoints at all.
>
> [ While we're here, swap task_bp_pinned()'s bp and cpu arguments to be
>   more consistent with other functions (which have bp first, before the
>   cpu argument). ]

There are 3 main cases:
1. Per-cpu bp.
2. Per-task and per-cpu bp.
3. Per-task bp (on all cpus)
right?

For case 1 we still seem to do lots of unnecessary work in
fetch_bp_busy_slots() by iterating over all CPUs. We are going to bump
only the CPU's cpu_pinned, so that's the only CPU we need to
fetch/check.

For case 2 we also do lots of unnecessary work, again we also need to
check only 1 CPU (don't need cached_tbp_pinned). Also don't need to do
atomic_dec/inc on all other CPUs (they dec/inc the same variable).

Case 3 is the only one when we need to check all CPUs and
cached_tbp_pinned may be useful.
But I wonder if we could instead add a per-task
has_per_cpu_breakpoints flag. Then if the flag is set, we check all
CPUs as we do now (don't need cached_tbp_pinned). And if it's not set,
then we could optimize the code even more by making it O(1) instead of
O(N). Namely, we add global tsk_pinned for tasks that don't have
per-cpu breakpoints, and we update only that tsk_pinned instead of
iterating over all CPUs.
I think this will require adding cpu_pinned as well (similar to
tsk_pinned but aggregated over all CPUs).
Then the fast path capacity check can become just:

if (bp->hw.target && !bp->hw.target->has_per_cpu_breakpoints && bp->cpu < 0) {
  if (max_cpu_bp_pinned(type) + task_bp_pinned(-1 /*cpu*/, bp, type) +
hw_breakpoint_weight(bp) > nr_slots[type])
    return -ENOSPC;
}

Does it make any sense?


* Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints
  2022-06-09 14:55     ` Marco Elver
@ 2022-06-09 16:53       ` Dmitry Vyukov
  2022-06-09 18:37         ` Marco Elver
  0 siblings, 1 reply; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-09 16:53 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 16:56, Marco Elver <elver@google.com> wrote:
> > > On a machine with 256 CPUs, running the recently added perf breakpoint
> > > benchmark results in:
> > >
> > >  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > >  | # Running 'breakpoint/thread' benchmark:
> > >  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > >  |      Total time: 236.418 [sec]
> > >  |
> > >  |   123134.794271 usecs/op
> > >  |  7880626.833333 usecs/op/cpu
> > >
> > > The benchmark tests inherited breakpoint perf events across many
> > > threads.
> > >
> > > Looking at a perf profile, we can see that the majority of the time is
> > > spent in various hw_breakpoint.c functions, which execute within the
> > > 'nr_bp_mutex' critical sections which then results in contention on that
> > > mutex as well:
> > >
> > >     37.27%  [kernel]       [k] osq_lock
> > >     34.92%  [kernel]       [k] mutex_spin_on_owner
> > >     12.15%  [kernel]       [k] toggle_bp_slot
> > >     11.90%  [kernel]       [k] __reserve_bp_slot
> > >
> > > The culprit here is task_bp_pinned(), which has a runtime complexity of
> > > O(#tasks) due to storing all task breakpoints in the same list and
> > > iterating through that list looking for a matching task. Clearly, this
> > > does not scale to thousands of tasks.
> > >
> > > While one option would be to make task_struct a breakpoint list node,
> > > this would only further bloat task_struct for infrequently used data.
> >
> > task_struct already has:
> >
> > #ifdef CONFIG_PERF_EVENTS
> >   struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
> >   struct mutex perf_event_mutex;
> >   struct list_head perf_event_list;
> > #endif
> >
> > Wonder if it's possible to use perf_event_mutex instead of the task_sharded_mtx?
> > And possibly perf_event_list instead of task_bps_ht? It will contain
> > other perf_event types, so we will need to test type as well, but on
> > the positive side, we don't need any management of the separate
> > container.
>
> Hmm, yes, I looked at that but then decided against messing the
> perf/core internals. The main issue I have with using perf_event_mutex
> is that we might interfere with perf/core's locking rules as well as
> interfere with other concurrent perf event additions. Using
> perf_event_list is very likely a no-go because it requires reworking
> perf/core as well.
>
> I can already hear Peter shouting, but maybe I'm wrong. :-)

Let's wait for Peter to shout then :)
A significant part of this change is about having per-task data without
actually storing per-task data in task_struct.

The current perf-related data in task_struct is already multiple words
and it's also not used in lots of production cases.
Maybe we could have something like:

  struct perf_task_data* lazily_allocated_perf_data;

that's lazily allocated on first use instead of the current
perf_event_ctxp/perf_event_mutex/perf_event_list.
This way we could both reduce task_struct size when perf is not used and have
more perf-related data (incl breakpoints) when it's used.
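
I.e., roughly (a sketch, all names made up):

struct perf_task_data {
        struct perf_event_context       *perf_event_ctxp[perf_nr_task_contexts];
        struct mutex                    perf_event_mutex;
        struct list_head                perf_event_list;
        /* ... room for per-task hw_breakpoint state ... */
};

with task_struct then carrying only a single, lazily allocated pointer:

        struct perf_task_data           *perf_task_data;        /* NULL until first use */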


* Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints
  2022-06-09 16:53       ` Dmitry Vyukov
@ 2022-06-09 18:37         ` Marco Elver
  2022-06-10  9:04           ` Dmitry Vyukov
  0 siblings, 1 reply; 34+ messages in thread
From: Marco Elver @ 2022-06-09 18:37 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 18:53, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Thu, 9 Jun 2022 at 16:56, Marco Elver <elver@google.com> wrote:
> > > > On a machine with 256 CPUs, running the recently added perf breakpoint
> > > > benchmark results in:
> > > >
> > > >  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > > >  | # Running 'breakpoint/thread' benchmark:
> > > >  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > > >  |      Total time: 236.418 [sec]
> > > >  |
> > > >  |   123134.794271 usecs/op
> > > >  |  7880626.833333 usecs/op/cpu
> > > >
> > > > The benchmark tests inherited breakpoint perf events across many
> > > > threads.
> > > >
> > > > Looking at a perf profile, we can see that the majority of the time is
> > > > spent in various hw_breakpoint.c functions, which execute within the
> > > > 'nr_bp_mutex' critical sections which then results in contention on that
> > > > mutex as well:
> > > >
> > > >     37.27%  [kernel]       [k] osq_lock
> > > >     34.92%  [kernel]       [k] mutex_spin_on_owner
> > > >     12.15%  [kernel]       [k] toggle_bp_slot
> > > >     11.90%  [kernel]       [k] __reserve_bp_slot
> > > >
> > > > The culprit here is task_bp_pinned(), which has a runtime complexity of
> > > > O(#tasks) due to storing all task breakpoints in the same list and
> > > > iterating through that list looking for a matching task. Clearly, this
> > > > does not scale to thousands of tasks.
> > > >
> > > > While one option would be to make task_struct a breakpoint list node,
> > > > this would only further bloat task_struct for infrequently used data.
> > >
> > > task_struct already has:
> > >
> > > #ifdef CONFIG_PERF_EVENTS
> > >   struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
> > >   struct mutex perf_event_mutex;
> > >   struct list_head perf_event_list;
> > > #endif
> > >
> > > Wonder if it's possible to use perf_event_mutex instead of the task_sharded_mtx?
> > > And possibly perf_event_list instead of task_bps_ht? It will contain
> > > other perf_event types, so we will need to test type as well, but on
> > > the positive side, we don't need any management of the separate
> > > container.
> >
> > Hmm, yes, I looked at that but then decided against messing the
> > perf/core internals. The main issue I have with using perf_event_mutex
> > is that we might interfere with perf/core's locking rules as well as
> > interfere with other concurrent perf event additions. Using
> > perf_event_list is very likely a no-go because it requires reworking
> > perf/core as well.
> >
> > I can already hear Peter shouting, but maybe I'm wrong. :-)
>
> Let's wait for Peter to shout then :)
> A significant part of this change is having per-task data w/o having
> per-task data.
>
> The current perf-related data in task_struct is already multiple words
> and it's also not used in lots of production cases.
> Maybe we could have something like:
>
>   struct perf_task_data* lazily_allocated_perf_data;
>
> that's lazily allocated on first use instead of the current
> perf_event_ctxp/perf_event_mutex/perf_event_list.
> This way we could both reduce task_size when perf is not used and have
> more perf-related data (incl breakpoints) when it's used.

I don't mind either option, so keeping task_struct bloat in mind, we have:

  1. rhashtable option, no changes to task_struct.

  2. add the breakpoint mutex + list to task_struct.

  3. add something like hw_breakpoint_task_data* and allocate lazily.

  4. (your proposal) move all of perf data into a new struct (+add
hw_breakpoint things in there) that is lazily allocated.

I don't think perf is that infrequently used, and I can't estimate
performance impact, so I don't like #4 too much personally. My
preferred compromise would be #3, but at the same time I'd rather not
bloat task_struct even with 8 extra infrequently used bytes. Am I too
paranoid?

Preferences?


* Re: [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent
  2022-06-09 15:00   ` Dmitry Vyukov
@ 2022-06-10  8:25     ` Marco Elver
  2022-06-10  9:13       ` Dmitry Vyukov
  0 siblings, 1 reply; 34+ messages in thread
From: Marco Elver @ 2022-06-10  8:25 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 17:00, 'Dmitry Vyukov' via kasan-dev
<kasan-dev@googlegroups.com> wrote:
>
> On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
> >
> > Running the perf benchmark with (note: more aggressive parameters vs.
> > preceding changes, but same host with 256 CPUs):
> >
> >  | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> >  | # Running 'breakpoint/thread' benchmark:
> >  | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
> >  |      Total time: 1.953 [sec]
> >  |
> >  |       38.146289 usecs/op
> >  |     4882.725000 usecs/op/cpu
> >
> >     16.29%  [kernel]       [k] rhashtable_jhash2
> >     16.19%  [kernel]       [k] osq_lock
> >     14.22%  [kernel]       [k] queued_spin_lock_slowpath
> >      8.58%  [kernel]       [k] task_bp_pinned
> >      8.30%  [kernel]       [k] mutex_spin_on_owner
> >      4.03%  [kernel]       [k] smp_cfm_core_cond
> >      2.97%  [kernel]       [k] toggle_bp_slot
> >      2.94%  [kernel]       [k] bcmp
> >
> > We can see that a majority of the time is now spent hashing task
> > pointers to index into task_bps_ht in task_bp_pinned().
> >
> > However, if task_bp_pinned()'s computation is independent of any CPU,
> > i.e. always `iter->cpu < 0`, the result for each invocation will be
> > identical. With increasing CPU-count, this problem worsens.
> >
> > Instead, identify if every call to task_bp_pinned() is CPU-independent,
> > and cache the result. Use the cached result instead of a call to
> > task_bp_pinned(), now __task_bp_pinned(), with task_bp_pinned() deciding
> > if the cached result can be used.
> >
> > After this optimization:
> >
> >     21.96%  [kernel]       [k] queued_spin_lock_slowpath
> >     16.39%  [kernel]       [k] osq_lock
> >      9.82%  [kernel]       [k] toggle_bp_slot
> >      9.81%  [kernel]       [k] find_next_bit
> >      4.93%  [kernel]       [k] mutex_spin_on_owner
> >      4.71%  [kernel]       [k] smp_cfm_core_cond
> >      4.30%  [kernel]       [k] __reserve_bp_slot
> >      2.65%  [kernel]       [k] cpumask_next
> >
> > Showing that the time spent hashing keys has become insignificant.
> >
> > With the given benchmark parameters, however, we see no statistically
> > significant improvement in performance on the test system with 256 CPUs.
> > This is very likely due to the benchmark parameters being too aggressive
> > and contention elsewhere becoming dominant.
> >
> > Indeed, when using the less aggressive parameters from the preceding
> > changes, we now observe:
> >
> >  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> >  | # Running 'breakpoint/thread' benchmark:
> >  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> >  |      Total time: 0.071 [sec]
> >  |
> >  |       37.134896 usecs/op
> >  |     2376.633333 usecs/op/cpu
> >
> > Which is an improvement of 12% compared to without this optimization
> > (baseline is 42 usecs/op). This is now only 5% slower than the
> > theoretical ideal (constraints disabled), and 18% slower than no
> > breakpoints at all.
> >
> > [ While we're here, swap task_bp_pinned()'s bp and cpu arguments to be
> >   more consistent with other functions (which have bp first, before the
> >   cpu argument). ]
>
> There are 3 main cases:
> 1. Per-cpu bp.

Yes, CPU-target breakpoint on just 1 CPU.

> 2. Per-task and per-cpu bp.

Task-target breakpoint but pinned to 1 CPU.

> 3. Per-task bp (on all cpus)

Task-target breakpoint that can run on all CPUs.

> right?
>
> For case 1 we still seem to do lots of unnecessary work in
> fetch_bp_busy_slots() by iterating over all CPUs. We are going to bump
> only the CPU's cpu_pinned, so that's the only CPU we need to
> fetch/check.

It'll just do 1 iteration, because cpumask_of_bp() will return a mask
with just the event's target CPU in it.
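
For reference, cpumask_of_bp() is essentially:

static const struct cpumask *cpumask_of_bp(struct perf_event *bp)
{
        if (bp->cpu >= 0)
                return cpumask_of(bp->cpu);     /* pinned to one CPU */
        return cpu_possible_mask;               /* task can run anywhere */
}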

> For case 2 we also do lots of unnecessary work, again we also need to
> check only 1 CPU (don't need cached_tbp_pinned). Also don't need to do
> atomic_dec/inc on all other CPUs (they dec/inc the same variable).

Same as above, just 1 iteration because cpumask_of_bp() does the right
thing. cached_tbp_pinned may still be used if all existing task
breakpoints are CPU-independent (i.e. cpu==-1; granted, doing
task_bp_pinned() once or twice probably is irrelevant in this case).

> Case 3 is the only one when we need to check all CPUs and
> cached_tbp_pinned may be useful.
> But I wonder if we could instead add a per-task
> has_per_cpu_breakpoints flag. Then if the flag is set, we check all
> CPUs as we do now (don't need cached_tbp_pinned). And if it's not set,
> then we could optimize the code even more by making it O(1) instead of
> O(N).

> Namely, we add global tsk_pinned for tasks that don't have
> per-cpu breakpoints, and we update only that tsk_pinned instead of
> iterating over all CPUs.

That seems reasonable.

> I think this will require adding cpu_pinned as well (similar to
> tsk_pinned but aggregated over all CPUs).

> Then the fast path capacity check can become just:
>
> if (bp->hw.target && !bp->hw.target->has_per_cpu_breakpoints && bp->cpu < 0) {
>   if (max_cpu_bp_pinned(type) + task_bp_pinned(-1 /*cpu*/, bp, type) +
> hw_breakpoint_weight(bp) > nr_slots[type])
>     return -ENOSPC;
> }
>
> Does it make any sense?

Yes, I think this might work. I'll see if I can make it work.


* Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints
  2022-06-09 18:37         ` Marco Elver
@ 2022-06-10  9:04           ` Dmitry Vyukov
  2022-06-10  9:36             ` Marco Elver
  0 siblings, 1 reply; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-10  9:04 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Thu, 9 Jun 2022 at 20:37, Marco Elver <elver@google.com> wrote:
> > On Thu, 9 Jun 2022 at 16:56, Marco Elver <elver@google.com> wrote:
> > > > > On a machine with 256 CPUs, running the recently added perf breakpoint
> > > > > benchmark results in:
> > > > >
> > > > >  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > > > >  | # Running 'breakpoint/thread' benchmark:
> > > > >  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > > > >  |      Total time: 236.418 [sec]
> > > > >  |
> > > > >  |   123134.794271 usecs/op
> > > > >  |  7880626.833333 usecs/op/cpu
> > > > >
> > > > > The benchmark tests inherited breakpoint perf events across many
> > > > > threads.
> > > > >
> > > > > Looking at a perf profile, we can see that the majority of the time is
> > > > > spent in various hw_breakpoint.c functions, which execute within the
> > > > > 'nr_bp_mutex' critical sections which then results in contention on that
> > > > > mutex as well:
> > > > >
> > > > >     37.27%  [kernel]       [k] osq_lock
> > > > >     34.92%  [kernel]       [k] mutex_spin_on_owner
> > > > >     12.15%  [kernel]       [k] toggle_bp_slot
> > > > >     11.90%  [kernel]       [k] __reserve_bp_slot
> > > > >
> > > > > The culprit here is task_bp_pinned(), which has a runtime complexity of
> > > > > O(#tasks) due to storing all task breakpoints in the same list and
> > > > > iterating through that list looking for a matching task. Clearly, this
> > > > > does not scale to thousands of tasks.
> > > > >
> > > > > While one option would be to make task_struct a breakpoint list node,
> > > > > this would only further bloat task_struct for infrequently used data.
> > > >
> > > > task_struct already has:
> > > >
> > > > #ifdef CONFIG_PERF_EVENTS
> > > >   struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
> > > >   struct mutex perf_event_mutex;
> > > >   struct list_head perf_event_list;
> > > > #endif
> > > >
> > > > Wonder if it's possible to use perf_event_mutex instead of the task_sharded_mtx?
> > > > And possibly perf_event_list instead of task_bps_ht? It will contain
> > > > other perf_event types, so we will need to test type as well, but on
> > > > the positive side, we don't need any management of the separate
> > > > container.
> > >
> > > Hmm, yes, I looked at that but then decided against messing the
> > > perf/core internals. The main issue I have with using perf_event_mutex
> > > is that we might interfere with perf/core's locking rules as well as
> > > interfere with other concurrent perf event additions. Using
> > > perf_event_list is very likely a no-go because it requires reworking
> > > perf/core as well.
> > >
> > > I can already hear Peter shouting, but maybe I'm wrong. :-)
> >
> > Let's wait for Peter to shout then :)
> > A significant part of this change is having per-task data w/o having
> > per-task data.
> >
> > The current perf-related data in task_struct is already multiple words
> > and it's also not used in lots of production cases.
> > Maybe we could have something like:
> >
> >   struct perf_task_data* lazily_allocated_perf_data;
> >
> > that's lazily allocated on first use instead of the current
> > perf_event_ctxp/perf_event_mutex/perf_event_list.
> > This way we could both reduce task_size when perf is not used and have
> > more perf-related data (incl breakpoints) when it's used.
>
> I don't mind either option, so keeping task_struct bloat in mind, we have:
>
>   1. rhashtable option, no changes to task_struct.
>
>   2. add the breakpoint mutex + list to task_struct.
>
>   3. add something like hw_breakpoint_task_data* and allocate lazily.
>
>   4. (your proposal) move all of perf data into a new struct (+add
> hw_breakpoint things in there) that is lazily allocated.
>
> I don't think perf is that infrequently used, and I can't estimate
> performance impact, so I don't like #4 too much personally. My
> preferred compromise would be #3, but at the same time I'd rather not
> bloat task_struct even with 8 extra infrequently used bytes. Am I too
> paranoid?
>
> Preferences?


There is also this "could eventually get its own" comment:

static struct pmu perf_breakpoint = {
  .task_ctx_nr = perf_sw_context, /* could eventually get its own */
https://elixir.bootlin.com/linux/v5.19-rc1/source/kernel/events/hw_breakpoint.c#L669

If it gets its own, then it also gets a perf_event_context pointer in
task_struct:
https://elixir.bootlin.com/linux/v5.19-rc1/source/include/linux/sched.h#L1229
And perf_event_context has its own mutex and lots of other stuff.
But I don't know what other implications it has.


* Re: [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent
  2022-06-10  8:25     ` Marco Elver
@ 2022-06-10  9:13       ` Dmitry Vyukov
  0 siblings, 0 replies; 34+ messages in thread
From: Dmitry Vyukov @ 2022-06-10  9:13 UTC (permalink / raw)
  To: Marco Elver
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Fri, 10 Jun 2022 at 10:25, Marco Elver <elver@google.com> wrote:
>
> On Thu, 9 Jun 2022 at 17:00, 'Dmitry Vyukov' via kasan-dev
> <kasan-dev@googlegroups.com> wrote:
> >
> > On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@google.com> wrote:
> > >
> > > Running the perf benchmark with (note: more aggressive parameters vs.
> > > preceding changes, but same host with 256 CPUs):
> > >
> > >  | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> > >  | # Running 'breakpoint/thread' benchmark:
> > >  | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
> > >  |      Total time: 1.953 [sec]
> > >  |
> > >  |       38.146289 usecs/op
> > >  |     4882.725000 usecs/op/cpu
> > >
> > >     16.29%  [kernel]       [k] rhashtable_jhash2
> > >     16.19%  [kernel]       [k] osq_lock
> > >     14.22%  [kernel]       [k] queued_spin_lock_slowpath
> > >      8.58%  [kernel]       [k] task_bp_pinned
> > >      8.30%  [kernel]       [k] mutex_spin_on_owner
> > >      4.03%  [kernel]       [k] smp_cfm_core_cond
> > >      2.97%  [kernel]       [k] toggle_bp_slot
> > >      2.94%  [kernel]       [k] bcmp
> > >
> > > We can see that a majority of the time is now spent hashing task
> > > pointers to index into task_bps_ht in task_bp_pinned().
> > >
> > > However, if task_bp_pinned()'s computation is independent of any CPU,
> > > i.e. always `iter->cpu < 0`, the result for each invocation will be
> > > identical. With increasing CPU-count, this problem worsens.
> > >
> > > Instead, identify if every call to task_bp_pinned() is CPU-independent,
> > > and cache the result. Use the cached result instead of a call to
> > > task_bp_pinned(), now __task_bp_pinned(), with task_bp_pinned() deciding
> > > if the cached result can be used.
> > >
> > > After this optimization:
> > >
> > >     21.96%  [kernel]       [k] queued_spin_lock_slowpath
> > >     16.39%  [kernel]       [k] osq_lock
> > >      9.82%  [kernel]       [k] toggle_bp_slot
> > >      9.81%  [kernel]       [k] find_next_bit
> > >      4.93%  [kernel]       [k] mutex_spin_on_owner
> > >      4.71%  [kernel]       [k] smp_cfm_core_cond
> > >      4.30%  [kernel]       [k] __reserve_bp_slot
> > >      2.65%  [kernel]       [k] cpumask_next
> > >
> > > Showing that the time spent hashing keys has become insignificant.
> > >
> > > With the given benchmark parameters, however, we see no statistically
> > > significant improvement in performance on the test system with 256 CPUs.
> > > This is very likely due to the benchmark parameters being too aggressive
> > > and contention elsewhere becoming dominant.
> > >
> > > Indeed, when using the less aggressive parameters from the preceding
> > > changes, we now observe:
> > >
> > >  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > >  | # Running 'breakpoint/thread' benchmark:
> > >  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > >  |      Total time: 0.071 [sec]
> > >  |
> > >  |       37.134896 usecs/op
> > >  |     2376.633333 usecs/op/cpu
> > >
> > > Which is an improvement of 12% compared to without this optimization
> > > (baseline is 42 usecs/op). This is now only 5% slower than the
> > > theoretical ideal (constraints disabled), and 18% slower than no
> > > breakpoints at all.
> > >
> > > [ While we're here, swap task_bp_pinned()'s bp and cpu arguments to be
> > >   more consistent with other functions (which have bp first, before the
> > >   cpu argument). ]
> >
> > There are 3 main cases:
> > 1. Per-cpu bp.
>
> Yes, CPU-target breakpoint on just 1 CPU.
>
> > 2. Per-task and per-cpu bp.
>
> Task-target breakpoint but pinned to 1 CPU.
>
> > 3. Per-task bp (on all cpus)
>
> Task-target breakpoint that can run on all CPUs.
>
> > right?
> >
> > For case 1 we still seem to do lots of unnecessary work in
> > fetch_bp_busy_slots() by iterating over all CPUs. We are going to bump
> > only the CPU's cpu_pinned, so that's the only CPU we need to
> > fetch/check.
>
> It'll just do 1 iteration, because cpumask_of_bp() will return a mask
> with just the event's target CPU in it.

You are right. I missed the use of cpumask_of_bp().

> > For case 2 we also do lots of unnecessary work, again we also need to
> > check only 1 CPU (don't need cached_tbp_pinned). Also don't need to do
> > atomic_dec/inc on all other CPUs (they dec/inc the same variable).
>
> Same as above, just 1 iteration because cpumask_of_bp() does the right
> thing. cached_tbp_pinned may still be used if all existing task
> breakpoints are CPU-independent (i.e. cpu==-1; granted, doing
> task_bp_pinned() once or twice probably is irrelevant in this case).
>
> > Case 3 is the only one when we need to check all CPUs and
> > cached_tbp_pinned may be useful.
> > But I wonder if we could instead add a per-task
> > has_per_cpu_breakpoints flag. Then if the flag is set, we check all
> > CPUs as we do now (don't need cached_tbp_pinned). And if it's not set,
> > then we could optimize the code even more by making it O(1) instead of
> > O(N).
>
> > Namely, we add global tsk_pinned for tasks that don't have
> > per-cpu breakpoints, and we update only that tsk_pinned instead of
> > iterating over all CPUs.
>
> That seems reasonable.
>
> > I think this will require adding cpu_pinned as well (similar to
> > tsk_pinned but aggregated over all CPUs).
>
> > Then the fast path capacity check can become just:
> >
> > if (bp->hw.target && !bp->hw.target->has_per_cpu_breakpoints && bp->cpu < 0) {
> >   if (max_cpu_bp_pinned(type) + task_bp_pinned(-1 /*cpu*/, bp, type) +
> > hw_breakpoint_weight(bp) > nr_slots[type])
> >     return -ENOSPC;
> > }
> >
> > Does it make any sense?
>
> Yes, I think this might work. I'll see if I can make it work.

Actually!
This is somewhat orthogonal to the optimizations you are doing, but
the most interesting case for us is inherited events. And it seems
that an inherited event can't possibly overflow the capacity.
Inherited events are a subset of the parent's events, all parent events
have already passed validation, and the child can't have its own new
events at the time the inherited events are created.
So couldn't we somehow detect that reserve_bp_slot() is called from
inherit_event() and skip fetch_bp_busy_slots() altogether? Maybe that
can be detected by looking at bp->attr.inherit and presence of parent
context? Capacity validation may be kept as a debug-only check.
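
E.g. something like this hypothetical helper (sketch only):

static bool is_inherited_bp(struct perf_event *bp)
{
        /*
         * An inherited counter mirrors a parent event that already passed
         * constraints validation, so it cannot add new capacity on its own.
         */
        return bp->attr.inherit && bp->parent_event;
}

which reserve_bp_slot() could then use to skip the busy-slots computation.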


* Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints
  2022-06-10  9:04           ` Dmitry Vyukov
@ 2022-06-10  9:36             ` Marco Elver
  0 siblings, 0 replies; 34+ messages in thread
From: Marco Elver @ 2022-06-10  9:36 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	x86, linux-sh, kasan-dev, linux-kernel

On Fri, 10 Jun 2022 at 11:04, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Thu, 9 Jun 2022 at 20:37, Marco Elver <elver@google.com> wrote:
> > > On Thu, 9 Jun 2022 at 16:56, Marco Elver <elver@google.com> wrote:
> > > > > > On a machine with 256 CPUs, running the recently added perf breakpoint
> > > > > > benchmark results in:
> > > > > >
> > > > > >  | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > > > > >  | # Running 'breakpoint/thread' benchmark:
> > > > > >  | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > > > > >  |      Total time: 236.418 [sec]
> > > > > >  |
> > > > > >  |   123134.794271 usecs/op
> > > > > >  |  7880626.833333 usecs/op/cpu
> > > > > >
> > > > > > The benchmark tests inherited breakpoint perf events across many
> > > > > > threads.
> > > > > >
> > > > > > Looking at a perf profile, we can see that the majority of the time is
> > > > > > spent in various hw_breakpoint.c functions, which execute within the
> > > > > > 'nr_bp_mutex' critical sections, which then results in contention on that
> > > > > > mutex as well:
> > > > > >
> > > > > >     37.27%  [kernel]       [k] osq_lock
> > > > > >     34.92%  [kernel]       [k] mutex_spin_on_owner
> > > > > >     12.15%  [kernel]       [k] toggle_bp_slot
> > > > > >     11.90%  [kernel]       [k] __reserve_bp_slot
> > > > > >
> > > > > > The culprit here is task_bp_pinned(), which has a runtime complexity of
> > > > > > O(#tasks) due to storing all task breakpoints in the same list and
> > > > > > iterating through that list looking for a matching task. Clearly, this
> > > > > > does not scale to thousands of tasks.
> > > > > >
> > > > > > While one option would be to make task_struct a breakpoint list node,
> > > > > > this would only further bloat task_struct for infrequently used data.
> > > > >
> > > > > task_struct already has:
> > > > >
> > > > > #ifdef CONFIG_PERF_EVENTS
> > > > >   struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
> > > > >   struct mutex perf_event_mutex;
> > > > >   struct list_head perf_event_list;
> > > > > #endif
> > > > >
> > > > > Wonder if it's possible to use perf_event_mutex instead of the task_sharded_mtx?
> > > > > And possibly perf_event_list instead of task_bps_ht? It will contain
> > > > > other perf_event types, so we will need to test type as well, but on
> > > > > the positive side, we don't need any management of the separate
> > > > > container.
> > > >
> > > > Hmm, yes, I looked at that but then decided against messing with the
> > > > perf/core internals. The main issue I have with using perf_event_mutex
> > > > is that we might interfere with perf/core's locking rules, as well as
> > > > with other concurrent perf event additions. Using
> > > > perf_event_list is very likely a no-go because it requires reworking
> > > > perf/core as well.
> > > >
> > > > I can already hear Peter shouting, but maybe I'm wrong. :-)
> > >
> > > Let's wait for Peter to shout then :)
> > > A significant part of this change is having per-task data w/o having
> > > per-task data.
> > >
> > > The current perf-related data in task_struct is already multiple words
> > > and it's also not used in lots of production cases.
> > > Maybe we could have something like:
> > >
> > >   struct perf_task_data* lazily_allocated_perf_data;
> > >
> > > that's lazily allocated on first use instead of the current
> > > perf_event_ctxp/perf_event_mutex/perf_event_list.
> > > This way we could both reduce task_struct's size when perf is not used
> > > and have more perf-related data (incl. breakpoints) when it is used.
> >
> > I don't mind either option, so keeping task_struct bloat in mind, we have:
> >
> >   1. rhashtable option, no changes to task_struct.
> >
> >   2. add the breakpoint mutex + list to task_struct.
> >
> >   3. add something like hw_breakpoint_task_data* and allocate lazily.
> >
> >   4. (your proposal) move all of perf data into a new struct (+add
> > hw_breakpoint things in there) that is lazily allocated.
> >
> > I don't think perf is that infrequently used, and I can't estimate
> > performance impact, so I don't like #4 too much personally. My
> > preferred compromise would be #3, but at the same time I'd rather not
> > bloat task_struct even with 8 extra infrequently used bytes. Am I too
> > paranoid?
> >
> > Preferences?
>
>
> There is also this "could eventually get its own" comment:
>
> static struct pmu perf_breakpoint = {
>   .task_ctx_nr = perf_sw_context, /* could eventually get its own */
> https://elixir.bootlin.com/linux/v5.19-rc1/source/kernel/events/hw_breakpoint.c#L669
>
> If it gets its own, then it also gets a perf_event_context pointer in
> task_struct:
> https://elixir.bootlin.com/linux/v5.19-rc1/source/include/linux/sched.h#L1229
> And perf_event_context has its own mutex and lots of other stuff.
> But I don't know what other implications it has.

Relying on perf events to be the only way that instantiates
breakpoints does not work, because hw_breakpoint is also used by
ptrace independently.

On the whole, adding lazily allocated data to task_struct is not as
simple as the rhashtable option (we'd need to take care of fork and exit
and make sure the lazily allocated data lives long enough, etc.). I
question the added complexity vs. the benefit, given that using the
rhashtable avoids all that. Once I get rid of the O(#cpu) loops, the
rhashtable lookup also no longer shows up in profiles, so any further
effort to optimize here would not buy us much in terms of performance.
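
(For reference, a rough sketch of the rhashtable option under discussion:
an rhltable keyed by the breakpoint's target task, so task_bp_pinned()
only walks that task's breakpoints instead of a global list. Field and
helper names here are illustrative and may differ from the posted patch;
it assumes hw_perf_event::bp_list becomes a struct rhlist_head.)

  #include <linux/rhashtable.h>

  /* All task-target breakpoints, keyed by target task; duplicates allowed. */
  static struct rhltable task_bps_ht;
  static const struct rhashtable_params task_bps_ht_params = {
          .head_offset = offsetof(struct hw_perf_event, bp_list),
          .key_offset  = offsetof(struct hw_perf_event, target),
          .key_len     = sizeof_field(struct hw_perf_event, target),
          .automatic_shrinking = true,
  };

  /* O(#breakpoints of this task) instead of O(#tasks). */
  static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
  {
          struct rhlist_head *head, *pos;
          struct perf_event *iter;
          int count = 0;

          rcu_read_lock();
          head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
          if (!head)
                  goto out;
          rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
                  if (find_slot_idx(iter->attr.bp_type) == type &&
                      (iter->cpu < 0 || cpu == iter->cpu))
                          count += hw_breakpoint_weight(iter);
          }
  out:
          rcu_read_unlock();
          return count;
  }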

If the main issue is the mutex, I suppose we can find a hole in
task_struct and stick it there (there's a massive 32-byte hole above
task_struct::stats).

Was the mutex the only benefit?


Thread overview: 34+ messages
2022-06-09 11:30 [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Marco Elver
2022-06-09 11:30 ` [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints Marco Elver
2022-06-09 12:30   ` Dmitry Vyukov
2022-06-09 12:53     ` Marco Elver
2022-06-09 13:05       ` Dmitry Vyukov
2022-06-09 14:29   ` Dmitry Vyukov
2022-06-09 14:55     ` Marco Elver
2022-06-09 16:53       ` Dmitry Vyukov
2022-06-09 18:37         ` Marco Elver
2022-06-10  9:04           ` Dmitry Vyukov
2022-06-10  9:36             ` Marco Elver
2022-06-09 11:30 ` [PATCH 2/8] perf/hw_breakpoint: Mark data __ro_after_init Marco Elver
2022-06-09 11:45   ` Dmitry Vyukov
2022-06-09 11:30 ` [PATCH 3/8] perf/hw_breakpoint: Optimize constant number of breakpoint slots Marco Elver
2022-06-09 11:55   ` Dmitry Vyukov
2022-06-09 11:30 ` [PATCH 4/8] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable Marco Elver
2022-06-09 12:03   ` Dmitry Vyukov
2022-06-09 12:08     ` Marco Elver
2022-06-09 12:23       ` Dmitry Vyukov
2022-06-09 13:25     ` Peter Zijlstra
2022-06-09 11:30 ` [PATCH 5/8] perf/hw_breakpoint: Remove useless code related to flexible breakpoints Marco Elver
2022-06-09 12:04   ` Dmitry Vyukov
2022-06-09 13:41     ` Dmitry Vyukov
2022-06-09 14:00       ` Marco Elver
2022-06-09 11:30 ` [PATCH 6/8] perf/hw_breakpoint: Reduce contention with large number of tasks Marco Elver
2022-06-09 13:03   ` Dmitry Vyukov
2022-06-09 13:29     ` Marco Elver
2022-06-09 11:30 ` [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent Marco Elver
2022-06-09 15:00   ` Dmitry Vyukov
2022-06-10  8:25     ` Marco Elver
2022-06-10  9:13       ` Dmitry Vyukov
2022-06-09 11:30 ` [PATCH 8/8] perf/hw_breakpoint: Clean up headers Marco Elver
2022-06-09 12:11   ` Dmitry Vyukov
2022-06-09 12:28 ` [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks Dmitry Vyukov
