* [PATCH v3 00/10] Optimize cgroup context switch
@ 2019-11-14  0:30 Ian Rogers
  2019-11-14  0:30 ` [PATCH v3 01/10] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
                   ` (12 more replies)
  0 siblings, 13 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-14  0:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Avoid iterating over all per-CPU events during cgroup-changing context
switches by organizing events by cgroup.

To make an efficient set of iterators, introduce a min-max heap
utility along with a test.
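
For illustration, draining sorted values through the new API looks
roughly like the sketch below. This is not part of the series: the
#includes, example() and the printed output are only for this example,
and less_than()/swap_ints() mirror the callbacks used in the new test
module.

  #include <linux/kernel.h>
  #include <linux/min_max_heap.h>

  /* Min-heap callbacks over plain ints. */
  static bool less_than(const void *lhs, const void *rhs)
  {
          return *(const int *)lhs < *(const int *)rhs;
  }

  static void swap_ints(void *lhs, void *rhs)
  {
          swap(*(int *)lhs, *(int *)rhs);
  }

  static const struct min_max_heap_callbacks funcs = {
          .elem_size = sizeof(int),
          .cmp = less_than,
          .swp = swap_ints,
  };

  /* Print the n values in ascending order, emptying the heap. */
  static void example(int *values, int n)
  {
          struct min_max_heap heap = {
                  .data = values,
                  .size = n,
                  .cap = n,
          };

          heapify_all(&heap, &funcs);               /* O(n) Floyd build */
          while (heap.size) {
                  pr_info("min = %d\n", values[0]); /* root is the minimum */
                  heap_pop(&heap, &funcs);
          }
  }

visit_groups_merge uses the same pattern with struct perf_event *
elements ordered by group_index, advancing each iterator with
heap_pop_push() rather than a pop followed by a push.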

These patches include a caching algorithm from Kan Liang
<kan.liang@linux.intel.com> to improve the search for the first event
in a group, as well as a rebase of his "optimize event_filter_match
during sched_in" patch from https://lkml.org/lkml/2019/8/7/771.

The v2 patch set was modified by Peter Zijlstra in his perf/cgroup
branch:
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git

These patches follow Peter's reorganization and his fixes to the
perf_cpu_context min_heap storage code.

Ian Rogers (8):
  lib: introduce generic min max heap
  perf: Use min_max_heap in visit_groups_merge
  perf: Add per perf_cpu_context min_heap storage
  perf/cgroup: Grow per perf_cpu_context heap storage
  perf/cgroup: Order events in RB tree by cgroup id
  perf: simplify and rename visit_groups_merge
  perf: cache perf_event_groups_first for cgroups
  perf: optimize event_filter_match during sched_in

Kan Liang (1):
  perf/cgroup: Do not switch system-wide events in cgroup switch

Peter Zijlstra (1):
  perf/cgroup: Reorder perf_cgroup_connect()

 include/linux/min_max_heap.h | 134 +++++++++
 include/linux/perf_event.h   |  14 +
 kernel/events/core.c         | 512 ++++++++++++++++++++++++++++-------
 lib/Kconfig.debug            |  10 +
 lib/Makefile                 |   1 +
 lib/test_min_max_heap.c      | 194 +++++++++++++
 6 files changed, 769 insertions(+), 96 deletions(-)
 create mode 100644 include/linux/min_max_heap.h
 create mode 100644 lib/test_min_max_heap.c

-- 
2.24.0.432.g9d3f5f5b63-goog


* [PATCH v3 01/10] perf/cgroup: Reorder perf_cgroup_connect()
  2019-11-14  0:30 [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
@ 2019-11-14  0:30 ` Ian Rogers
  2019-11-14  8:50   ` Peter Zijlstra
  2019-11-14  0:30 ` [PATCH v3 02/10] lib: introduce generic min max heap Ian Rogers
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 80+ messages in thread
From: Ian Rogers @ 2019-11-14  0:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

From: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index cfd89b4a02d8..0dce28b0aae0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10597,12 +10597,6 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (!has_branch_stack(event))
 		event->attr.branch_sample_type = 0;
 
-	if (cgroup_fd != -1) {
-		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
-		if (err)
-			goto err_ns;
-	}
-
 	pmu = perf_init_event(event);
 	if (IS_ERR(pmu)) {
 		err = PTR_ERR(pmu);
@@ -10615,6 +10609,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		goto err_pmu;
 	}
 
+	if (cgroup_fd != -1) {
+		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
+		if (err)
+			goto err_pmu;
+	}
+
 	err = exclusive_event_init(event);
 	if (err)
 		goto err_pmu;
@@ -10675,12 +10675,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	exclusive_event_destroy(event);
 
 err_pmu:
+	if (is_cgroup_event(event))
+		perf_detach_cgroup(event);
 	if (event->destroy)
 		event->destroy(event);
 	module_put(pmu->module);
 err_ns:
-	if (is_cgroup_event(event))
-		perf_detach_cgroup(event);
 	if (event->ns)
 		put_pid_ns(event->ns);
 	if (event->hw.target)
-- 
2.24.0.432.g9d3f5f5b63-goog


* [PATCH v3 02/10] lib: introduce generic min max heap
  2019-11-14  0:30 [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
  2019-11-14  0:30 ` [PATCH v3 01/10] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
@ 2019-11-14  0:30 ` Ian Rogers
  2019-11-14  9:32   ` Peter Zijlstra
                     ` (2 more replies)
  2019-11-14  0:30 ` [PATCH v3 03/10] perf: Use min_max_heap in visit_groups_merge Ian Rogers
                   ` (10 subsequent siblings)
  12 siblings, 3 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-14  0:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 include/linux/min_max_heap.h | 134 ++++++++++++++++++++++++
 lib/Kconfig.debug            |  10 ++
 lib/Makefile                 |   1 +
 lib/test_min_max_heap.c      | 194 +++++++++++++++++++++++++++++++++++
 4 files changed, 339 insertions(+)
 create mode 100644 include/linux/min_max_heap.h
 create mode 100644 lib/test_min_max_heap.c

diff --git a/include/linux/min_max_heap.h b/include/linux/min_max_heap.h
new file mode 100644
index 000000000000..ea7764a8252a
--- /dev/null
+++ b/include/linux/min_max_heap.h
@@ -0,0 +1,134 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MIN_MAX_HEAP_H
+#define _LINUX_MIN_MAX_HEAP_H
+
+#include <linux/bug.h>
+#include <linux/string.h>
+
+/*
+ * Data structure used to hold a min or max heap; the number of elements
+ * varies but the maximum size is fixed.
+ */
+struct min_max_heap {
+	/* Start of array holding the heap elements. */
+	void *data;
+	/* Number of elements currently in the heap. */
+	int size;
+	/* Maximum number of elements that can be held in current storage. */
+	int cap;
+};
+
+struct min_max_heap_callbacks {
+	/* Size of elements in the heap. */
+	int elem_size;
+	/*
+	 * A function which returns *lhs < *rhs or *lhs > *rhs depending on
+	 * whether this is a min or a max heap. Note, another compare function
+	 * style in the kernel will return -ve, 0 and +ve and won't handle
+	 * minimum integer correctly if implemented as a subtract.
+	 */
+	bool (*cmp)(const void *lhs, const void *rhs);
+	/* Swap the element values at lhs with those at rhs. */
+	void (*swp)(void *lhs, void *rhs);
+};
+
+/* Sift the element at pos down the heap. */
+static inline void heapify(struct min_max_heap *heap, int pos,
+			const struct min_max_heap_callbacks *func) {
+	void *left_child, *right_child, *parent, *large_or_smallest;
+	char *data = (char *)heap->data;
+
+	for (;;) {
+		if (pos * 2 + 1 >= heap->size)
+			break;
+
+		left_child = data + ((pos * 2 + 1) * func->elem_size);
+		parent = data + (pos * func->elem_size);
+		large_or_smallest = parent;
+		if (func->cmp(left_child, large_or_smallest))
+			large_or_smallest = left_child;
+
+		if (pos * 2 + 2 < heap->size) {
+			right_child = data + ((pos * 2 + 2) * func->elem_size);
+			if (func->cmp(right_child, large_or_smallest))
+				large_or_smallest = right_child;
+		}
+		if (large_or_smallest == parent)
+			break;
+		func->swp(large_or_smallest, parent);
+		if (large_or_smallest == left_child)
+			pos = (pos * 2) + 1;
+		else
+			pos = (pos * 2) + 2;
+	}
+}
+
+/* Floyd's approach to heapification that is O(size). */
+static inline void
+heapify_all(struct min_max_heap *heap,
+	const struct min_max_heap_callbacks *func)
+{
+	int i;
+
+	for (i = heap->size / 2; i >= 0; i--)
+		heapify(heap, i, func);
+}
+
+/* Remove minimum element from the heap, O(log2(size)). */
+static inline void
+heap_pop(struct min_max_heap *heap, const struct min_max_heap_callbacks *func)
+{
+	char *data = (char *)heap->data;
+
+	if (WARN_ONCE(heap->size <= 0, "Popping an empty heap"))
+		return;
+
+	/* Place last element at the root (position 0) and then sift down. */
+	heap->size--;
+	memcpy(data, data + (heap->size * func->elem_size), func->elem_size);
+	heapify(heap, 0, func);
+}
+
+/*
+ * Remove the minimum element and then push the given element. The
+ * implementation performs 1 sift (O(log2(size))) and is therefore more
+ * efficient than a pop followed by a push that does 2.
+ */
+static inline void heap_pop_push(struct min_max_heap *heap,
+			const void *element,
+			const struct min_max_heap_callbacks *func)
+{
+	char *data = (char *)heap->data;
+
+	memcpy(data, element, func->elem_size);
+	heapify(heap, 0, func);
+}
+
+/* Push an element on to the heap, O(log2(size)). */
+static inline void
+heap_push(struct min_max_heap *heap, const void *element,
+	const struct min_max_heap_callbacks *func)
+{
+	void *child, *parent;
+	int pos;
+	char *data = (char *)heap->data;
+
+	if (WARN_ONCE(heap->size >= heap->cap, "Pushing on a full heap"))
+		return;
+
+	/* Place at the end of data. */
+	pos = heap->size;
+	memcpy(data + (pos * func->elem_size), element, func->elem_size);
+	heap->size++;
+
+	/* Sift up. */
+	for (; pos > 0; pos = (pos - 1) / 2) {
+		child = data + (pos * func->elem_size);
+		parent = data + ((pos - 1) / 2) * func->elem_size;
+		if (func->cmp(parent, child))
+			break;
+		func->swp(parent, child);
+	}
+}
+
+#endif /* _LINUX_MIN_MAX_HEAP_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 93d97f9b0157..6a2cf82515eb 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1693,6 +1693,16 @@ config TEST_LIST_SORT
 
 	  If unsure, say N.
 
+config TEST_MIN_MAX_HEAP
+	tristate "Min-max heap test"
+	depends on DEBUG_KERNEL || m
+	help
+	  Enable this to turn on min-max heap function tests. This test is
+	  executed only once during system boot (so affects only boot time),
+	  or at module load time.
+
+	  If unsure, say N.
+
 config TEST_SORT
 	tristate "Array-based sort test"
 	depends on DEBUG_KERNEL || m
diff --git a/lib/Makefile b/lib/Makefile
index c5892807e06f..e73df06adaab 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -67,6 +67,7 @@ CFLAGS_test_ubsan.o += $(call cc-disable-warning, vla)
 UBSAN_SANITIZE_test_ubsan.o := y
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
 obj-$(CONFIG_TEST_LIST_SORT) += test_list_sort.o
+obj-$(CONFIG_TEST_MIN_MAX_HEAP) += test_min_max_heap.o
 obj-$(CONFIG_TEST_LKM) += test_module.o
 obj-$(CONFIG_TEST_VMALLOC) += test_vmalloc.o
 obj-$(CONFIG_TEST_OVERFLOW) += test_overflow.o
diff --git a/lib/test_min_max_heap.c b/lib/test_min_max_heap.c
new file mode 100644
index 000000000000..72c756d96e5e
--- /dev/null
+++ b/lib/test_min_max_heap.c
@@ -0,0 +1,194 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define pr_fmt(fmt) "min_max_heap_test: " fmt
+
+/*
+ * Test cases for the min max heap.
+ */
+
+#include <linux/log2.h>
+#include <linux/min_max_heap.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/random.h>
+
+static __init bool less_than(const void *lhs, const void *rhs)
+{
+	return *(int *)lhs < *(int *)rhs;
+}
+
+static __init bool greater_than(const void *lhs, const void *rhs)
+{
+	return *(int *)lhs > *(int *)rhs;
+}
+
+static __init void swap_ints(void *lhs, void *rhs)
+{
+	int temp = *(int *)lhs;
+
+	*(int *)lhs = *(int *)rhs;
+	*(int *)rhs = temp;
+}
+
+static __init int pop_verify_heap(bool min_heap,
+				struct min_max_heap *heap,
+				const struct min_max_heap_callbacks *funcs)
+{
+	int last;
+	int *values = (int *)heap->data;
+	int err = 0;
+
+	last = values[0];
+	heap_pop(heap, funcs);
+	while (heap->size > 0) {
+		if (min_heap) {
+			if (last > values[0]) {
+				pr_err("error: expected %d <= %d\n", last,
+					values[0]);
+				err++;
+			}
+		} else {
+			if (last < values[0]) {
+				pr_err("error: expected %d >= %d\n", last,
+					values[0]);
+				err++;
+			}
+		}
+		last = values[0];
+		heap_pop(heap, funcs);
+	}
+	return err;
+}
+
+static __init int test_heapify_all(bool min_heap)
+{
+	int values[] = { 3, 1, 2, 4, 0x8000000, 0x7FFFFFF, 0,
+			 -3, -1, -2, -4, 0x8000000, 0x7FFFFFF };
+	struct min_max_heap heap = {
+		.data = values,
+		.size = ARRAY_SIZE(values),
+		.cap =  ARRAY_SIZE(values),
+	};
+	struct min_max_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.cmp = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, err;
+
+	/* Test with known set of values. */
+	heapify_all(&heap, &funcs);
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+
+	/* Test with randomly generated values. */
+	heap.size = ARRAY_SIZE(values);
+	for (i = 0; i < heap.size; i++)
+		values[i] = get_random_int();
+
+	heapify_all(&heap, &funcs);
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static __init int test_heap_push(bool min_heap)
+{
+	const int data[] = { 3, 1, 2, 4, 0x80000000, 0x7FFFFFFF, 0,
+			     -3, -1, -2, -4, 0x80000000, 0x7FFFFFFF };
+	int values[ARRAY_SIZE(data)];
+	struct min_max_heap heap = {
+		.data = values,
+		.size = 0,
+		.cap =  ARRAY_SIZE(values),
+	};
+	struct min_max_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.cmp = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, temp, err;
+
+	/* Test with known set of values copied from data. */
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		heap_push(&heap, &data[i], &funcs);
+
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+	/* Test with randomly generated values. */
+	while (heap.size < heap.cap) {
+		temp = get_random_int();
+		heap_push(&heap, &temp, &funcs);
+	}
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static __init int test_heap_pop_push(bool min_heap)
+{
+	const int data[] = { 3, 1, 2, 4, 0x80000000, 0x7FFFFFFF, 0,
+			     -3, -1, -2, -4, 0x80000000, 0x7FFFFFFF };
+	int values[ARRAY_SIZE(data)];
+	struct min_max_heap heap = {
+		.data = values,
+		.size = 0,
+		.cap =  ARRAY_SIZE(values),
+	};
+	struct min_max_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.cmp = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, temp, err;
+
+	/* Fill values with data to pop and replace. */
+	temp = min_heap ? 0x80000000 : 0x7FFFFFFF;
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		heap_push(&heap, &temp, &funcs);
+
+	/* Test with known set of values copied from data. */
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		heap_pop_push(&heap, &data[i], &funcs);
+
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+	heap.size = 0;
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		heap_push(&heap, &temp, &funcs);
+
+	/* Test with randomly generated values. */
+	for (i = 0; i < ARRAY_SIZE(data); i++) {
+		temp = get_random_int();
+		heap_pop_push(&heap, &temp, &funcs);
+	}
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static int __init test_min_max_heap_init(void)
+{
+	int err = 0;
+
+	err += test_heapify_all(true);
+	err += test_heapify_all(false);
+	err += test_heap_push(true);
+	err += test_heap_push(false);
+	err += test_heap_pop_push(true);
+	err += test_heap_pop_push(false);
+	if (err) {
+		pr_err("test failed with %d errors\n", err);
+		return -EINVAL;
+	}
+	pr_info("test passed\n");
+	return 0;
+}
+module_init(test_min_max_heap_init);
+
+static void __exit test_min_max_heap_exit(void)
+{
+	/* do nothing */
+}
+module_exit(test_min_max_heap_exit);
+
+MODULE_LICENSE("GPL");
-- 
2.24.0.432.g9d3f5f5b63-goog


* [PATCH v3 03/10] perf: Use min_max_heap in visit_groups_merge
  2019-11-14  0:30 [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
  2019-11-14  0:30 ` [PATCH v3 01/10] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
  2019-11-14  0:30 ` [PATCH v3 02/10] lib: introduce generic min max heap Ian Rogers
@ 2019-11-14  0:30 ` Ian Rogers
  2019-11-14  9:39   ` Peter Zijlstra
  2019-11-14  0:30 ` [PATCH v3 04/10] perf: Add per perf_cpu_context min_heap storage Ian Rogers
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 80+ messages in thread
From: Ian Rogers @ 2019-11-14  0:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 67 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 50 insertions(+), 17 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0dce28b0aae0..a5a3d349a8f1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,6 +49,7 @@
 #include <linux/sched/mm.h>
 #include <linux/proc_ns.h>
 #include <linux/mount.h>
+#include <linux/min_max_heap.h>
 
 #include "internal.h"
 
@@ -3372,32 +3373,64 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
 	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
 }
 
-static int visit_groups_merge(struct perf_event_groups *groups, int cpu,
+static bool perf_cmp_group_idx(const void *l, const void *r)
+{
+	const struct perf_event *le = l, *re = r;
+
+	return le->group_index < re->group_index;
+}
+
+static void swap_ptr(void *l, void *r)
+{
+	void **lp = l, **rp = r;
+
+	swap(*lp, *rp);
+}
+
+static const struct min_max_heap_callbacks perf_min_heap = {
+	.elem_size = sizeof(struct perf_event *),
+	.cmp = perf_cmp_group_idx,
+	.swp = swap_ptr,
+};
+
+static void __heap_add(struct min_max_heap *heap, struct perf_event *event)
+{
+	struct perf_event **itrs = heap->data;
+
+	if (event) {
+		itrs[heap->size] = event;
+		heap->size++;
+	}
+}
+
+static noinline int visit_groups_merge(struct perf_event_groups *groups, int cpu,
 			      int (*func)(struct perf_event *, void *), void *data)
 {
-	struct perf_event **evt, *evt1, *evt2;
+	/* Space for per CPU and/or any CPU event iterators. */
+	struct perf_event *itrs[2];
+	struct min_max_heap event_heap = {
+		.data = itrs,
+		.size = 0,
+		.cap = ARRAY_SIZE(itrs),
+	};
+	struct perf_event *next;
 	int ret;
 
-	evt1 = perf_event_groups_first(groups, -1);
-	evt2 = perf_event_groups_first(groups, cpu);
+	__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));
 
-	while (evt1 || evt2) {
-		if (evt1 && evt2) {
-			if (evt1->group_index < evt2->group_index)
-				evt = &evt1;
-			else
-				evt = &evt2;
-		} else if (evt1) {
-			evt = &evt1;
-		} else {
-			evt = &evt2;
-		}
+	heapify_all(&event_heap, &perf_min_heap);
 
-		ret = func(*evt, data);
+	while (event_heap.size) {
+		ret = func(itrs[0], data);
 		if (ret)
 			return ret;
 
-		*evt = perf_event_groups_next(*evt);
+		next = perf_event_groups_next(itrs[0]);
+		if (next)
+			heap_pop_push(&event_heap, &next, &perf_min_heap);
+		else
+			heap_pop(&event_heap, &perf_min_heap);
 	}
 
 	return 0;
-- 
2.24.0.432.g9d3f5f5b63-goog


* [PATCH v3 04/10] perf: Add per perf_cpu_context min_heap storage
  2019-11-14  0:30 [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
                   ` (2 preceding siblings ...)
  2019-11-14  0:30 ` [PATCH v3 03/10] perf: Use min_max_heap in visit_groups_merge Ian Rogers
@ 2019-11-14  0:30 ` Ian Rogers
  2019-11-14  9:51   ` Peter Zijlstra
  2019-11-14  0:30 ` [PATCH v3 05/10] perf/cgroup: Grow per perf_cpu_context heap storage Ian Rogers
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 80+ messages in thread
From: Ian Rogers @ 2019-11-14  0:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 include/linux/perf_event.h |  7 ++++++
 kernel/events/core.c       | 49 ++++++++++++++++++++++++++++----------
 2 files changed, 44 insertions(+), 12 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 011dcbdbccc2..b3580afbf358 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -835,6 +835,13 @@ struct perf_cpu_context {
 	int				sched_cb_usage;
 
 	int				online;
+	/*
+	 * Per-CPU storage for iterators used in visit_groups_merge. The default
+	 * storage is of size 2 to hold the per-CPU and any-CPU event iterators.
+	 */
+	int				itr_storage_cap;
+	struct perf_event		**itr_storage;
+	struct perf_event		*itr_default[2];
 };
 
 struct perf_output_handle {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a5a3d349a8f1..0dab60bf5935 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3403,30 +3403,46 @@ static void __heap_add(struct min_max_heap *heap, struct perf_event *event)
 	}
 }
 
-static noinline int visit_groups_merge(struct perf_event_groups *groups, int cpu,
-			      int (*func)(struct perf_event *, void *), void *data)
+static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
+				struct perf_event_groups *groups, int cpu,
+				int (*func)(struct perf_event *, void *),
+				void *data)
 {
 	/* Space for per CPU and/or any CPU event iterators. */
 	struct perf_event *itrs[2];
-	struct min_max_heap event_heap = {
-		.data = itrs,
-		.size = 0,
-		.cap = ARRAY_SIZE(itrs),
-	};
+	struct min_max_heap event_heap;
+	struct perf_event **evt;
 	struct perf_event *next;
 	int ret;
 
-	__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+	if (cpuctx) {
+		event_heap = (struct min_max_heap){
+			.data = cpuctx->itr_storage,
+			.size = 0,
+			.cap = cpuctx->itr_storage_cap,
+		};
+	} else {
+		event_heap = (struct min_max_heap){
+			.data = itrs,
+			.size = 0,
+			.cap = ARRAY_SIZE(itrs),
+		};
+		/* Events not within a CPU context may be on any CPU. */
+		__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+
+	}
+	evt = event_heap.data;
+
 	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));
 
 	heapify_all(&event_heap, &perf_min_heap);
 
 	while (event_heap.size) {
-		ret = func(itrs[0], data);
+		ret = func(*evt, data);
 		if (ret)
 			return ret;
 
-		next = perf_event_groups_next(itrs[0]);
+		next = perf_event_groups_next(*evt);
 		if (next)
 			heap_pop_push(&event_heap, &next, &perf_min_heap);
 		else
@@ -3500,7 +3516,10 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
 		.can_add_hw = 1,
 	};
 
-	visit_groups_merge(&ctx->pinned_groups,
+	if (ctx != &cpuctx->ctx)
+		cpuctx = NULL;
+
+	visit_groups_merge(cpuctx, &ctx->pinned_groups,
 			   smp_processor_id(),
 			   pinned_sched_in, &sid);
 }
@@ -3515,7 +3534,10 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
 		.can_add_hw = 1,
 	};
 
-	visit_groups_merge(&ctx->flexible_groups,
+	if (ctx != &cpuctx->ctx)
+		cpuctx = NULL;
+
+	visit_groups_merge(cpuctx, &ctx->flexible_groups,
 			   smp_processor_id(),
 			   flexible_sched_in, &sid);
 }
@@ -10182,6 +10204,9 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 		cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
 
 		__perf_mux_hrtimer_init(cpuctx, cpu);
+
+		cpuctx->itr_storage_cap = ARRAY_SIZE(cpuctx->itr_default);
+		cpuctx->itr_storage = cpuctx->itr_default;
 	}
 
 got_cpu_context:
-- 
2.24.0.432.g9d3f5f5b63-goog


* [PATCH v3 05/10] perf/cgroup: Grow per perf_cpu_context heap storage
  2019-11-14  0:30 [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
                   ` (3 preceding siblings ...)
  2019-11-14  0:30 ` [PATCH v3 04/10] perf: Add per perf_cpu_context min_heap storage Ian Rogers
@ 2019-11-14  0:30 ` Ian Rogers
  2019-11-14  9:54   ` Peter Zijlstra
  2019-11-14  0:30 ` [PATCH v3 06/10] perf/cgroup: Order events in RB tree by cgroup id Ian Rogers
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 80+ messages in thread
From: Ian Rogers @ 2019-11-14  0:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Allow the per-CPU min heap storage to have sufficient space for per-cgroup
iterators.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0dab60bf5935..3c44be7de44e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -892,6 +892,47 @@ static inline void perf_cgroup_sched_in(struct task_struct *prev,
 	rcu_read_unlock();
 }
 
+static int perf_cgroup_ensure_itr_storage_cap(struct perf_event *event,
+					struct cgroup_subsys_state *css)
+{
+	struct perf_cpu_context *cpuctx;
+	struct perf_event **storage;
+	int cpu, itr_cap, ret = 0;
+
+	/*
+	 * Allow storage to have sufficient space for an iterator for each
+	 * possibly nested cgroup plus an iterator for events with no cgroup.
+	 */
+	for (itr_cap = 1; css; css = css->parent)
+		itr_cap++;
+
+	for_each_possible_cpu(cpu) {
+		cpuctx = per_cpu_ptr(event->pmu->pmu_cpu_context, cpu);
+		if (itr_cap <= cpuctx->itr_storage_cap)
+			continue;
+
+		storage = kmalloc_node(itr_cap * sizeof(struct perf_event *),
+				       GFP_KERNEL, cpu_to_node(cpu));
+		if (!storage) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		raw_spin_lock_irq(&cpuctx->ctx.lock);
+		if (cpuctx->itr_storage_cap < itr_cap) {
+			swap(cpuctx->itr_storage, storage);
+			if (storage == cpuctx->itr_default)
+				storage = NULL;
+			cpuctx->itr_storage_cap = itr_cap;
+		}
+		raw_spin_unlock_irq(&cpuctx->ctx.lock);
+
+		kfree(storage);
+	}
+
+	return ret;
+}
+
 static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 				      struct perf_event_attr *attr,
 				      struct perf_event *group_leader)
@@ -911,6 +952,10 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 		goto out;
 	}
 
+	ret = perf_cgroup_ensure_itr_storage_cap(event, css);
+	if (ret)
+		goto out;
+
 	cgrp = container_of(css, struct perf_cgroup, css);
 	event->cgrp = cgrp;
 
@@ -3421,6 +3466,8 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 			.size = 0,
 			.cap = cpuctx->itr_storage_cap,
 		};
+
+		lockdep_assert_held(&cpuctx->ctx.lock);
 	} else {
 		event_heap = (struct min_max_heap){
 			.data = itrs,
-- 
2.24.0.432.g9d3f5f5b63-goog


* [PATCH v3 06/10] perf/cgroup: Order events in RB tree by cgroup id
  2019-11-14  0:30 [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
                   ` (4 preceding siblings ...)
  2019-11-14  0:30 ` [PATCH v3 05/10] perf/cgroup: Grow per perf_cpu_context heap storage Ian Rogers
@ 2019-11-14  0:30 ` Ian Rogers
  2019-11-14  0:30 ` [PATCH v3 07/10] perf: simplify and rename visit_groups_merge Ian Rogers
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-14  0:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

If one is monitoring 6 events on 20 cgroups the per-CPU RB tree will
hold 120 events. The scheduling in of the events currently iterates
over all events looking to see which events match the task's cgroup or
its cgroup hierarchy. If a task is in 1 cgroup with 6 events, then 114
events are considered unnecessarily.

This change orders events in the RB tree by cgroup id if it is present.
This means scheduling in may go directly to events associated with the
task's cgroup if one is present. The per-CPU iterator storage in
visit_groups_merge is sized sufficiently for an iterator per cgroup depth,
where different iterators are needed for the task's cgroup and parent
cgroups. By considering the set of iterators when visiting, the lowest
group_index event may be selected and the insertion order group_index
property is maintained. This also allows event rotation to function
correctly, as although events are grouped into a cgroup, rotation always
selects the lowest group_index event to rotate (delete/insert into the
tree) and the min heap of iterators makes it so that the group_index order
is maintained.
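
For illustration only (sketch_cgroup_id() and sketch_groups_less() are
illustrative names, not code added by this patch), the RB tree key
effectively becomes the tuple (cpu, cgroup id, group_index), with
events that have no cgroup sorting first:

  /* Treat a missing cgroup as id 0 so such events sort first. */
  static int sketch_cgroup_id(const struct perf_event *event)
  {
  #ifdef CONFIG_CGROUP_PERF
          if (event->cgrp && event->cgrp->css.cgroup)
                  return event->cgrp->css.cgroup->id;
  #endif
          return 0;
  }

  /* Roughly the ordering established by perf_event_groups_less() below. */
  static bool sketch_groups_less(struct perf_event *left,
                                 struct perf_event *right)
  {
          if (left->cpu != right->cpu)
                  return left->cpu < right->cpu;
          if (sketch_cgroup_id(left) != sketch_cgroup_id(right))
                  return sketch_cgroup_id(left) < sketch_cgroup_id(right);
          return left->group_index < right->group_index;
  }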

Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/20190724223746.153620-3-irogers@google.com
---
 kernel/events/core.c | 97 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 87 insertions(+), 10 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3c44be7de44e..cb5fc47611c7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1576,6 +1576,30 @@ perf_event_groups_less(struct perf_event *left, struct perf_event *right)
 	if (left->cpu > right->cpu)
 		return false;
 
+#ifdef CONFIG_CGROUP_PERF
+	if (left->cgrp != right->cgrp) {
+		if (!left->cgrp || !left->cgrp->css.cgroup) {
+			/*
+			 * Left has no cgroup but right does, no cgroups come
+			 * first.
+			 */
+			return true;
+		}
+		if (!right->cgrp || !right->cgrp->css.cgroup) {
+			/*
+			 * Right has no cgroup but left does, no cgroups come
+			 * first.
+			 */
+			return false;
+		}
+		/* Two dissimilar cgroups, order by id. */
+		if (left->cgrp->css.cgroup->id < right->cgrp->css.cgroup->id)
+			return true;
+
+		return false;
+	}
+#endif
+
 	if (left->group_index < right->group_index)
 		return true;
 	if (left->group_index > right->group_index)
@@ -1655,25 +1679,48 @@ del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
 }
 
 /*
- * Get the leftmost event in the @cpu subtree.
+ * Get the leftmost event in the cpu/cgroup subtree.
  */
 static struct perf_event *
-perf_event_groups_first(struct perf_event_groups *groups, int cpu)
+perf_event_groups_first(struct perf_event_groups *groups, int cpu,
+			struct cgroup *cgrp)
 {
 	struct perf_event *node_event = NULL, *match = NULL;
 	struct rb_node *node = groups->tree.rb_node;
+#ifdef CONFIG_CGROUP_PERF
+	int node_cgrp_id, cgrp_id = 0;
+
+	if (cgrp)
+		cgrp_id = cgrp->id;
+#endif
 
 	while (node) {
 		node_event = container_of(node, struct perf_event, group_node);
 
 		if (cpu < node_event->cpu) {
 			node = node->rb_left;
-		} else if (cpu > node_event->cpu) {
+			continue;
+		}
+		if (cpu > node_event->cpu) {
 			node = node->rb_right;
-		} else {
-			match = node_event;
+			continue;
+		}
+#ifdef CONFIG_CGROUP_PERF
+		node_cgrp_id = 0;
+		if (node_event->cgrp && node_event->cgrp->css.cgroup)
+			node_cgrp_id = node_event->cgrp->css.cgroup->id;
+
+		if (cgrp_id < node_cgrp_id) {
 			node = node->rb_left;
+			continue;
+		}
+		if (cgrp_id > node_cgrp_id) {
+			node = node->rb_right;
+			continue;
 		}
+#endif
+		match = node_event;
+		node = node->rb_left;
 	}
 
 	return match;
@@ -1686,12 +1733,26 @@ static struct perf_event *
 perf_event_groups_next(struct perf_event *event)
 {
 	struct perf_event *next;
+#ifdef CONFIG_CGROUP_PERF
+	int curr_cgrp_id = 0;
+	int next_cgrp_id = 0;
+#endif
 
 	next = rb_entry_safe(rb_next(&event->group_node), typeof(*event), group_node);
-	if (next && next->cpu == event->cpu)
-		return next;
+	if (next == NULL || next->cpu != event->cpu)
+		return NULL;
 
-	return NULL;
+#ifdef CONFIG_CGROUP_PERF
+	if (event->cgrp && event->cgrp->css.cgroup)
+		curr_cgrp_id = event->cgrp->css.cgroup->id;
+
+	if (next->cgrp && next->cgrp->css.cgroup)
+		next_cgrp_id = next->cgrp->css.cgroup->id;
+
+	if (curr_cgrp_id != next_cgrp_id)
+		return NULL;
+#endif
+	return next;
 }
 
 /*
@@ -3453,6 +3514,9 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 				int (*func)(struct perf_event *, void *),
 				void *data)
 {
+#ifdef CONFIG_CGROUP_PERF
+	struct cgroup_subsys_state *css = NULL;
+#endif
 	/* Space for per CPU and/or any CPU event iterators. */
 	struct perf_event *itrs[2];
 	struct min_max_heap event_heap;
@@ -3468,6 +3532,11 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 		};
 
 		lockdep_assert_held(&cpuctx->ctx.lock);
+
+#ifdef CONFIG_CGROUP_PERF
+		if (cpuctx->cgrp)
+			css = &cpuctx->cgrp->css;
+#endif
 	} else {
 		event_heap = (struct min_max_heap){
 			.data = itrs,
@@ -3475,12 +3544,20 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 			.cap = ARRAY_SIZE(itrs),
 		};
 		/* Events not within a CPU context may be on any CPU. */
-		__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+		__heap_add(&event_heap, perf_event_groups_first(groups, -1,
+									NULL));
 
 	}
 	evt = event_heap.data;
 
-	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));
+	__heap_add(&event_heap, perf_event_groups_first(groups, cpu, NULL));
+
+#ifdef CONFIG_CGROUP_PERF
+	for (; css; css = css->parent) {
+		__heap_add(&event_heap, perf_event_groups_first(groups, cpu,
+								css->cgroup));
+	}
+#endif
 
 	heapify_all(&event_heap, &perf_min_heap);
 
-- 
2.24.0.432.g9d3f5f5b63-goog


* [PATCH v3 07/10] perf: simplify and rename visit_groups_merge
  2019-11-14  0:30 [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
                   ` (5 preceding siblings ...)
  2019-11-14  0:30 ` [PATCH v3 06/10] perf/cgroup: Order events in RB tree by cgroup id Ian Rogers
@ 2019-11-14  0:30 ` Ian Rogers
  2019-11-14 10:03   ` Peter Zijlstra
  2019-11-14  0:30 ` [PATCH v3 08/10] perf: cache perf_event_groups_first for cgroups Ian Rogers
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 80+ messages in thread
From: Ian Rogers @ 2019-11-14  0:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

To enable a future caching optimization, pass in whether
visit_groups_merge is operating on pinned or flexible groups. The
is_pinned argument makes the func argument redundant, so rename the
function to ctx_groups_sched_in as it now just schedules pinned or flexible
groups in. Compute the cpu and groups arguments locally to reduce the
argument list size. Remove sched_in_data as it repeats arguments already
passed in. Remove the unused data argument to pinned_sched_in.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 106 +++++++++++++++++--------------------------
 1 file changed, 41 insertions(+), 65 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index cb5fc47611c7..11594d8bbb2e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3509,10 +3509,18 @@ static void __heap_add(struct min_max_heap *heap, struct perf_event *event)
 	}
 }
 
-static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
-				struct perf_event_groups *groups, int cpu,
-				int (*func)(struct perf_event *, void *),
-				void *data)
+static int pinned_sched_in(struct perf_event_context *ctx,
+			struct perf_cpu_context *cpuctx,
+			struct perf_event *event);
+
+static int flexible_sched_in(struct perf_event_context *ctx,
+			struct perf_cpu_context *cpuctx,
+			struct perf_event *event,
+			int *can_add_hw);
+
+static int ctx_groups_sched_in(struct perf_event_context *ctx,
+			struct perf_cpu_context *cpuctx,
+			bool is_pinned)
 {
 #ifdef CONFIG_CGROUP_PERF
 	struct cgroup_subsys_state *css = NULL;
@@ -3522,9 +3530,13 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 	struct min_max_heap event_heap;
 	struct perf_event **evt;
 	struct perf_event *next;
-	int ret;
+	int ret, can_add_hw = 1;
+	int cpu = smp_processor_id();
+	struct perf_event_groups *groups = is_pinned
+		? &ctx->pinned_groups
+		: &ctx->flexible_groups;
 
-	if (cpuctx) {
+	if (ctx == &cpuctx->ctx) {
 		event_heap = (struct min_max_heap){
 			.data = cpuctx->itr_storage,
 			.size = 0,
@@ -3562,7 +3574,11 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 	heapify_all(&event_heap, &perf_min_heap);
 
 	while (event_heap.size) {
-		ret = func(*evt, data);
+		if (is_pinned)
+			ret = pinned_sched_in(ctx, cpuctx, *evt);
+		else
+			ret = flexible_sched_in(ctx, cpuctx, *evt, &can_add_hw);
+
 		if (ret)
 			return ret;
 
@@ -3576,25 +3592,19 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 	return 0;
 }
 
-struct sched_in_data {
-	struct perf_event_context *ctx;
-	struct perf_cpu_context *cpuctx;
-	int can_add_hw;
-};
-
-static int pinned_sched_in(struct perf_event *event, void *data)
+static int pinned_sched_in(struct perf_event_context *ctx,
+			struct perf_cpu_context *cpuctx,
+			struct perf_event *event)
 {
-	struct sched_in_data *sid = data;
-
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
 
 	if (!event_filter_match(event))
 		return 0;
 
-	if (group_can_go_on(event, sid->cpuctx, sid->can_add_hw)) {
-		if (!group_sched_in(event, sid->cpuctx, sid->ctx))
-			list_add_tail(&event->active_list, &sid->ctx->pinned_active);
+	if (group_can_go_on(event, cpuctx, 1)) {
+		if (!group_sched_in(event, cpuctx, ctx))
+			list_add_tail(&event->active_list, &ctx->pinned_active);
 	}
 
 	/*
@@ -3607,65 +3617,30 @@ static int pinned_sched_in(struct perf_event *event, void *data)
 	return 0;
 }
 
-static int flexible_sched_in(struct perf_event *event, void *data)
+static int flexible_sched_in(struct perf_event_context *ctx,
+			struct perf_cpu_context *cpuctx,
+			struct perf_event *event,
+			int *can_add_hw)
 {
-	struct sched_in_data *sid = data;
-
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
 
 	if (!event_filter_match(event))
 		return 0;
 
-	if (group_can_go_on(event, sid->cpuctx, sid->can_add_hw)) {
-		int ret = group_sched_in(event, sid->cpuctx, sid->ctx);
+	if (group_can_go_on(event, cpuctx, *can_add_hw)) {
+		int ret = group_sched_in(event, cpuctx, ctx);
 		if (ret) {
-			sid->can_add_hw = 0;
-			sid->ctx->rotate_necessary = 1;
+			*can_add_hw = 0;
+			ctx->rotate_necessary = 1;
 			return 0;
 		}
-		list_add_tail(&event->active_list, &sid->ctx->flexible_active);
+		list_add_tail(&event->active_list, &ctx->flexible_active);
 	}
 
 	return 0;
 }
 
-static void
-ctx_pinned_sched_in(struct perf_event_context *ctx,
-		    struct perf_cpu_context *cpuctx)
-{
-	struct sched_in_data sid = {
-		.ctx = ctx,
-		.cpuctx = cpuctx,
-		.can_add_hw = 1,
-	};
-
-	if (ctx != &cpuctx->ctx)
-		cpuctx = NULL;
-
-	visit_groups_merge(cpuctx, &ctx->pinned_groups,
-			   smp_processor_id(),
-			   pinned_sched_in, &sid);
-}
-
-static void
-ctx_flexible_sched_in(struct perf_event_context *ctx,
-		      struct perf_cpu_context *cpuctx)
-{
-	struct sched_in_data sid = {
-		.ctx = ctx,
-		.cpuctx = cpuctx,
-		.can_add_hw = 1,
-	};
-
-	if (ctx != &cpuctx->ctx)
-		cpuctx = NULL;
-
-	visit_groups_merge(cpuctx, &ctx->flexible_groups,
-			   smp_processor_id(),
-			   flexible_sched_in, &sid);
-}
-
 static void
 ctx_sched_in(struct perf_event_context *ctx,
 	     struct perf_cpu_context *cpuctx,
@@ -3702,11 +3677,12 @@ ctx_sched_in(struct perf_event_context *ctx,
 	 * in order to give them the best chance of going on.
 	 */
 	if (is_active & EVENT_PINNED)
-		ctx_pinned_sched_in(ctx, cpuctx);
+		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/true);
+
 
 	/* Then walk through the lower prio flexible groups */
 	if (is_active & EVENT_FLEXIBLE)
-		ctx_flexible_sched_in(ctx, cpuctx);
+		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/false);
 }
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
-- 
2.24.0.432.g9d3f5f5b63-goog


* [PATCH v3 08/10] perf: cache perf_event_groups_first for cgroups
  2019-11-14  0:30 [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
                   ` (6 preceding siblings ...)
  2019-11-14  0:30 ` [PATCH v3 07/10] perf: simplify and rename visit_groups_merge Ian Rogers
@ 2019-11-14  0:30 ` Ian Rogers
  2019-11-14 10:25   ` Peter Zijlstra
  2019-11-14  0:30 ` [PATCH v3 09/10] perf: optimize event_filter_match during sched_in Ian Rogers
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 80+ messages in thread
From: Ian Rogers @ 2019-11-14  0:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Add a per-CPU cache of the pinned and flexible perf_event_groups_first
value for a cgroup, avoiding O(log(#perf events)) rbtree searches during
sched_in.
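
Conceptually, sched_in can then fetch a cgroup's first event on a CPU
with a single per-CPU load; cgroup_first_event() below is only an
illustrative helper (not code added by this patch):

  /* Illustrative only: O(1) lookup replacing the rbtree search. */
  static struct perf_event *
  cgroup_first_event(struct perf_cgroup *cgrp, int cpu, bool pinned)
  {
          return pinned ? *per_cpu_ptr(cgrp->pinned_event, cpu)
                        : *per_cpu_ptr(cgrp->flexible_event, cpu);
  }

perf_event_groups_insert() only records an event when the cached slot
is still NULL, as later events for the same cgroup and CPU always land
further right, and perf_event_groups_delete() advances the slot to the
in-order successor when the cached event is removed.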

Based-on-work-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 include/linux/perf_event.h |  6 +++
 kernel/events/core.c       | 79 +++++++++++++++++++++++++++-----------
 2 files changed, 62 insertions(+), 23 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b3580afbf358..cfd0b320418c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -877,6 +877,12 @@ struct perf_cgroup_info {
 struct perf_cgroup {
 	struct cgroup_subsys_state	css;
 	struct perf_cgroup_info	__percpu *info;
+	/* A per-CPU cache of this cgroup's first event on that CPU in
+	 * pinned_groups and in flexible_groups. Avoids an rbtree search
+	 * during sched_in.
+	 */
+	struct perf_event * __percpu    *pinned_event;
+	struct perf_event * __percpu    *flexible_event;
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 11594d8bbb2e..9f0febf51d97 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1638,6 +1638,25 @@ perf_event_groups_insert(struct perf_event_groups *groups,
 
 	rb_link_node(&event->group_node, parent, node);
 	rb_insert_color(&event->group_node, &groups->tree);
+#ifdef CONFIG_CGROUP_PERF
+	if (is_cgroup_event(event)) {
+		struct perf_event **cgrp_event;
+
+		if (event->attr.pinned)
+			cgrp_event = per_cpu_ptr(event->cgrp->pinned_event,
+						event->cpu);
+		else
+			cgrp_event = per_cpu_ptr(event->cgrp->flexible_event,
+						event->cpu);
+		/*
+		 * Cgroup events for the same cgroup on the same CPU will
+		 * always be inserted at the right because of bigger
+		 * @groups->index. Only need to set *cgrp_event when it's NULL.
+		 */
+		if (!*cgrp_event)
+			*cgrp_event = event;
+	}
+#endif
 }
 
 /*
@@ -1652,6 +1671,9 @@ add_event_to_groups(struct perf_event *event, struct perf_event_context *ctx)
 	perf_event_groups_insert(groups, event);
 }
 
+static struct perf_event *
+perf_event_groups_next(struct perf_event *event);
+
 /*
  * Delete a group from a tree.
  */
@@ -1662,6 +1684,22 @@ perf_event_groups_delete(struct perf_event_groups *groups,
 	WARN_ON_ONCE(RB_EMPTY_NODE(&event->group_node) ||
 		     RB_EMPTY_ROOT(&groups->tree));
 
+#ifdef CONFIG_CGROUP_PERF
+	if (is_cgroup_event(event)) {
+		struct perf_event **cgrp_event;
+
+		if (event->attr.pinned)
+			cgrp_event = per_cpu_ptr(event->cgrp->pinned_event,
+						event->cpu);
+		else
+			cgrp_event = per_cpu_ptr(event->cgrp->flexible_event,
+						event->cpu);
+
+		if (*cgrp_event == event)
+			*cgrp_event = perf_event_groups_next(event);
+	}
+#endif
+
 	rb_erase(&event->group_node, &groups->tree);
 	init_event_group(event);
 }
@@ -1679,20 +1717,14 @@ del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
 }
 
 /*
- * Get the leftmost event in the cpu/cgroup subtree.
+ * Get the leftmost event in the cpu subtree without a cgroup (ie task or
+ * system-wide).
  */
 static struct perf_event *
-perf_event_groups_first(struct perf_event_groups *groups, int cpu,
-			struct cgroup *cgrp)
+perf_event_groups_first_no_cgroup(struct perf_event_groups *groups, int cpu)
 {
 	struct perf_event *node_event = NULL, *match = NULL;
 	struct rb_node *node = groups->tree.rb_node;
-#ifdef CONFIG_CGROUP_PERF
-	int node_cgrp_id, cgrp_id = 0;
-
-	if (cgrp)
-		cgrp_id = cgrp->id;
-#endif
 
 	while (node) {
 		node_event = container_of(node, struct perf_event, group_node);
@@ -1706,18 +1738,10 @@ perf_event_groups_first(struct perf_event_groups *groups, int cpu,
 			continue;
 		}
 #ifdef CONFIG_CGROUP_PERF
-		node_cgrp_id = 0;
-		if (node_event->cgrp && node_event->cgrp->css.cgroup)
-			node_cgrp_id = node_event->cgrp->css.cgroup->id;
-
-		if (cgrp_id < node_cgrp_id) {
+		if (node_event->cgrp) {
 			node = node->rb_left;
 			continue;
 		}
-		if (cgrp_id > node_cgrp_id) {
-			node = node->rb_right;
-			continue;
-		}
 #endif
 		match = node_event;
 		node = node->rb_left;
@@ -3556,18 +3580,27 @@ static int ctx_groups_sched_in(struct perf_event_context *ctx,
 			.cap = ARRAY_SIZE(itrs),
 		};
 		/* Events not within a CPU context may be on any CPU. */
-		__heap_add(&event_heap, perf_event_groups_first(groups, -1,
-									NULL));
+		__heap_add(&event_heap,
+			perf_event_groups_first_no_cgroup(groups, -1));
 
 	}
 	evt = event_heap.data;
 
-	__heap_add(&event_heap, perf_event_groups_first(groups, cpu, NULL));
+	__heap_add(&event_heap,
+		perf_event_groups_first_no_cgroup(groups, cpu));
 
 #ifdef CONFIG_CGROUP_PERF
 	for (; css; css = css->parent) {
-		__heap_add(&event_heap, perf_event_groups_first(groups, cpu,
-								css->cgroup));
+		struct perf_cgroup *cgrp;
+
+		/* root cgroup doesn't have events */
+		if (css->id == 1)
+			break;
+
+		cgrp = container_of(css, struct perf_cgroup, css);
+		__heap_add(&event_heap, is_pinned
+			? *per_cpu_ptr(cgrp->pinned_event, cpu)
+			: *per_cpu_ptr(cgrp->flexible_event, cpu));
 	}
 #endif
 
-- 
2.24.0.432.g9d3f5f5b63-goog


* [PATCH v3 09/10] perf: optimize event_filter_match during sched_in
  2019-11-14  0:30 [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
                   ` (7 preceding siblings ...)
  2019-11-14  0:30 ` [PATCH v3 08/10] perf: cache perf_event_groups_first for cgroups Ian Rogers
@ 2019-11-14  0:30 ` Ian Rogers
  2019-11-14  0:30 ` [PATCH v3 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch Ian Rogers
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-14  0:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

The caller has already verified the CPU and cgroup, so call
pmu_filter_match() directly.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9f0febf51d97..99ac8248a9b6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2196,8 +2196,11 @@ static inline int pmu_filter_match(struct perf_event *event)
 static inline int
 event_filter_match(struct perf_event *event)
 {
-	return (event->cpu == -1 || event->cpu == smp_processor_id()) &&
-	       perf_cgroup_match(event) && pmu_filter_match(event);
+	if (event->cpu != -1 && event->cpu != smp_processor_id())
+		return 0;
+	if (!perf_cgroup_match(event))
+		return 0;
+	return pmu_filter_match(event);
 }
 
 static void
@@ -3632,7 +3635,11 @@ static int pinned_sched_in(struct perf_event_context *ctx,
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
 
-	if (!event_filter_match(event))
+	/*
+	 * Avoid full event_filter_match as the caller verified the CPU and
+	 * cgroup before calling.
+	 */
+	if (!pmu_filter_match(event))
 		return 0;
 
 	if (group_can_go_on(event, cpuctx, 1)) {
@@ -3658,7 +3665,11 @@ static int flexible_sched_in(struct perf_event_context *ctx,
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
 
-	if (!event_filter_match(event))
+	/*
+	 * Avoid full event_filter_match as the caller verified the CPU and
+	 * cgroup before calling.
+	 */
+	if (!pmu_filter_match(event))
 		return 0;
 
 	if (group_can_go_on(event, cpuctx, *can_add_hw)) {
-- 
2.24.0.432.g9d3f5f5b63-goog


* [PATCH v3 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch
  2019-11-14  0:30 [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
                   ` (8 preceding siblings ...)
  2019-11-14  0:30 ` [PATCH v3 09/10] perf: optimize event_filter_match during sched_in Ian Rogers
@ 2019-11-14  0:30 ` Ian Rogers
  2019-11-14 10:43   ` Peter Zijlstra
  2019-11-14  0:42 ` [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 80+ messages in thread
From: Ian Rogers @ 2019-11-14  0:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

From: Kan Liang <kan.liang@linux.intel.com>

When counting system-wide events and cgroup events simultaneously, the
system-wide events are always scheduled out then back in during cgroup
switches, bringing extra overhead and possibly missing events. Switching
out system-wide flexible events may be necessary if the scheduled-in
task's cgroups have pinned events that need to be scheduled in at a higher
priority than the system-wide flexible events.
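
In other words, the switch-in side below boils down to the following
(condensed from the perf_cgroup_switch() hunk in this patch):

  if (cgroup_has_pinned_events(cpuctx->cgrp)) {
          /* Cgroup pinned events may need to preempt system-wide
           * flexible events, so those are switched out as well. */
          cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
          cpu_ctx_sched_in(cpuctx,
                           EVENT_ALL | EVENT_CGROUP_PINNED_ONLY, task);
  } else {
          /* Only the cgroup's own events need to be switched in. */
          cpu_ctx_sched_in(cpuctx,
                           EVENT_ALL | EVENT_CGROUP_ALL_ONLY, task);
  }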

Here is a test with 6 child cgroups (sibling cgroups), 1 parent cgroup
and system-wide events.
A specjbb benchmark is running in each child cgroup.
The perf command is as below.
   perf stat -e cycles,instructions -e cycles,instructions
   -e cycles,instructions -e cycles,instructions
   -e cycles,instructions -e cycles,instructions
   -e cycles,instructions -e cycles,instructions
   -G cgroup1,cgroup1,cgroup2,cgroup2,cgroup3,cgroup3
   -G cgroup4,cgroup4,cgroup5,cgroup5,cgroup6,cgroup6
   -G cgroup_parent,cgroup_parent
   -a -e cycles,instructions -I 1000

The average RT (Response Time) reported by specjbb is used as the key
performance metric (lower is better).
                                        RT(us)              Overhead
Baseline (no perf stat):                4286.9
Use cgroup perf, no patches:            4537.1                5.84%
Use cgroup perf, apply the patch:       4440.7                3.59%

Fixes: e5d1367f17ba ("perf: Add cgroup support")
---
This patch was rebased on top of https://lkml.org/lkml/2019/8/7/771
with some minor changes to the comments made by Ian Rogers
<irogers@google.com>.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 include/linux/perf_event.h |   1 +
 kernel/events/core.c       | 150 +++++++++++++++++++++++++++++++++----
 2 files changed, 135 insertions(+), 16 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index cfd0b320418c..f79f1cf1c2fb 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -877,6 +877,7 @@ struct perf_cgroup_info {
 struct perf_cgroup {
 	struct cgroup_subsys_state	css;
 	struct perf_cgroup_info	__percpu *info;
+	unsigned int			nr_pinned_event;
 	/* A per-CPU cache of this cgroup's first event on that CPU in
 	 * pinned_groups and in flexible_groups. Avoids an rbtree search
 	 * during sched_in.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 99ac8248a9b6..eb61c7b5157f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -362,8 +362,18 @@ enum event_type_t {
 	/* see ctx_resched() for details */
 	EVENT_CPU = 0x8,
 	EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
+
+	/* see perf_cgroup_switch() for details */
+	EVENT_CGROUP_FLEXIBLE_ONLY = 0x10,
+	EVENT_CGROUP_PINNED_ONLY = 0x20,
+	EVENT_CGROUP_ALL_ONLY = EVENT_CGROUP_FLEXIBLE_ONLY |
+				EVENT_CGROUP_PINNED_ONLY,
+
 };
 
+#define CGROUP_PINNED(type)	(type & EVENT_CGROUP_PINNED_ONLY)
+#define CGROUP_FLEXIBLE(type)	(type & EVENT_CGROUP_FLEXIBLE_ONLY)
+
 /*
  * perf_sched_events : >0 events exist
  * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
@@ -668,6 +678,20 @@ perf_event_set_state(struct perf_event *event, enum perf_event_state state)
 
 #ifdef CONFIG_CGROUP_PERF
 
+/* Skip system-wide CPU events if only cgroup events are required. */
+static inline bool
+perf_cgroup_skip_switch(enum event_type_t event_type,
+			struct perf_event *event,
+			bool pinned)
+{
+	if (event->cgrp)
+		return false;
+	if (pinned)
+		return !!CGROUP_PINNED(event_type);
+	else
+		return !!CGROUP_FLEXIBLE(event_type);
+}
+
 static inline bool
 perf_cgroup_match(struct perf_event *event)
 {
@@ -694,6 +718,8 @@ perf_cgroup_match(struct perf_event *event)
 
 static inline void perf_detach_cgroup(struct perf_event *event)
 {
+	if (event->attr.pinned)
+		event->cgrp->nr_pinned_event--;
 	css_put(&event->cgrp->css);
 	event->cgrp = NULL;
 }
@@ -781,6 +807,22 @@ perf_cgroup_set_timestamp(struct task_struct *task,
 	}
 }
 
+/* Check if cgroup and its ancestor have pinned events attached */
+static bool
+cgroup_has_pinned_events(struct perf_cgroup *cgrp)
+{
+	struct cgroup_subsys_state *css;
+	struct perf_cgroup *tmp_cgrp;
+
+	for (css = &cgrp->css; css; css = css->parent) {
+		tmp_cgrp = container_of(css, struct perf_cgroup, css);
+		if (tmp_cgrp->nr_pinned_event > 0)
+			return true;
+	}
+
+	return false;
+}
+
 static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
 
 #define PERF_CGROUP_SWOUT	0x1 /* cgroup switch out every event */
@@ -812,7 +854,22 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 		perf_pmu_disable(cpuctx->ctx.pmu);
 
 		if (mode & PERF_CGROUP_SWOUT) {
-			cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+			/*
+			 * The system-wide events and cgroup events share the
+			 * same cpuctx groups. Decide which events to be
+			 * scheduled out based on the types of events:
+			 * - EVENT_FLEXIBLE | EVENT_CGROUP_FLEXIBLE_ONLY:
+			 *   Only switch cgroup events from EVENT_FLEXIBLE
+			 *   groups.
+			 * - EVENT_PINNED | EVENT_CGROUP_PINNED_ONLY:
+			 *   Only switch cgroup events from EVENT_PINNED
+			 *   groups.
+			 * - EVENT_ALL | EVENT_CGROUP_ALL_ONLY:
+			 *   Only switch cgroup events from both EVENT_FLEXIBLE
+			 *   and EVENT_PINNED groups.
+			 */
+			cpu_ctx_sched_out(cpuctx,
+					EVENT_ALL | EVENT_CGROUP_ALL_ONLY);
 			/*
 			 * must not be done before ctxswout due
 			 * to event_filter_match() in event_sched_out()
@@ -831,7 +888,23 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 			 */
 			cpuctx->cgrp = perf_cgroup_from_task(task,
 							     &cpuctx->ctx);
-			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
+
+			/*
+			 * To keep the priority order of cpu pinned then cpu
+			 * flexible, if the new cgroup has pinned events then
+			 * sched out all system-wide flexible events before
+			 * sched in all events.
+			 */
+			if (cgroup_has_pinned_events(cpuctx->cgrp)) {
+				cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+				cpu_ctx_sched_in(cpuctx,
+					EVENT_ALL | EVENT_CGROUP_PINNED_ONLY,
+					task);
+			} else {
+				cpu_ctx_sched_in(cpuctx,
+					EVENT_ALL | EVENT_CGROUP_ALL_ONLY,
+					task);
+			}
 		}
 		perf_pmu_enable(cpuctx->ctx.pmu);
 		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -959,6 +1032,9 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 	cgrp = container_of(css, struct perf_cgroup, css);
 	event->cgrp = cgrp;
 
+	if (event->attr.pinned)
+		cgrp->nr_pinned_event++;
+
 	/*
 	 * all events in a group must monitor
 	 * the same cgroup because a task belongs
@@ -1032,6 +1108,14 @@ list_update_cgroup_event(struct perf_event *event,
 
 #else /* !CONFIG_CGROUP_PERF */
 
+static inline bool
+perf_cgroup_skip_switch(enum event_type_t event_type,
+			struct perf_event *event,
+			bool pinned)
+{
+	return false;
+}
+
 static inline bool
 perf_cgroup_match(struct perf_event *event)
 {
@@ -3203,13 +3287,25 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 
 	perf_pmu_disable(ctx->pmu);
 	if (is_active & EVENT_PINNED) {
-		list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
+		list_for_each_entry_safe(event, tmp, &ctx->pinned_active,
+					active_list) {
+			if (perf_cgroup_skip_switch(event_type, event, true)) {
+				ctx->is_active |= EVENT_PINNED;
+				continue;
+			}
 			group_sched_out(event, cpuctx, ctx);
+		}
 	}
 
 	if (is_active & EVENT_FLEXIBLE) {
-		list_for_each_entry_safe(event, tmp, &ctx->flexible_active, active_list)
+		list_for_each_entry_safe(event, tmp, &ctx->flexible_active,
+					active_list) {
+			if (perf_cgroup_skip_switch(event_type, event, false)) {
+				ctx->is_active |= EVENT_FLEXIBLE;
+				continue;
+			}
 			group_sched_out(event, cpuctx, ctx);
+		}
 	}
 	perf_pmu_enable(ctx->pmu);
 }
@@ -3538,16 +3634,19 @@ static void __heap_add(struct min_max_heap *heap, struct perf_event *event)
 
 static int pinned_sched_in(struct perf_event_context *ctx,
 			struct perf_cpu_context *cpuctx,
-			struct perf_event *event);
+			struct perf_event *event,
+			enum event_type_t event_type);
 
 static int flexible_sched_in(struct perf_event_context *ctx,
 			struct perf_cpu_context *cpuctx,
 			struct perf_event *event,
+			enum event_type_t event_type,
 			int *can_add_hw);
 
 static int ctx_groups_sched_in(struct perf_event_context *ctx,
 			struct perf_cpu_context *cpuctx,
-			bool is_pinned)
+			bool is_pinned,
+			enum event_type_t event_type)
 {
 #ifdef CONFIG_CGROUP_PERF
 	struct cgroup_subsys_state *css = NULL;
@@ -3610,10 +3709,12 @@ static int ctx_groups_sched_in(struct perf_event_context *ctx,
 	heapify_all(&event_heap, &perf_min_heap);
 
 	while (event_heap.size) {
-		if (is_pinned)
-			ret = pinned_sched_in(ctx, cpuctx, *evt);
-		else
-			ret = flexible_sched_in(ctx, cpuctx, *evt, &can_add_hw);
+		if (is_pinned) {
+			ret = pinned_sched_in(ctx, cpuctx, *evt, event_type);
+		} else {
+			ret = flexible_sched_in(ctx, cpuctx, *evt, event_type,
+						&can_add_hw);
+		}
 
 		if (ret)
 			return ret;
@@ -3630,11 +3731,15 @@ static int ctx_groups_sched_in(struct perf_event_context *ctx,
 
 static int pinned_sched_in(struct perf_event_context *ctx,
 			struct perf_cpu_context *cpuctx,
-			struct perf_event *event)
+			struct perf_event *event,
+			enum event_type_t event_type)
 {
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
 
+	if (perf_cgroup_skip_switch(event_type, event, true))
+		return 0;
+
 	/*
 	 * Avoid full event_filter_match as the caller verified the CPU and
 	 * cgroup before calling.
@@ -3660,11 +3765,15 @@ static int pinned_sched_in(struct perf_event_context *ctx,
 static int flexible_sched_in(struct perf_event_context *ctx,
 			struct perf_cpu_context *cpuctx,
 			struct perf_event *event,
+			enum event_type_t event_type,
 			int *can_add_hw)
 {
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
 
+	if (perf_cgroup_skip_switch(event_type, event, false))
+		return 0;
+
 	/*
 	 * Avoid full event_filter_match as the caller verified the CPU and
 	 * cgroup before calling.
@@ -3691,6 +3800,7 @@ ctx_sched_in(struct perf_event_context *ctx,
 	     enum event_type_t event_type,
 	     struct task_struct *task)
 {
+	enum event_type_t ctx_event_type = event_type & EVENT_ALL;
 	int is_active = ctx->is_active;
 	u64 now;
 
@@ -3699,7 +3809,7 @@ ctx_sched_in(struct perf_event_context *ctx,
 	if (likely(!ctx->nr_events))
 		return;
 
-	ctx->is_active |= (event_type | EVENT_TIME);
+	ctx->is_active |= (ctx_event_type | EVENT_TIME);
 	if (ctx->task) {
 		if (!is_active)
 			cpuctx->task_ctx = ctx;
@@ -3719,14 +3829,22 @@ ctx_sched_in(struct perf_event_context *ctx,
 	/*
 	 * First go through the list and put on any pinned groups
 	 * in order to give them the best chance of going on.
+	 *
+	 * System-wide events may not have been scheduled out for a cgroup
+	 * switch.  Unconditionally call sched_in() for cgroup events and
+	 * it will filter the events.
 	 */
-	if (is_active & EVENT_PINNED)
-		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/true);
+	if ((is_active & EVENT_PINNED) || CGROUP_PINNED(event_type)) {
+		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/true,
+				CGROUP_PINNED(event_type));
+	}
 
 
 	/* Then walk through the lower prio flexible groups */
-	if (is_active & EVENT_FLEXIBLE)
-		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/false);
+	if ((is_active & EVENT_FLEXIBLE) || CGROUP_FLEXIBLE(event_type)) {
+		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/false,
+				CGROUP_FLEXIBLE(event_type));
+	}
 }
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
-- 
2.24.0.432.g9d3f5f5b63-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 00/10] Optimize cgroup context switch
  2019-11-14  0:30 [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
                   ` (9 preceding siblings ...)
  2019-11-14  0:30 ` [PATCH v3 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch Ian Rogers
@ 2019-11-14  0:42 ` Ian Rogers
  2019-11-14 10:45 ` Peter Zijlstra
  2019-11-16  1:18 ` [PATCH v4 " Ian Rogers
  12 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-14  0:42 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang, LKML
  Cc: Stephane Eranian, Andi Kleen

Apologies, I missed the in-reply-to
<20190724223746.153620-1-irogers@google.com>.

Ian

On Wed, Nov 13, 2019 at 4:30 PM Ian Rogers <irogers@google.com> wrote:
>
> Avoid iterating over all per-CPU events during cgroup changing context
> switches by organizing events by cgroup.
>
> To make an efficient set of iterators, introduce a min max heap
> utility with test.
>
> These patches include a caching algorithm to improve the search for
> the first event in a group by Kan Liang <kan.liang@linux.intel.com> as
> well as rebasing hit "optimize event_filter_match during sched_in"
> from https://lkml.org/lkml/2019/8/7/771.
>
> The v2 patch set was modified by Peter Zijlstra in his perf/cgroup
> branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git
>
> These patches follow Peter's reorganization and his fixes to the
> perf_cpu_context min_heap storage code.
>
> Ian Rogers (8):
>   lib: introduce generic min max heap
>   perf: Use min_max_heap in visit_groups_merge
>   perf: Add per perf_cpu_context min_heap storage
>   perf/cgroup: Grow per perf_cpu_context heap storage
>   perf/cgroup: Order events in RB tree by cgroup id
>   perf: simplify and rename visit_groups_merge
>   perf: cache perf_event_groups_first for cgroups
>   perf: optimize event_filter_match during sched_in
>
> Kan Liang (1):
>   perf/cgroup: Do not switch system-wide events in cgroup switch
>
> Peter Zijlstra (1):
>   perf/cgroup: Reorder perf_cgroup_connect()
>
>  include/linux/min_max_heap.h | 134 +++++++++
>  include/linux/perf_event.h   |  14 +
>  kernel/events/core.c         | 512 ++++++++++++++++++++++++++++-------
>  lib/Kconfig.debug            |  10 +
>  lib/Makefile                 |   1 +
>  lib/test_min_max_heap.c      | 194 +++++++++++++
>  6 files changed, 769 insertions(+), 96 deletions(-)
>  create mode 100644 include/linux/min_max_heap.h
>  create mode 100644 lib/test_min_max_heap.c
>
> --
> 2.24.0.432.g9d3f5f5b63-goog
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 01/10] perf/cgroup: Reorder perf_cgroup_connect()
  2019-11-14  0:30 ` [PATCH v3 01/10] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
@ 2019-11-14  8:50   ` Peter Zijlstra
  0 siblings, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2019-11-14  8:50 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen


Hurm, you didn't fix my missing Changelog.. 

On Wed, Nov 13, 2019 at 04:30:33PM -0800, Ian Rogers wrote:
> From: Peter Zijlstra <peterz@infradead.org>

Move perf_cgroup_connect() after perf_event_alloc(), such that we can
find/use the PMU's cpu context.

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  kernel/events/core.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index cfd89b4a02d8..0dce28b0aae0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -10597,12 +10597,6 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>  	if (!has_branch_stack(event))
>  		event->attr.branch_sample_type = 0;
>  
> -	if (cgroup_fd != -1) {
> -		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
> -		if (err)
> -			goto err_ns;
> -	}
> -
>  	pmu = perf_init_event(event);
>  	if (IS_ERR(pmu)) {
>  		err = PTR_ERR(pmu);
> @@ -10615,6 +10609,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>  		goto err_pmu;
>  	}
>  
> +	if (cgroup_fd != -1) {
> +		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
> +		if (err)
> +			goto err_pmu;
> +	}
> +
>  	err = exclusive_event_init(event);
>  	if (err)
>  		goto err_pmu;
> @@ -10675,12 +10675,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>  	exclusive_event_destroy(event);
>  
>  err_pmu:
> +	if (is_cgroup_event(event))
> +		perf_detach_cgroup(event);
>  	if (event->destroy)
>  		event->destroy(event);
>  	module_put(pmu->module);
>  err_ns:
> -	if (is_cgroup_event(event))
> -		perf_detach_cgroup(event);
>  	if (event->ns)
>  		put_pid_ns(event->ns);
>  	if (event->hw.target)
> -- 
> 2.24.0.432.g9d3f5f5b63-goog
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 02/10] lib: introduce generic min max heap
  2019-11-14  0:30 ` [PATCH v3 02/10] lib: introduce generic min max heap Ian Rogers
@ 2019-11-14  9:32   ` Peter Zijlstra
  2019-11-14  9:35   ` Peter Zijlstra
  2019-11-17 18:28   ` Joe Perches
  2 siblings, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2019-11-14  9:32 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen

On Wed, Nov 13, 2019 at 04:30:34PM -0800, Ian Rogers wrote:
> Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  include/linux/min_max_heap.h | 134 ++++++++++++++++++++++++
>  lib/Kconfig.debug            |  10 ++
>  lib/Makefile                 |   1 +
>  lib/test_min_max_heap.c      | 194 +++++++++++++++++++++++++++++++++++
>  4 files changed, 339 insertions(+)
>  create mode 100644 include/linux/min_max_heap.h
>  create mode 100644 lib/test_min_max_heap.c
> 
> diff --git a/include/linux/min_max_heap.h b/include/linux/min_max_heap.h
> new file mode 100644
> index 000000000000..ea7764a8252a
> --- /dev/null
> +++ b/include/linux/min_max_heap.h
> @@ -0,0 +1,134 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MIN_MAX_HEAP_H
> +#define _LINUX_MIN_MAX_HEAP_H
> +
> +#include <linux/bug.h>
> +#include <linux/string.h>
> +

Make this a kerneldoc comment and lose the comments in the structure.

> +/*
> + * Data structure used to hold a min or max heap, the number of elements varies
> + * but the maximum size is fixed.
> + */
> +struct min_max_heap {
> +	/* Start of array holding the heap elements. */
> +	void *data;
> +	/* Number of elements currently in min-heap. */
> +	int size;
> +	/* Maximum number of elements that can be held in current storage. */
> +	int cap;

You've got the naming all wrong; size is the size of the allocation, num
is the number of elements in use.

> +};
> +

Maybe do a kerneldoc comment for this structure too, that keeps the
definition less cluttered.

/**
 * struct min_max_heap_callbacks - const data/functions to customise the minmax heap
 * @elem_size:		the size of each element in bytes
 * @cmp:		partial order function for this heap
 *			'less'/'<' for min-heap, 'greater'/'>' for max-heap
 * @swp:		swap function.
 */
> +struct min_max_heap_callbacks {
> +	/* Size of elements in the heap. */
> +	int elem_size;
> +	/*
> +	 * A function which returns *lhs < *rhs or *lhs > *rhs depending on
> +	 * whether this is a min or a max heap. Note, another compare function
> +	 * style in the kernel will return -ve, 0 and +ve and won't handle
> +	 * minimum integer correctly if implemented as a subtract.
> +	 */
> +	bool (*cmp)(const void *lhs, const void *rhs);
> +	/* Swap the element values at lhs with those at rhs. */
> +	void (*swp)(void *lhs, void *rhs);
> +};

Personally I'd just call the whole thing a minheap and call the compare
function less and leave it at that. Sure if you flip the order you'll
get a maxheap but that's fairly obvious.
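
Something like the below, purely as a sketch of the naming (nr/size as per
the comment above; none of this is a hard requirement):

struct min_heap {
	void *data;
	int nr;		/* number of elements currently in the heap */
	int size;	/* number of elements the storage can hold */
};

struct min_heap_callbacks {
	int elem_size;
	bool (*less)(const void *lhs, const void *rhs);
	void (*swp)(void *lhs, void *rhs);
};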

> +
> +/* Sift the element at pos down the heap. */
> +static inline void heapify(struct min_max_heap *heap, int pos,

Given this lives in the global namespace (this is C), maybe pick a
slightly more specific name, like min_max_heapify().

> +			const struct min_max_heap_callbacks *func) {

This is against coding style. Functions get their opening brace on a new
line. The rest of your patch has this right, why not this one?

> +	void *left_child, *right_child, *parent, *large_or_smallest;

I'm not a fan of excessively long variable names, they make it so much
harder to read code.

	void *left, *right, *parent, *pivot;

> +	char *data = (char *)heap->data;

What's the deal with that char nonsense? GCC does void* arithmetic just
right, also C will silently cast void* to any other pointer type.

> +
> +	for (;;) {
> +		if (pos * 2 + 1 >= heap->size)
> +			break;
> +
> +		left_child = data + ((pos * 2 + 1) * func->elem_size);
> +		parent = data + (pos * func->elem_size);

You can reduce the number of multiplications here. You have 3, and IIRC
you only need 1.

Set parent before the loop, compute right as left + size, and hand
either down as parent for the next iteration.
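
Roughly like so; an utterly untested sketch that assumes the char * cast
below is gone, so 'data' is the plain void * from heap->data (gcc's void *
arithmetic does the rest), and that uses the shorter names suggested above:

	parent = data + pos * func->elem_size;
	for (;;) {
		if (2 * pos + 1 >= heap->size)
			break;

		/* one multiply per level; right is simply left + elem_size */
		left = data + (2 * pos + 1) * func->elem_size;
		right = left + func->elem_size;

		pivot = parent;
		if (func->cmp(left, pivot))
			pivot = left;
		if (2 * pos + 2 < heap->size && func->cmp(right, pivot))
			pivot = right;

		if (pivot == parent)
			break;

		func->swp(pivot, parent);
		/* hand the winning child down as the next parent */
		parent = pivot;
		pos = (pivot == left) ? 2 * pos + 1 : 2 * pos + 2;
	}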

> +		large_or_smallest = parent;
> +		if (func->cmp(left_child, large_or_smallest))
> +			large_or_smallest = left_child;
> +
> +		if (pos * 2 + 2 < heap->size) {
> +			right_child = data + ((pos * 2 + 2) * func->elem_size);
> +			if (func->cmp(right_child, large_or_smallest))
> +				large_or_smallest = right_child;
> +		}
> +		if (large_or_smallest == parent)
> +			break;
> +		func->swp(large_or_smallest, parent);
> +		if (large_or_smallest == left_child)
> +			pos = (pos * 2) + 1;
> +		else
> +			pos = (pos * 2) + 2;
> +	}

I'm a little confused, normally (2*pos) is left and (2*pos+1) is right,
you seem to have used (2*pos + 1) and (2*pos + 2).

Also, I'm thinking the above can be helped with a little helper:

static inline int min_max_child(int pos, bool right)
{
	return 2 * pos + 1 + right;
}

> +}
> +
> +/* Floyd's approach to heapification that is O(size). */
> +static inline void
> +heapify_all(struct min_max_heap *heap,

min_max_heapify_all()

> +	const struct min_max_heap_callbacks *func)
> +{
> +	int i;
> +
> +	for (i = heap->size / 2; i >= 0; i--)

Where does that >= come from?

> +		heapify(heap, i, func);
> +}
> +
> +/* Remove minimum element from the heap, O(log2(size)). */
> +static inline void
> +heap_pop(struct min_max_heap *heap, const struct min_max_heap_callbacks *func)
> +{
> +	char *data = (char *)heap->data;

more silly char stuff

> +
> +	if (WARN_ONCE(heap->size <= 0, "Popping an empty heap"))
> +		return;
> +
> +	/* Place last element at the root (position 0) and then sift down. */
> +	heap->size--;
> +	memcpy(data, data + (heap->size * func->elem_size), func->elem_size);
> +	heapify(heap, 0, func);
> +}
> +
> +/*
> + * Remove the minimum element and then push the given element. The
> + * implementation performs 1 sift (O(log2(size))) and is therefore more
> + * efficient than a pop followed by a push that does 2.
> + */
> +static void heap_pop_push(struct min_max_heap *heap,
> +			const void *element,
> +			const struct min_max_heap_callbacks *func)
> +{
> +	char *data = (char *)heap->data;

delete it already

> +
> +	memcpy(data, element, func->elem_size);
> +	heapify(heap, 0, func);
> +}
> +
> +/* Push an element on to the heap, O(log2(size)). */
> +static inline void
> +heap_push(struct min_max_heap *heap, const void *element,
> +	const struct min_max_heap_callbacks *func)
> +{
> +	void *child, *parent;
> +	int pos;
> +	char *data = (char *)heap->data;

there are no strings here...

> +
> +	if (WARN_ONCE(heap->size >= heap->cap, "Pushing on a full heap"))
> +		return;
> +
> +	/* Place at the end of data. */
> +	pos = heap->size;
> +	memcpy(data + (pos * func->elem_size), element, func->elem_size);
> +	heap->size++;
> +
> +	/* Sift up. */
> +	for (; pos > 0; pos = (pos - 1) / 2) {

And here you have '>' in direct conflict with heapify_all()

> +		child = data + (pos * func->elem_size);
> +		parent = data + ((pos - 1) / 2) * func->elem_size;
> +		if (func->cmp(parent, child))
> +			break;
> +		func->swp(parent, child);

		child = parent;

and lose one multiplication.

> +	}
> +}
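
IOW something like this for the sift up, again utterly untested and with
the same assumption that 'data' is a plain void *:

	/* place at the end of data */
	pos = heap->size;
	child = data + pos * func->elem_size;
	memcpy(child, element, func->elem_size);
	heap->size++;

	/* sift up; carry the child pointer, only recompute the parent */
	for (; pos > 0; pos = (pos - 1) / 2) {
		parent = data + ((pos - 1) / 2) * func->elem_size;
		if (func->cmp(parent, child))
			break;
		func->swp(parent, child);
		child = parent;
	}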


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 02/10] lib: introduce generic min max heap
  2019-11-14  0:30 ` [PATCH v3 02/10] lib: introduce generic min max heap Ian Rogers
  2019-11-14  9:32   ` Peter Zijlstra
@ 2019-11-14  9:35   ` Peter Zijlstra
  2019-11-17 18:28   ` Joe Perches
  2 siblings, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2019-11-14  9:35 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen

On Wed, Nov 13, 2019 at 04:30:34PM -0800, Ian Rogers wrote:
> +/*
> + * Remove the minimum element and then push the given element. The
> + * implementation performs 1 sift (O(log2(size))) and is therefore more
> + * efficient than a pop followed by a push that does 2.
> + */
> +static void heap_pop_push(struct min_max_heap *heap,
> +			const void *element,
> +			const struct min_max_heap_callbacks *func)
> +{
> +	char *data = (char *)heap->data;
> +
> +	memcpy(data, element, func->elem_size);
> +	heapify(heap, 0, func);
> +}

I'm not a fan of this operation. It has a weird name and it is utterly
trivial.


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 03/10] perf: Use min_max_heap in visit_groups_merge
  2019-11-14  0:30 ` [PATCH v3 03/10] perf: Use min_max_heap in visit_groups_merge Ian Rogers
@ 2019-11-14  9:39   ` Peter Zijlstra
  0 siblings, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2019-11-14  9:39 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen

On Wed, Nov 13, 2019 at 04:30:35PM -0800, Ian Rogers wrote:

Changelog goes here. Mostly it's about how we want to extend the
merge-sort from the 2 inputs we have today.

> Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Ian Rogers <irogers@google.com>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 04/10] perf: Add per perf_cpu_context min_heap storage
  2019-11-14  0:30 ` [PATCH v3 04/10] perf: Add per perf_cpu_context min_heap storage Ian Rogers
@ 2019-11-14  9:51   ` Peter Zijlstra
  2019-11-16  1:19     ` Ian Rogers
  0 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2019-11-14  9:51 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen

On Wed, Nov 13, 2019 at 04:30:36PM -0800, Ian Rogers wrote:
> +	if (cpuctx) {
> +		event_heap = (struct min_max_heap){
> +			.data = cpuctx->itr_storage,
> +			.size = 0,

C guarantees that unnamed fields get to be 0
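
That is, the initializer can simply drop the field (sketch):

	event_heap = (struct min_max_heap){
		.data = cpuctx->itr_storage,
		.cap = cpuctx->itr_storage_cap,
	};	/* .size is implicitly zero */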

> +			.cap = cpuctx->itr_storage_cap,
> +		};
> +	} else {
> +		event_heap = (struct min_max_heap){
> +			.data = itrs,
> +			.size = 0,

idem.

> +			.cap = ARRAY_SIZE(itrs),
> +		};
> +		/* Events not within a CPU context may be on any CPU. */
> +		__heap_add(&event_heap, perf_event_groups_first(groups, -1));
> +

spurious whitespace

> +	}
> +	evt = event_heap.data;
> +
>  	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 05/10] perf/cgroup: Grow per perf_cpu_context heap storage
  2019-11-14  0:30 ` [PATCH v3 05/10] perf/cgroup: Grow per perf_cpu_context heap storage Ian Rogers
@ 2019-11-14  9:54   ` Peter Zijlstra
  0 siblings, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2019-11-14  9:54 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen

On Wed, Nov 13, 2019 at 04:30:37PM -0800, Ian Rogers wrote:
> Allow the per-CPU min heap storage to have sufficient space for per-cgroup
> iterators.
> 
> Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  kernel/events/core.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 47 insertions(+)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 0dab60bf5935..3c44be7de44e 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -892,6 +892,47 @@ static inline void perf_cgroup_sched_in(struct task_struct *prev,
>  	rcu_read_unlock();
>  }
>  
> +static int perf_cgroup_ensure_itr_storage_cap(struct perf_event *event,

That's a ludicrous function name.

> +					struct cgroup_subsys_state *css)
> +{

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 07/10] perf: simplify and rename visit_groups_merge
  2019-11-14  0:30 ` [PATCH v3 07/10] perf: simplify and rename visit_groups_merge Ian Rogers
@ 2019-11-14 10:03   ` Peter Zijlstra
  2019-11-16  1:20     ` Ian Rogers
  0 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2019-11-14 10:03 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen

On Wed, Nov 13, 2019 at 04:30:39PM -0800, Ian Rogers wrote:
> To enable a future caching optimization, pass in whether
> visit_groups_merge is operating on pinned or flexible groups. The
> is_pinned argument makes the func argument redundant, rename the
> function to ctx_groups_sched_in as it just schedules pinned or flexible
> groups in. Compute the cpu and groups arguments locally to reduce the
> argument list size. Remove sched_in_data as it repeats arguments already
> passed in. Remove the unused data argument to pinned_sched_in.

Where did my first two patches go? Why aren't
{pinned,flexible}_sched_in() merged?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 08/10] perf: cache perf_event_groups_first for cgroups
  2019-11-14  0:30 ` [PATCH v3 08/10] perf: cache perf_event_groups_first for cgroups Ian Rogers
@ 2019-11-14 10:25   ` Peter Zijlstra
  2019-11-16  1:20     ` Ian Rogers
  0 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2019-11-14 10:25 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen

On Wed, Nov 13, 2019 at 04:30:40PM -0800, Ian Rogers wrote:
> Add a per-CPU cache of the pinned and flexible perf_event_groups_first
> value for a cgroup avoiding an O(log(#perf events)) searches during
> sched_in.
> 
> Based-on-work-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  include/linux/perf_event.h |  6 +++
>  kernel/events/core.c       | 79 +++++++++++++++++++++++++++-----------
>  2 files changed, 62 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index b3580afbf358..cfd0b320418c 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -877,6 +877,12 @@ struct perf_cgroup_info {
>  struct perf_cgroup {
>  	struct cgroup_subsys_state	css;
>  	struct perf_cgroup_info	__percpu *info;
> +	/* A cache of the first event with the perf_cpu_context's
> +	 * perf_event_context for the first event in pinned_groups or
> +	 * flexible_groups. Avoids an rbtree search during sched_in.
> +	 */

Broken comment style.

> +	struct perf_event * __percpu    *pinned_event;
> +	struct perf_event * __percpu    *flexible_event;

Where is the actual storage allocated? There is a conspicuous lack of
alloc_percpu() in this patch, see for example perf_cgroup_css_alloc()
which fills out the above @info field.
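
I'd expect something along these lines in perf_cgroup_css_alloc(), next to
the existing @info allocation, plus matching free_percpu() calls in
perf_cgroup_css_free(). Sketch only, with 'jc' being the local perf_cgroup
pointer in that function:

	jc->pinned_event = alloc_percpu(struct perf_event *);
	jc->flexible_event = alloc_percpu(struct perf_event *);
	if (!jc->pinned_event || !jc->flexible_event) {
		/* free_percpu(NULL) is a no-op, so this covers both failures */
		free_percpu(jc->pinned_event);
		free_percpu(jc->flexible_event);
		free_percpu(jc->info);
		kfree(jc);
		return ERR_PTR(-ENOMEM);
	}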

>  };
>  
>  /*
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 11594d8bbb2e..9f0febf51d97 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1638,6 +1638,25 @@ perf_event_groups_insert(struct perf_event_groups *groups,
>  
>  	rb_link_node(&event->group_node, parent, node);
>  	rb_insert_color(&event->group_node, &groups->tree);
> +#ifdef CONFIG_CGROUP_PERF
> +	if (is_cgroup_event(event)) {
> +		struct perf_event **cgrp_event;
> +
> +		if (event->attr.pinned)
> +			cgrp_event = per_cpu_ptr(event->cgrp->pinned_event,
> +						event->cpu);
> +		else
> +			cgrp_event = per_cpu_ptr(event->cgrp->flexible_event,
> +						event->cpu);

Codingstyle requires { } here (or just bust the line length a little).

> +		/*
> +		 * Cgroup events for the same cgroup on the same CPU will
> +		 * always be inserted at the right because of bigger
> +		 * @groups->index. Only need to set *cgrp_event when it's NULL.
> +		 */
> +		if (!*cgrp_event)
> +			*cgrp_event = event;

I would feel much better if you had some actual leftmost logic in the
insertion iteration.
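
E.g. something like the below (sketch; perf_event_groups_less() is only a
stand-in name for the comparison the insertion loop above already
open-codes):

	if (!*cgrp_event || perf_event_groups_less(event, *cgrp_event))
		*cgrp_event = event;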

> +	}
> +#endif
>  }
>  
>  /*
> @@ -1652,6 +1671,9 @@ add_event_to_groups(struct perf_event *event, struct perf_event_context *ctx)
>  	perf_event_groups_insert(groups, event);
>  }
>  
> +static struct perf_event *
> +perf_event_groups_next(struct perf_event *event);
> +
>  /*
>   * Delete a group from a tree.
>   */
> @@ -1662,6 +1684,22 @@ perf_event_groups_delete(struct perf_event_groups *groups,
>  	WARN_ON_ONCE(RB_EMPTY_NODE(&event->group_node) ||
>  		     RB_EMPTY_ROOT(&groups->tree));
>  
> +#ifdef CONFIG_CGROUP_PERF
> +	if (is_cgroup_event(event)) {
> +		struct perf_event **cgrp_event;
> +
> +		if (event->attr.pinned)
> +			cgrp_event = per_cpu_ptr(event->cgrp->pinned_event,
> +						event->cpu);
> +		else
> +			cgrp_event = per_cpu_ptr(event->cgrp->flexible_event,
> +						event->cpu);

Codingstyle again.

> +
> +		if (*cgrp_event == event)
> +			*cgrp_event = perf_event_groups_next(event);
> +	}
> +#endif
> +
>  	rb_erase(&event->group_node, &groups->tree);
>  	init_event_group(event);
>  }
> @@ -1679,20 +1717,14 @@ del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
>  }
>  
>  /*
> - * Get the leftmost event in the cpu/cgroup subtree.
> + * Get the leftmost event in the cpu subtree without a cgroup (ie task or
> + * system-wide).
>   */
>  static struct perf_event *
> -perf_event_groups_first(struct perf_event_groups *groups, int cpu,
> -			struct cgroup *cgrp)
> +perf_event_groups_first_no_cgroup(struct perf_event_groups *groups, int cpu)

I'm going to impose a function name length limit soon :/ That's insane
(again).

>  {
>  	struct perf_event *node_event = NULL, *match = NULL;
>  	struct rb_node *node = groups->tree.rb_node;
> -#ifdef CONFIG_CGROUP_PERF
> -	int node_cgrp_id, cgrp_id = 0;
> -
> -	if (cgrp)
> -		cgrp_id = cgrp->id;
> -#endif
>  
>  	while (node) {
>  		node_event = container_of(node, struct perf_event, group_node);
> @@ -1706,18 +1738,10 @@ perf_event_groups_first(struct perf_event_groups *groups, int cpu,
>  			continue;
>  		}
>  #ifdef CONFIG_CGROUP_PERF
> -		node_cgrp_id = 0;
> -		if (node_event->cgrp && node_event->cgrp->css.cgroup)
> -			node_cgrp_id = node_event->cgrp->css.cgroup->id;
> -
> -		if (cgrp_id < node_cgrp_id) {
> +		if (node_event->cgrp) {
>  			node = node->rb_left;
>  			continue;
>  		}
> -		if (cgrp_id > node_cgrp_id) {
> -			node = node->rb_right;
> -			continue;
> -		}
>  #endif
>  		match = node_event;
>  		node = node->rb_left;

Also, just leave that in and let callers have: .cgrp = NULL. Then you
can forgo that monstrous name.


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch
  2019-11-14  0:30 ` [PATCH v3 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch Ian Rogers
@ 2019-11-14 10:43   ` Peter Zijlstra
  2019-11-14 13:46     ` Liang, Kan
  0 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2019-11-14 10:43 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen

On Wed, Nov 13, 2019 at 04:30:42PM -0800, Ian Rogers wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> When counting system-wide events and cgroup events simultaneously, the
> system-wide events are always scheduled out then back in during cgroup
> switches, bringing extra overhead and possibly missing events. Switching
> out system wide flexible events may be necessary if the scheduled in
> task's cgroups have pinned events that need to be scheduled in at a higher
> priority than the system wide flexible events.

I'm thinking this patch is actively broken. groups->index 'group' wide
and therefore across cpu/cgroup boundaries.

There is no !cgroup to cgroup hierarchy as this patch seems to assume,
specifically look at how the merge sort in visit_groups_merge() allows
cgroup events to be picked before !cgroup events.



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 00/10] Optimize cgroup context switch
  2019-11-14  0:30 [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
                   ` (10 preceding siblings ...)
  2019-11-14  0:42 ` [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
@ 2019-11-14 10:45 ` Peter Zijlstra
  2019-11-14 18:17   ` Ian Rogers
  2019-11-16  1:18 ` [PATCH v4 " Ian Rogers
  12 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2019-11-14 10:45 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen

On Wed, Nov 13, 2019 at 04:30:32PM -0800, Ian Rogers wrote:
> Avoid iterating over all per-CPU events during cgroup changing context
> switches by organizing events by cgroup.

When last we spoke (Plumbers in Lisbon) you mentioned that this
optimization was yielding far less than expected. You had graphs showing
how the use of cgroups impacted event scheduling time and how this patch
set only reduced that a little.

Any update on all that? There seems to be a conspicuous lack of such
data here.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch
  2019-11-14 10:43   ` Peter Zijlstra
@ 2019-11-14 13:46     ` Liang, Kan
  2019-11-14 13:57       ` Peter Zijlstra
  0 siblings, 1 reply; 80+ messages in thread
From: Liang, Kan @ 2019-11-14 13:46 UTC (permalink / raw)
  To: Peter Zijlstra, Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, linux-kernel,
	Stephane Eranian, Andi Kleen



On 11/14/2019 5:43 AM, Peter Zijlstra wrote:
> On Wed, Nov 13, 2019 at 04:30:42PM -0800, Ian Rogers wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> When counting system-wide events and cgroup events simultaneously, the
>> system-wide events are always scheduled out then back in during cgroup
>> switches, bringing extra overhead and possibly missing events. Switching
>> out system wide flexible events may be necessary if the scheduled in
>> task's cgroups have pinned events that need to be scheduled in at a higher
>> priority than the system wide flexible events.
> 
> I'm thinking this patch is actively broken. groups->index 'group' wide
> and therefore across cpu/cgroup boundaries.
> 
> There is no !cgroup to cgroup hierarchy as this patch seems to assume,
> specifically look at how the merge sort in visit_groups_merge() allows
> cgroup events to be picked before !cgroup events.


No, the patch intends to avoid switch !cgroup during cgroup context 
switch.

In perf_cgroup_switch(), when the cgroup is scheduled out, the current
implementation schedules out everything, including the !cgroup events. I
think that definitely breaks the semantics of !cgroup, aka system-wide,
events.

The patch itself doesn't touch the merge sort in visit_groups_merge().
perf_cgroup_skip_switch() just skips the !cgroup events in sched_in(),
because the !cgroup events were never scheduled out and we don't want to
schedule them in again.
The cgroup events must come after the !cgroup events, since the !cgroup
events are never switched.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch
  2019-11-14 13:46     ` Liang, Kan
@ 2019-11-14 13:57       ` Peter Zijlstra
  2019-11-14 15:16         ` Liang, Kan
  0 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2019-11-14 13:57 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Ian Rogers, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, linux-kernel,
	Stephane Eranian, Andi Kleen

On Thu, Nov 14, 2019 at 08:46:51AM -0500, Liang, Kan wrote:
> 
> 
> On 11/14/2019 5:43 AM, Peter Zijlstra wrote:
> > On Wed, Nov 13, 2019 at 04:30:42PM -0800, Ian Rogers wrote:
> > > From: Kan Liang <kan.liang@linux.intel.com>
> > > 
> > > When counting system-wide events and cgroup events simultaneously, the
> > > system-wide events are always scheduled out then back in during cgroup
> > > switches, bringing extra overhead and possibly missing events. Switching
> > > out system wide flexible events may be necessary if the scheduled in
> > > task's cgroups have pinned events that need to be scheduled in at a higher
> > > priority than the system wide flexible events.
> > 
> > I'm thinking this patch is actively broken. groups->index 'group' wide
> > and therefore across cpu/cgroup boundaries.
> > 
> > There is no !cgroup to cgroup hierarchy as this patch seems to assume,
> > specifically look at how the merge sort in visit_groups_merge() allows
> > cgroup events to be picked before !cgroup events.
> 
> 
> No, the patch intends to avoid switch !cgroup during cgroup context switch.

Which is wrong.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch
  2019-11-14 13:57       ` Peter Zijlstra
@ 2019-11-14 15:16         ` Liang, Kan
  2019-11-14 15:24           ` Liang, Kan
  0 siblings, 1 reply; 80+ messages in thread
From: Liang, Kan @ 2019-11-14 15:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ian Rogers, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, linux-kernel,
	Stephane Eranian, Andi Kleen



On 11/14/2019 8:57 AM, Peter Zijlstra wrote:
> On Thu, Nov 14, 2019 at 08:46:51AM -0500, Liang, Kan wrote:
>>
>>
>> On 11/14/2019 5:43 AM, Peter Zijlstra wrote:
>>> On Wed, Nov 13, 2019 at 04:30:42PM -0800, Ian Rogers wrote:
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> When counting system-wide events and cgroup events simultaneously, the
>>>> system-wide events are always scheduled out then back in during cgroup
>>>> switches, bringing extra overhead and possibly missing events. Switching
>>>> out system wide flexible events may be necessary if the scheduled in
>>>> task's cgroups have pinned events that need to be scheduled in at a higher
>>>> priority than the system wide flexible events.
>>>
>>> I'm thinking this patch is actively broken. groups->index 'group' wide
>>> and therefore across cpu/cgroup boundaries.
>>>
>>> There is no !cgroup to cgroup hierarchy as this patch seems to assume,
>>> specifically look at how the merge sort in visit_groups_merge() allows
>>> cgroup events to be picked before !cgroup events.
>>
>>
>> No, the patch intends to avoid switch !cgroup during cgroup context switch.
> 
> Which is wrong.
> 
Why we want to switch !cgroup system-wide event in context switch?

How should current perf handle this case?
For example,
User A: perf stat -e cycles -G cgroup1
User B: perf stat -e instructions -a

There is only one cpuctx for each CPU, so both cycles and instructions
are tracked in the flexible_active list.
When user A's cgroup is switched out, the cgroup context switch schedules
out everything, including both cycles and instructions.
It seems that we will never switch the instructions event back in for user B.


Thanks,
Kan

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch
  2019-11-14 15:16         ` Liang, Kan
@ 2019-11-14 15:24           ` Liang, Kan
  2019-11-14 20:49             ` Liang, Kan
  0 siblings, 1 reply; 80+ messages in thread
From: Liang, Kan @ 2019-11-14 15:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ian Rogers, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, linux-kernel,
	Stephane Eranian, Andi Kleen



On 11/14/2019 10:16 AM, Liang, Kan wrote:
> 
> 
> On 11/14/2019 8:57 AM, Peter Zijlstra wrote:
>> On Thu, Nov 14, 2019 at 08:46:51AM -0500, Liang, Kan wrote:
>>>
>>>
>>> On 11/14/2019 5:43 AM, Peter Zijlstra wrote:
>>>> On Wed, Nov 13, 2019 at 04:30:42PM -0800, Ian Rogers wrote:
>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>
>>>>> When counting system-wide events and cgroup events simultaneously, the
>>>>> system-wide events are always scheduled out then back in during cgroup
>>>>> switches, bringing extra overhead and possibly missing events. 
>>>>> Switching
>>>>> out system wide flexible events may be necessary if the scheduled in
>>>>> task's cgroups have pinned events that need to be scheduled in at a 
>>>>> higher
>>>>> priority than the system wide flexible events.
>>>>
>>>> I'm thinking this patch is actively broken. groups->index 'group' wide
>>>> and therefore across cpu/cgroup boundaries.
>>>>
>>>> There is no !cgroup to cgroup hierarchy as this patch seems to assume,
>>>> specifically look at how the merge sort in visit_groups_merge() allows
>>>> cgroup events to be picked before !cgroup events.
>>>
>>>
>>> No, the patch intends to avoid switch !cgroup during cgroup context 
>>> switch.
>>
>> Which is wrong.
>>
> Why we want to switch !cgroup system-wide event in context switch?
> 
> How should current perf handle this case?

This is not the right example. I will think about it more.
Sorry for the noise.

Thanks,
Kan

> For example,
> User A: perf stat -e cycles -G cgroup1
> User B: perf stat -e instructions -a
> 
> There is only one cpuctx for each CPU, so both cycles and instructions
> are tracked in the flexible_active list.
> When user A's cgroup is switched out, the cgroup context switch schedules
> out everything, including both cycles and instructions.
> It seems that we will never switch the instructions event back in for user B.





^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 00/10] Optimize cgroup context switch
  2019-11-14 10:45 ` Peter Zijlstra
@ 2019-11-14 18:17   ` Ian Rogers
  2019-12-06 23:16     ` Ian Rogers
  0 siblings, 1 reply; 80+ messages in thread
From: Ian Rogers @ 2019-11-14 18:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang, LKML,
	Stephane Eranian, Andi Kleen

On Thu, Nov 14, 2019 at 2:45 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Nov 13, 2019 at 04:30:32PM -0800, Ian Rogers wrote:
> > Avoid iterating over all per-CPU events during cgroup changing context
> > switches by organizing events by cgroup.
>
> When last we spoke (Plumbers in Lisbon) you mentioned that this
> optimization was yielding far less than expected. You had graphs showing
> how the use of cgroups impacted event scheduling time and how this patch
> set only reduced that a little.
>
> Any update on all that? There seems to be a conspicuous lack of such
> data here.

I'm working on giving an update on the numbers but I suspect they are
better than I'd measured ahead of LPC due to a bug in a script.

Thanks,
Ian

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch
  2019-11-14 15:24           ` Liang, Kan
@ 2019-11-14 20:49             ` Liang, Kan
  0 siblings, 0 replies; 80+ messages in thread
From: Liang, Kan @ 2019-11-14 20:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ian Rogers, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, linux-kernel,
	Stephane Eranian, Andi Kleen



On 11/14/2019 10:24 AM, Liang, Kan wrote:
> 
> 
> On 11/14/2019 10:16 AM, Liang, Kan wrote:
>>
>>
>> On 11/14/2019 8:57 AM, Peter Zijlstra wrote:
>>> On Thu, Nov 14, 2019 at 08:46:51AM -0500, Liang, Kan wrote:
>>>>
>>>>
>>>> On 11/14/2019 5:43 AM, Peter Zijlstra wrote:
>>>>> On Wed, Nov 13, 2019 at 04:30:42PM -0800, Ian Rogers wrote:
>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>
>>>>>> When counting system-wide events and cgroup events simultaneously, 
>>>>>> the
>>>>>> system-wide events are always scheduled out then back in during 
>>>>>> cgroup
>>>>>> switches, bringing extra overhead and possibly missing events. 
>>>>>> Switching
>>>>>> out system wide flexible events may be necessary if the scheduled in
>>>>>> task's cgroups have pinned events that need to be scheduled in at 
>>>>>> a higher
>>>>>> priority than the system wide flexible events.
>>>>>
>>>>> I'm thinking this patch is actively broken. groups->index 'group' wide
>>>>> and therefore across cpu/cgroup boundaries.
>>>>>
>>>>> There is no !cgroup to cgroup hierarchy as this patch seems to assume,
>>>>> specifically look at how the merge sort in visit_groups_merge() allows
>>>>> cgroup events to be picked before !cgroup events.
>>>>
>>>>
>>>> No, the patch intends to avoid switch !cgroup during cgroup context 
>>>> switch.
>>>
>>> Which is wrong.
>>>
>> Why we want to switch !cgroup system-wide event in context switch?
>>
>> How should current perf handle this case?
> 

It seems hard to find a simple case to explain why we should not switch
!cgroup events during a cgroup context switch.

Let me try to explain it using ftrace.

Case 1:
User A does system-wide monitoring for 1 second. No other users.
      #perf stat -e branches -a -- sleep 1

The counter counts between 765531.617703 and 765532.620184.
Everything is collected.

            <...>-59160 [027] d.h. 765531.617697: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
            <...>-59160 [027] d.h. 765531.617701: write_msr: 
MSR_IA32_PMC0(4c1), value 800000000001
            <...>-59160 [027] d.h. 765531.617702: write_msr: 
MSR_P6_EVNTSEL0(186), value 5300c4
            <...>-59160 [027] d.h. 765531.617703: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 70000000f
           <idle>-0     [027] d.h. 765532.620184: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
           <idle>-0     [027] d.h. 765532.620185: write_msr: 
MSR_P6_EVNTSEL0(186), value 1300c4
           <idle>-0     [027] d.h. 765532.620186: rdpmc: 0, value 
80000b3e87a4
           <idle>-0     [027] d.h. 765532.620187: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 70000000f


Case 2:
User A does system-wide monitoring for 1 second.
      #perf stat -e branches -a -- sleep 1
Meanwhile, User B does cgroup monitoring.
      #perf stat -e cycles -G cgroup

User A expects to collect everything from 765580.196521 to
765581.198150, but doesn't.

Because of the cgroup context switches, the system-wide event for user A
stops counting during [765580.213882, 765580.213884],
[765580.213913, 765580.213915], ..., [765580.774304, 765580.774307].

I think this breaks User A's use case.

Furthermore, switching the !cgroup system-wide events also brings extra,
unnecessary overhead.

            <...>-121292 [027] d.h. 765580.196514: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
            <...>-121292 [027] d.h. 765580.196519: write_msr: 
MSR_IA32_PMC0(4c1), value 800000000001
            <...>-121292 [027] d.h. 765580.196520: write_msr: 
MSR_P6_EVNTSEL0(186), value 5300c4
            <...>-121292 [027] d.h. 765580.196521: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 70000000f
           <idle>-0     [027] d... 765580.213878: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
           <idle>-0     [027] d... 765580.213880: write_msr: 
MSR_P6_EVNTSEL0(186), value 1300c4
           <idle>-0     [027] d... 765580.213880: rdpmc: 0, value 
800000357bc1
           <idle>-0     [027] d... 765580.213882: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 70000000f
      simics-poll-25601 [027] d... 765580.213884: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
      simics-poll-25601 [027] d... 765580.213888: write_msr: 
MSR_CORE_PERF_FIXED_CTR1(30a), value 800015820cbe
      simics-poll-25601 [027] d... 765580.213889: read_msr: 
MSR_CORE_PERF_FIXED_CTR_CTRL(38d), value 0
      simics-poll-25601 [027] d... 765580.213890: write_msr: 
MSR_CORE_PERF_FIXED_CTR_CTRL(38d), value b0
      simics-poll-25601 [027] d... 765580.213890: write_msr: 
MSR_IA32_PMC0(4c1), value 800000357bc1
      simics-poll-25601 [027] d... 765580.213891: write_msr: 
MSR_P6_EVNTSEL0(186), value 5300c4
      simics-poll-25601 [027] d... 765580.213892: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 70000000f
      simics-poll-25601 [027] d... 765580.213910: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
      simics-poll-25601 [027] d... 765580.213911: read_msr: 
MSR_CORE_PERF_FIXED_CTR_CTRL(38d), value b0
      simics-poll-25601 [027] d... 765580.213911: write_msr: 
MSR_CORE_PERF_FIXED_CTR_CTRL(38d), value 0
      simics-poll-25601 [027] d... 765580.213911: rdpmc: 40000001, value 
80001582b676
      simics-poll-25601 [027] d... 765580.213912: write_msr: 
MSR_P6_EVNTSEL0(186), value 1300c4
      simics-poll-25601 [027] d... 765580.213913: rdpmc: 0, value 
800000358491
      simics-poll-25601 [027] d... 765580.213913: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 70000000f
           <idle>-0     [027] d... 765580.213915: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
           <idle>-0     [027] d... 765580.213916: write_msr: 
MSR_IA32_PMC0(4c1), value 800000358491
           <idle>-0     [027] d... 765580.213916: write_msr: 
MSR_P6_EVNTSEL0(186), value 5300c4
           <idle>-0     [027] d... 765580.213917: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 70000000f

... ...

      simics-poll-25601 [027] d... 765580.774301: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
      simics-poll-25601 [027] d... 765580.774302: read_msr: 
MSR_CORE_PERF_FIXED_CTR_CTRL(38d), value b0
      simics-poll-25601 [027] d... 765580.774302: write_msr: 
MSR_CORE_PERF_FIXED_CTR_CTRL(38d), value 0
      simics-poll-25601 [027] d... 765580.774302: rdpmc: 40000001, value 
8000165e927b
      simics-poll-25601 [027] d... 765580.774303: write_msr: 
MSR_P6_EVNTSEL0(186), value 1300c4
      simics-poll-25601 [027] d... 765580.774303: rdpmc: 0, value 
8000059298ce
      simics-poll-25601 [027] d... 765580.774304: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 70000000f
            <...>-135379 [027] d... 765580.774307: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
            <...>-135379 [027] d... 765580.774308: write_msr: 
MSR_IA32_PMC0(4c1), value 8000059298ce
            <...>-135379 [027] d... 765580.774309: write_msr: 
MSR_P6_EVNTSEL0(186), value 5300c4
            <...>-135379 [027] d... 765580.774309: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 70000000f
            <...>-147127 [027] d.h. 765581.198150: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
            <...>-147127 [027] d.h. 765581.198153: write_msr: 
MSR_P6_EVNTSEL0(186), value 1300c4
            <...>-147127 [027] d.h. 765581.198153: rdpmc: 0, value 
80000a573368
            <...>-147127 [027] d.h. 765581.198155: write_msr: 
MSR_CORE_PERF_GLOBAL_CTRL(38f), value 70000000f


Thanks,
Kan






^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v4 00/10] Optimize cgroup context switch
  2019-11-14  0:30 [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
                   ` (11 preceding siblings ...)
  2019-11-14 10:45 ` Peter Zijlstra
@ 2019-11-16  1:18 ` Ian Rogers
  2019-11-16  1:18   ` [PATCH v4 01/10] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
                     ` (10 more replies)
  12 siblings, 11 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-16  1:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Avoid iterating over all per-CPU events during cgroup changing context
switches by organizing events by cgroup.

To make an efficient set of iterators, introduce a min max heap
utility with test.

The v4 patch set addresses review comments on the v3 patch set by
Peter Zijlstra.

These patches include a caching algorithm to improve the search for
the first event in a group by Kan Liang <kan.liang@linux.intel.com> as
well as rebasing his "optimize event_filter_match during sched_in"
from https://lkml.org/lkml/2019/8/7/771.

The v2 patch set was modified by Peter Zijlstra in his perf/cgroup
branch:
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git

These patches follow Peter's reorganization and his fixes to the
perf_cpu_context min_heap storage code.

Ian Rogers (8):
  lib: introduce generic min max heap
  perf: Use min_max_heap in visit_groups_merge
  perf: Add per perf_cpu_context min_heap storage
  perf/cgroup: Grow per perf_cpu_context heap storage
  perf/cgroup: Order events in RB tree by cgroup id
  perf: simplify and rename visit_groups_merge
  perf: cache perf_event_groups_first for cgroups
  perf: optimize event_filter_match during sched_in

Kan Liang (1):
  perf/cgroup: Do not switch system-wide events in cgroup switch

Peter Zijlstra (1):
  perf/cgroup: Reorder perf_cgroup_connect()

 include/linux/min_max_heap.h | 133 +++++++++
 include/linux/perf_event.h   |  15 +
 kernel/events/core.c         | 543 +++++++++++++++++++++++++++--------
 lib/Kconfig.debug            |  10 +
 lib/Makefile                 |   1 +
 lib/test_min_max_heap.c      | 194 +++++++++++++
 6 files changed, 782 insertions(+), 114 deletions(-)
 create mode 100644 include/linux/min_max_heap.h
 create mode 100644 lib/test_min_max_heap.c

-- 
2.24.0.432.g9d3f5f5b63-goog


^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v4 01/10] perf/cgroup: Reorder perf_cgroup_connect()
  2019-11-16  1:18 ` [PATCH v4 " Ian Rogers
@ 2019-11-16  1:18   ` Ian Rogers
  2019-11-16  1:18   ` [PATCH v4 02/10] lib: introduce generic min max heap Ian Rogers
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-16  1:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

From: Peter Zijlstra <peterz@infradead.org>

Move perf_cgroup_connect() after perf_event_alloc(), such that we can
find/use the PMU's cpu context.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index cfd89b4a02d8..0dce28b0aae0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10597,12 +10597,6 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (!has_branch_stack(event))
 		event->attr.branch_sample_type = 0;
 
-	if (cgroup_fd != -1) {
-		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
-		if (err)
-			goto err_ns;
-	}
-
 	pmu = perf_init_event(event);
 	if (IS_ERR(pmu)) {
 		err = PTR_ERR(pmu);
@@ -10615,6 +10609,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		goto err_pmu;
 	}
 
+	if (cgroup_fd != -1) {
+		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
+		if (err)
+			goto err_pmu;
+	}
+
 	err = exclusive_event_init(event);
 	if (err)
 		goto err_pmu;
@@ -10675,12 +10675,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	exclusive_event_destroy(event);
 
 err_pmu:
+	if (is_cgroup_event(event))
+		perf_detach_cgroup(event);
 	if (event->destroy)
 		event->destroy(event);
 	module_put(pmu->module);
 err_ns:
-	if (is_cgroup_event(event))
-		perf_detach_cgroup(event);
 	if (event->ns)
 		put_pid_ns(event->ns);
 	if (event->hw.target)
-- 
2.24.0.432.g9d3f5f5b63-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v4 02/10] lib: introduce generic min max heap
  2019-11-16  1:18 ` [PATCH v4 " Ian Rogers
  2019-11-16  1:18   ` [PATCH v4 01/10] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
@ 2019-11-16  1:18   ` Ian Rogers
  2019-11-21 11:11     ` Joe Perches
  2019-11-16  1:18   ` [PATCH v4 03/10] perf: Use min_max_heap in visit_groups_merge Ian Rogers
                     ` (8 subsequent siblings)
  10 siblings, 1 reply; 80+ messages in thread
From: Ian Rogers @ 2019-11-16  1:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Supports push, pop and converting an array into a heap.
Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Signed-off-by: Ian Rogers <irogers@google.com>
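
For readers new to the interface, a small user-space sketch may help
(illustrative only, not kernel code: plain ints stand in for the
elem_size/cmp/swp callbacks). It mirrors the heapify-all, pop and
pop-push semantics this patch introduces:

  /*
   * User-space mirror of the min/max heap operations; the kernel
   * version differs only in taking opaque elements plus callbacks.
   */
  #include <stdbool.h>
  #include <stdio.h>

  struct heap { int data[16]; int size; };

  static bool less_than(int a, int b) { return a < b; }

  static void sift_down(struct heap *h, int pos)
  {
      for (;;) {
          int smallest = pos;
          int l = 2 * pos + 1, r = 2 * pos + 2;

          if (l < h->size && less_than(h->data[l], h->data[smallest]))
              smallest = l;
          if (r < h->size && less_than(h->data[r], h->data[smallest]))
              smallest = r;
          if (smallest == pos)
              break;
          int tmp = h->data[pos];
          h->data[pos] = h->data[smallest];
          h->data[smallest] = tmp;
          pos = smallest;
      }
  }

  /* Mirrors min_max_heapify_all(): Floyd's O(n) heap construction. */
  static void heapify_all(struct heap *h)
  {
      for (int i = h->size / 2; i >= 0; i--)
          sift_down(h, i);
  }

  /* Mirrors min_max_heap_pop(): last element to the root, sift down. */
  static void pop(struct heap *h)
  {
      h->data[0] = h->data[--h->size];
      sift_down(h, 0);
  }

  /* Mirrors min_max_heap_pop_push(): replace the root, one sift only. */
  static void pop_push(struct heap *h, int v)
  {
      h->data[0] = v;
      sift_down(h, 0);
  }

  int main(void)
  {
      struct heap h = { .data = { 3, 1, 2, 4, 0 }, .size = 5 };

      heapify_all(&h);
      printf("min = %d\n", h.data[0]);  /* 0 */
      pop_push(&h, 7);
      printf("min = %d\n", h.data[0]);  /* 1 */
      pop(&h);
      printf("min = %d\n", h.data[0]);  /* 2 */
      return 0;
  }
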
---
 include/linux/min_max_heap.h | 133 ++++++++++++++++++++++++
 lib/Kconfig.debug            |  10 ++
 lib/Makefile                 |   1 +
 lib/test_min_max_heap.c      | 194 +++++++++++++++++++++++++++++++++++
 4 files changed, 338 insertions(+)
 create mode 100644 include/linux/min_max_heap.h
 create mode 100644 lib/test_min_max_heap.c

diff --git a/include/linux/min_max_heap.h b/include/linux/min_max_heap.h
new file mode 100644
index 000000000000..e4db94bd89d6
--- /dev/null
+++ b/include/linux/min_max_heap.h
@@ -0,0 +1,133 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MIN_MAX_HEAP_H
+#define _LINUX_MIN_MAX_HEAP_H
+
+#include <linux/bug.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+/**
+ * struct min_max_heap - Data structure to hold a min or max heap.
+ * @data: Start of array holding the heap elements.
+ * @size: Number of elements currently in the heap.
+ * @cap: Maximum number of elements that can be held in current storage.
+ */
+struct min_max_heap {
+	void *data;
+	int size;
+	int cap;
+};
+
+/**
+ * struct min_max_heap_callbacks - Data/functions to customise the min_max_heap.
+ * @elem_size: The size of each element in bytes.
+ * @cmp: Partial order function for this heap 'less'/'<' for min-heap,
+ *       'greater'/'>' for max-heap.
+ * @swp: Swap elements function.
+ */
+struct min_max_heap_callbacks {
+	int elem_size;
+	bool (*cmp)(const void *lhs, const void *rhs);
+	void (*swp)(void *lhs, void *rhs);
+};
+
+/* Sift the element at pos down the heap. */
+static inline void min_max_heapify(struct min_max_heap *heap, int pos,
+				const struct min_max_heap_callbacks *func)
+{
+	void *left_child, *right_child, *parent, *large_or_smallest;
+	u8 *data = (u8 *)heap->data;
+
+	for (;;) {
+		if (pos * 2 + 1 >= heap->size)
+			break;
+
+		left_child = data + ((pos * 2 + 1) * func->elem_size);
+		parent = data + (pos * func->elem_size);
+		large_or_smallest = parent;
+		if (func->cmp(left_child, large_or_smallest))
+			large_or_smallest = left_child;
+
+		if (pos * 2 + 2 < heap->size) {
+			right_child = data + ((pos * 2 + 2) * func->elem_size);
+			if (func->cmp(right_child, large_or_smallest))
+				large_or_smallest = right_child;
+		}
+		if (large_or_smallest == parent)
+			break;
+		func->swp(large_or_smallest, parent);
+		if (large_or_smallest == left_child)
+			pos = (pos * 2) + 1;
+		else
+			pos = (pos * 2) + 2;
+	}
+}
+
+/* Floyd's approach to heapification that is O(size). */
+static inline void
+min_max_heapify_all(struct min_max_heap *heap,
+	const struct min_max_heap_callbacks *func)
+{
+	int i;
+
+	for (i = heap->size / 2; i >= 0; i--)
+		min_max_heapify(heap, i, func);
+}
+
+/* Remove minimum element from the heap, O(log2(size)). */
+static inline void
+min_max_heap_pop(struct min_max_heap *heap,
+		const struct min_max_heap_callbacks *func)
+{
+	u8 *data = (u8 *)heap->data;
+
+	if (WARN_ONCE(heap->size <= 0, "Popping an empty heap"))
+		return;
+
+	/* Place last element at the root (position 0) and then sift down. */
+	heap->size--;
+	memcpy(data, data + (heap->size * func->elem_size), func->elem_size);
+	min_max_heapify(heap, 0, func);
+}
+
+/*
+ * Remove the minimum element and then push the given element. The
+ * implementation performs 1 sift (O(log2(size))) and is therefore more
+ * efficient than a pop followed by a push that does 2.
+ */
+static void min_max_heap_pop_push(struct min_max_heap *heap,
+				const void *element,
+				const struct min_max_heap_callbacks *func)
+{
+	memcpy(heap->data, element, func->elem_size);
+	min_max_heapify(heap, 0, func);
+}
+
+/* Push an element on to the heap, O(log2(size)). */
+static inline void
+min_max_heap_push(struct min_max_heap *heap, const void *element,
+		const struct min_max_heap_callbacks *func)
+{
+	void *child, *parent;
+	int pos;
+	u8 *data = (u8 *)heap->data;
+
+	if (WARN_ONCE(heap->size >= heap->cap, "Pushing on a full heap"))
+		return;
+
+	/* Place at the end of data. */
+	pos = heap->size;
+	memcpy(data + (pos * func->elem_size), element, func->elem_size);
+	heap->size++;
+
+	/* Sift child at pos up. */
+	for (; pos > 0; pos = (pos - 1) / 2) {
+		child = data + (pos * func->elem_size);
+		parent = data + ((pos - 1) / 2) * func->elem_size;
+		if (func->cmp(parent, child))
+			break;
+		func->swp(parent, child);
+	}
+}
+
+#endif /* _LINUX_MIN_MAX_HEAP_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 93d97f9b0157..6a2cf82515eb 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1693,6 +1693,16 @@ config TEST_LIST_SORT
 
 	  If unsure, say N.
 
+config TEST_MIN_MAX_HEAP
+	tristate "Min-max heap test"
+	depends on DEBUG_KERNEL || m
+	help
+	  Enable this to turn on min-max heap function tests. This test is
+	  executed only once during system boot (so affects only boot time),
+	  or at module load time.
+
+	  If unsure, say N.
+
 config TEST_SORT
 	tristate "Array-based sort test"
 	depends on DEBUG_KERNEL || m
diff --git a/lib/Makefile b/lib/Makefile
index c5892807e06f..e73df06adaab 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -67,6 +67,7 @@ CFLAGS_test_ubsan.o += $(call cc-disable-warning, vla)
 UBSAN_SANITIZE_test_ubsan.o := y
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
 obj-$(CONFIG_TEST_LIST_SORT) += test_list_sort.o
+obj-$(CONFIG_TEST_MIN_MAX_HEAP) += test_min_max_heap.o
 obj-$(CONFIG_TEST_LKM) += test_module.o
 obj-$(CONFIG_TEST_VMALLOC) += test_vmalloc.o
 obj-$(CONFIG_TEST_OVERFLOW) += test_overflow.o
diff --git a/lib/test_min_max_heap.c b/lib/test_min_max_heap.c
new file mode 100644
index 000000000000..175def1c2fae
--- /dev/null
+++ b/lib/test_min_max_heap.c
@@ -0,0 +1,194 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define pr_fmt(fmt) "min_max_heap_test: " fmt
+
+/*
+ * Test cases for the min max heap.
+ */
+
+#include <linux/log2.h>
+#include <linux/min_max_heap.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/random.h>
+
+static __init bool less_than(const void *lhs, const void *rhs)
+{
+	return *(int *)lhs < *(int *)rhs;
+}
+
+static __init bool greater_than(const void *lhs, const void *rhs)
+{
+	return *(int *)lhs > *(int *)rhs;
+}
+
+static __init void swap_ints(void *lhs, void *rhs)
+{
+	int temp = *(int *)lhs;
+
+	*(int *)lhs = *(int *)rhs;
+	*(int *)rhs = temp;
+}
+
+static __init int pop_verify_heap(bool min_heap,
+				struct min_max_heap *heap,
+				const struct min_max_heap_callbacks *funcs)
+{
+	int last;
+	int *values = (int *)heap->data;
+	int err = 0;
+
+	last = values[0];
+	min_max_heap_pop(heap, funcs);
+	while (heap->size > 0) {
+		if (min_heap) {
+			if (last > values[0]) {
+				pr_err("error: expected %d <= %d\n", last,
+					values[0]);
+				err++;
+			}
+		} else {
+			if (last < values[0]) {
+				pr_err("error: expected %d >= %d\n", last,
+					values[0]);
+				err++;
+			}
+		}
+		last = values[0];
+		min_max_heap_pop(heap, funcs);
+	}
+	return err;
+}
+
+static __init int test_heapify_all(bool min_heap)
+{
+	int values[] = { 3, 1, 2, 4, 0x8000000, 0x7FFFFFF, 0,
+			 -3, -1, -2, -4, 0x8000000, 0x7FFFFFF };
+	struct min_max_heap heap = {
+		.data = values,
+		.size = ARRAY_SIZE(values),
+		.cap =  ARRAY_SIZE(values),
+	};
+	struct min_max_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.cmp = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, err;
+
+	/* Test with known set of values. */
+	min_max_heapify_all(&heap, &funcs);
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+
+	/* Test with randomly generated values. */
+	heap.size = ARRAY_SIZE(values);
+	for (i = 0; i < heap.size; i++)
+		values[i] = get_random_int();
+
+	min_max_heapify_all(&heap, &funcs);
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static __init int test_heap_push(bool min_heap)
+{
+	const int data[] = { 3, 1, 2, 4, 0x80000000, 0x7FFFFFFF, 0,
+			     -3, -1, -2, -4, 0x80000000, 0x7FFFFFFF };
+	int values[ARRAY_SIZE(data)];
+	struct min_max_heap heap = {
+		.data = values,
+		.size = 0,
+		.cap =  ARRAY_SIZE(values),
+	};
+	struct min_max_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.cmp = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, temp, err;
+
+	/* Test with known set of values copied from data. */
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_max_heap_push(&heap, &data[i], &funcs);
+
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+	/* Test with randomly generated values. */
+	while (heap.size < heap.cap) {
+		temp = get_random_int();
+		min_max_heap_push(&heap, &temp, &funcs);
+	}
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static __init int test_heap_pop_push(bool min_heap)
+{
+	const int data[] = { 3, 1, 2, 4, 0x80000000, 0x7FFFFFFF, 0,
+			     -3, -1, -2, -4, 0x80000000, 0x7FFFFFFF };
+	int values[ARRAY_SIZE(data)];
+	struct min_max_heap heap = {
+		.data = values,
+		.size = 0,
+		.cap =  ARRAY_SIZE(values),
+	};
+	struct min_max_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.cmp = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, temp, err;
+
+	/* Fill values with data to pop and replace. */
+	temp = min_heap ? 0x80000000 : 0x7FFFFFFF;
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_max_heap_push(&heap, &temp, &funcs);
+
+	/* Test with known set of values copied from data. */
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_max_heap_pop_push(&heap, &data[i], &funcs);
+
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+	heap.size = 0;
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_max_heap_push(&heap, &temp, &funcs);
+
+	/* Test with randomly generated values. */
+	for (i = 0; i < ARRAY_SIZE(data); i++) {
+		temp = get_random_int();
+		min_max_heap_pop_push(&heap, &temp, &funcs);
+	}
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static int __init test_min_max_heap_init(void)
+{
+	int err = 0;
+
+	err += test_heapify_all(true);
+	err += test_heapify_all(false);
+	err += test_heap_push(true);
+	err += test_heap_push(false);
+	err += test_heap_pop_push(true);
+	err += test_heap_pop_push(false);
+	if (err) {
+		pr_err("test failed with %d errors\n", err);
+		return -EINVAL;
+	}
+	pr_info("test passed\n");
+	return 0;
+}
+module_init(test_min_max_heap_init);
+
+static void __exit test_min_max_heap_exit(void)
+{
+	/* do nothing */
+}
+module_exit(test_min_max_heap_exit);
+
+MODULE_LICENSE("GPL");
-- 
2.24.0.432.g9d3f5f5b63-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v4 03/10] perf: Use min_max_heap in visit_groups_merge
  2019-11-16  1:18 ` [PATCH v4 " Ian Rogers
  2019-11-16  1:18   ` [PATCH v4 01/10] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
  2019-11-16  1:18   ` [PATCH v4 02/10] lib: introduce generic min max heap Ian Rogers
@ 2019-11-16  1:18   ` Ian Rogers
  2019-11-16  1:18   ` [PATCH v4 04/10] perf: Add per perf_cpu_context min_heap storage Ian Rogers
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-16  1:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

visit_groups_merge will pick the next event based on when it was
inserted into the context (perf_event group_index). Events may be per-CPU
or for any CPU, but in the future we'd also like to have per-cgroup events
to avoid searching all events for the ones to schedule for a cgroup.
Introduce a min heap for the events that maintains a property that the
earliest inserted event is always at the 0th element. Initialize the heap
with per-CPU and any-CPU events for the context.
Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Signed-off-by: Ian Rogers <irogers@google.com>
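
The merge is easiest to see in a small user-space sketch (hypothetical
names; plain ints stand in for perf_event::group_index). It merges an
any-CPU list and a this-CPU list through a two-entry min-heap of
iterators, the same shape visit_groups_merge() now has:

  /* User-space sketch only; not the kernel code. */
  #include <stdio.h>

  struct iter { const int *pos, *end; };   /* one group-list iterator */

  static void sift_down(struct iter *heap, int size)
  {
      int pos = 0;

      for (;;) {
          int s = pos, l = 2 * pos + 1, r = 2 * pos + 2;

          if (l < size && *heap[l].pos < *heap[s].pos)
              s = l;
          if (r < size && *heap[r].pos < *heap[s].pos)
              s = r;
          if (s == pos)
              break;
          struct iter tmp = heap[pos];
          heap[pos] = heap[s];
          heap[s] = tmp;
          pos = s;
      }
  }

  int main(void)
  {
      /* group_index values, already in insertion order per list. */
      static const int any_cpu[]  = { 1, 4, 9 };
      static const int this_cpu[] = { 2, 3, 7 };
      struct iter heap[2] = {
          { any_cpu,  any_cpu  + 3 },
          { this_cpu, this_cpu + 3 },
      };
      int size = 2;

      sift_down(heap, size);        /* heapify: root is smallest index */

      while (size) {
          printf("visit group_index %d\n", *heap[0].pos);  /* func(*evt) */

          if (++heap[0].pos < heap[0].end) {
              sift_down(heap, size);        /* pop_push: root advanced */
          } else {
              heap[0] = heap[--size];       /* pop: iterator exhausted */
              sift_down(heap, size);
          }
      }
      return 0;          /* visits 1 2 3 4 7 9, i.e. insertion order */
  }
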
---
 kernel/events/core.c | 72 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 54 insertions(+), 18 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0dce28b0aae0..b0e89a488e3d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,6 +49,7 @@
 #include <linux/sched/mm.h>
 #include <linux/proc_ns.h>
 #include <linux/mount.h>
+#include <linux/min_max_heap.h>
 
 #include "internal.h"
 
@@ -3372,32 +3373,67 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
 	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
 }
 
-static int visit_groups_merge(struct perf_event_groups *groups, int cpu,
-			      int (*func)(struct perf_event *, void *), void *data)
+static bool perf_cmp_group_idx(const void *l, const void *r)
 {
-	struct perf_event **evt, *evt1, *evt2;
+	const struct perf_event *le = l, *re = r;
+
+	return le->group_index < re->group_index;
+}
+
+static void swap_ptr(void *l, void *r)
+{
+	void **lp = l, **rp = r;
+
+	swap(*lp, *rp);
+}
+
+static const struct min_max_heap_callbacks perf_min_heap = {
+	.elem_size = sizeof(struct perf_event *),
+	.cmp = perf_cmp_group_idx,
+	.swp = swap_ptr,
+};
+
+static void __heap_add(struct min_max_heap *heap, struct perf_event *event)
+{
+	struct perf_event **itrs = heap->data;
+
+	if (event) {
+		itrs[heap->size] = event;
+		heap->size++;
+	}
+}
+
+static noinline int visit_groups_merge(struct perf_event_groups *groups,
+				int cpu,
+				int (*func)(struct perf_event *, void *),
+				void *data)
+{
+	/* Space for per CPU and/or any CPU event iterators. */
+	struct perf_event *itrs[2];
+	struct min_max_heap event_heap = {
+		.data = itrs,
+		.size = 0,
+		.cap = ARRAY_SIZE(itrs),
+	};
+	struct perf_event *next;
 	int ret;
 
-	evt1 = perf_event_groups_first(groups, -1);
-	evt2 = perf_event_groups_first(groups, cpu);
+	__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));
 
-	while (evt1 || evt2) {
-		if (evt1 && evt2) {
-			if (evt1->group_index < evt2->group_index)
-				evt = &evt1;
-			else
-				evt = &evt2;
-		} else if (evt1) {
-			evt = &evt1;
-		} else {
-			evt = &evt2;
-		}
+	min_max_heapify_all(&event_heap, &perf_min_heap);
 
-		ret = func(*evt, data);
+	while (event_heap.size) {
+		ret = func(itrs[0], data);
 		if (ret)
 			return ret;
 
-		*evt = perf_event_groups_next(*evt);
+		next = perf_event_groups_next(itrs[0]);
+		if (next) {
+			min_max_heap_pop_push(&event_heap, &next,
+					&perf_min_heap);
+		} else
+			min_max_heap_pop(&event_heap, &perf_min_heap);
 	}
 
 	return 0;
-- 
2.24.0.432.g9d3f5f5b63-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v4 04/10] perf: Add per perf_cpu_context min_heap storage
  2019-11-16  1:18 ` [PATCH v4 " Ian Rogers
                     ` (2 preceding siblings ...)
  2019-11-16  1:18   ` [PATCH v4 03/10] perf: Use min_max_heap in visit_groups_merge Ian Rogers
@ 2019-11-16  1:18   ` Ian Rogers
  2019-11-16  1:18   ` [PATCH v4 05/10] perf/cgroup: Grow per perf_cpu_context heap storage Ian Rogers
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-16  1:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

The storage required for visit_groups_merge's min heap needs to vary in
order to support more iterators, such as when multiple nested cgroups'
events are being visited. This change allows for 2 iterators and doesn't
support growth.
Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Signed-off-by: Ian Rogers <irogers@google.com>
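
The layout is the usual small-default-buffer pattern; a user-space
sketch (made-up element type, no locking) of how itr_storage and
itr_default relate:

  /* Sketch only: int * stands in for struct perf_event *. */
  #include <stdio.h>

  struct cpu_ctx {
      int   itr_storage_cap;
      int **itr_storage;       /* normally points at itr_default */
      int  *itr_default[2];    /* room for CPU and any-CPU iterators */
  };

  static void cpu_ctx_init(struct cpu_ctx *c)
  {
      c->itr_storage_cap = 2;
      c->itr_storage = c->itr_default;
  }

  int main(void)
  {
      struct cpu_ctx c;
      int a = 1, b = 2;

      cpu_ctx_init(&c);
      /*
       * Users only ever see the pointer and the capacity, so a later
       * patch can swap in a bigger allocation transparently.
       */
      c.itr_storage[0] = &a;
      c.itr_storage[1] = &b;
      printf("cap=%d first=%d\n", c.itr_storage_cap, *c.itr_storage[0]);
      return 0;
  }
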
---
 include/linux/perf_event.h |  7 ++++++
 kernel/events/core.c       | 46 ++++++++++++++++++++++++++++----------
 2 files changed, 41 insertions(+), 12 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 011dcbdbccc2..b3580afbf358 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -835,6 +835,13 @@ struct perf_cpu_context {
 	int				sched_cb_usage;
 
 	int				online;
+	/*
+	 * Per-CPU storage for iterators used in visit_groups_merge. The default
+	 * storage is of size 2 to hold the CPU and any CPU event iterators.
+	 */
+	int				itr_storage_cap;
+	struct perf_event		**itr_storage;
+	struct perf_event		*itr_default[2];
 };
 
 struct perf_output_handle {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b0e89a488e3d..a1c44d09eff8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3403,32 +3403,45 @@ static void __heap_add(struct min_max_heap *heap, struct perf_event *event)
 	}
 }
 
-static noinline int visit_groups_merge(struct perf_event_groups *groups,
-				int cpu,
+static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
+				struct perf_event_groups *groups, int cpu,
 				int (*func)(struct perf_event *, void *),
 				void *data)
 {
 	/* Space for per CPU and/or any CPU event iterators. */
 	struct perf_event *itrs[2];
-	struct min_max_heap event_heap = {
-		.data = itrs,
-		.size = 0,
-		.cap = ARRAY_SIZE(itrs),
-	};
+	struct min_max_heap event_heap;
+	struct perf_event **evt;
 	struct perf_event *next;
 	int ret;
 
-	__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+	if (cpuctx) {
+		event_heap = (struct min_max_heap){
+			.data = cpuctx->itr_storage,
+			.size = 0,
+			.cap = cpuctx->itr_storage_cap,
+		};
+	} else {
+		event_heap = (struct min_max_heap){
+			.data = itrs,
+			.size = 0,
+			.cap = ARRAY_SIZE(itrs),
+		};
+		/* Events not within a CPU context may be on any CPU. */
+		__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+	}
+	evt = event_heap.data;
+
 	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));
 
 	min_max_heapify_all(&event_heap, &perf_min_heap);
 
 	while (event_heap.size) {
-		ret = func(itrs[0], data);
+		ret = func(*evt, data);
 		if (ret)
 			return ret;
 
-		next = perf_event_groups_next(itrs[0]);
+		next = perf_event_groups_next(*evt);
 		if (next) {
 			min_max_heap_pop_push(&event_heap, &next,
 					&perf_min_heap);
@@ -3503,7 +3516,10 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
 		.can_add_hw = 1,
 	};
 
-	visit_groups_merge(&ctx->pinned_groups,
+	if (ctx != &cpuctx->ctx)
+		cpuctx = NULL;
+
+	visit_groups_merge(cpuctx, &ctx->pinned_groups,
 			   smp_processor_id(),
 			   pinned_sched_in, &sid);
 }
@@ -3518,7 +3534,10 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
 		.can_add_hw = 1,
 	};
 
-	visit_groups_merge(&ctx->flexible_groups,
+	if (ctx != &cpuctx->ctx)
+		cpuctx = NULL;
+
+	visit_groups_merge(cpuctx, &ctx->flexible_groups,
 			   smp_processor_id(),
 			   flexible_sched_in, &sid);
 }
@@ -10185,6 +10204,9 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 		cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
 
 		__perf_mux_hrtimer_init(cpuctx, cpu);
+
+		cpuctx->itr_storage_cap = ARRAY_SIZE(cpuctx->itr_default);
+		cpuctx->itr_storage = cpuctx->itr_default;
 	}
 
 got_cpu_context:
-- 
2.24.0.432.g9d3f5f5b63-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v4 05/10] perf/cgroup: Grow per perf_cpu_context heap storage
  2019-11-16  1:18 ` [PATCH v4 " Ian Rogers
                     ` (3 preceding siblings ...)
  2019-11-16  1:18   ` [PATCH v4 04/10] perf: Add per perf_cpu_context min_heap storage Ian Rogers
@ 2019-11-16  1:18   ` Ian Rogers
  2019-11-16  1:18   ` [PATCH v4 06/10] perf/cgroup: Order events in RB tree by cgroup id Ian Rogers
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-16  1:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Allow the per-CPU min heap storage to have sufficient space for per-cgroup
iterators.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
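
A user-space sketch of just the sizing rule (hypothetical types): one
iterator per cgroup on the path to the root, plus one for events with
no cgroup, matching the counting loop in perf_cgroup_ensure_storage()
below:

  #include <stdio.h>

  struct css { struct css *parent; };

  static int iterators_needed(struct css *css)
  {
      int cap = 1;                     /* events with no cgroup */

      for (; css; css = css->parent)
          cap++;                       /* one per (nested) cgroup level */
      return cap;
  }

  int main(void)
  {
      /* root <- A <- B <- C, i.e. an event in cgroup /A/B/C */
      struct css root = { NULL }, a = { &root }, b = { &a }, c = { &b };

      printf("need %d iterators\n", iterators_needed(&c));   /* 5 */
      return 0;
  }
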
---
 kernel/events/core.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index a1c44d09eff8..8817c645bef9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -892,6 +892,47 @@ static inline void perf_cgroup_sched_in(struct task_struct *prev,
 	rcu_read_unlock();
 }
 
+static int perf_cgroup_ensure_storage(struct perf_event *event,
+				struct cgroup_subsys_state *css)
+{
+	struct perf_cpu_context *cpuctx;
+	struct perf_event **storage;
+	int cpu, itr_cap, ret = 0;
+
+	/*
+	 * Allow storage to have sufficient space for an iterator for each
+	 * possibly nested cgroup plus an iterator for events with no cgroup.
+	 */
+	for (itr_cap = 1; css; css = css->parent)
+		itr_cap++;
+
+	for_each_possible_cpu(cpu) {
+		cpuctx = per_cpu_ptr(event->pmu->pmu_cpu_context, cpu);
+		if (itr_cap <= cpuctx->itr_storage_cap)
+			continue;
+
+		storage = kmalloc_node(itr_cap * sizeof(struct perf_event *),
+				       GFP_KERNEL, cpu_to_node(cpu));
+		if (!storage) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		raw_spin_lock_irq(&cpuctx->ctx.lock);
+		if (cpuctx->itr_storage_cap < itr_cap) {
+			swap(cpuctx->itr_storage, storage);
+			if (storage == cpuctx->itr_default)
+				storage = NULL;
+			cpuctx->itr_storage_cap = itr_cap;
+		}
+		raw_spin_unlock_irq(&cpuctx->ctx.lock);
+
+		kfree(storage);
+	}
+
+	return ret;
+}
+
 static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 				      struct perf_event_attr *attr,
 				      struct perf_event *group_leader)
@@ -911,6 +952,10 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 		goto out;
 	}
 
+	ret = perf_cgroup_ensure_storage(event, css);
+	if (ret)
+		goto out;
+
 	cgrp = container_of(css, struct perf_cgroup, css);
 	event->cgrp = cgrp;
 
@@ -3421,6 +3466,8 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 			.size = 0,
 			.cap = cpuctx->itr_storage_cap,
 		};
+
+		lockdep_assert_held(&cpuctx->ctx.lock);
 	} else {
 		event_heap = (struct min_max_heap){
 			.data = itrs,
-- 
2.24.0.432.g9d3f5f5b63-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v4 06/10] perf/cgroup: Order events in RB tree by cgroup id
  2019-11-16  1:18 ` [PATCH v4 " Ian Rogers
                     ` (4 preceding siblings ...)
  2019-11-16  1:18   ` [PATCH v4 05/10] perf/cgroup: Grow per perf_cpu_context heap storage Ian Rogers
@ 2019-11-16  1:18   ` Ian Rogers
  2019-11-16  1:18   ` [PATCH v4 07/10] perf: simplify and rename visit_groups_merge Ian Rogers
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-16  1:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

If one is monitoring 6 events on 20 cgroups the per-CPU RB tree will
hold 120 events. The scheduling in of the events currently iterates
over all events looking to see which events match the task's cgroup or
its cgroup hierarchy. If a task is in 1 cgroup with 6 events, then 114
events are considered unnecessarily.

This change orders events in the RB tree by cgroup id if it is present.
This means scheduling in may go directly to events associated with the
task's cgroup if one is present. The per-CPU iterator storage in
visit_groups_merge is sized sufficiently for an iterator per cgroup depth,
where different iterators are needed for the task's cgroup and parent
cgroups. By considering the set of iterators when visiting, the lowest
group_index event may be selected and the insertion order group_index
property is maintained. This also allows event rotation to function
correctly, as although events are grouped into a cgroup, rotation always
selects the lowest group_index event to rotate (delete/insert into the
tree) and the min heap of iterators makes it so that the group_index order
is maintained.

Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/20190724223746.153620-3-irogers@google.com
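
The resulting sort key is (cpu, cgroup id, group_index), with events
that have no cgroup ordered before cgroup events on the same CPU. A
user-space sketch (illustrative struct, qsort instead of the rbtree)
shows the clustering this produces:

  #include <stdio.h>
  #include <stdlib.h>

  struct ev { int cpu; int cgrp_id; /* 0 == no cgroup */ int group_index; };

  static int cmp(const void *l, const void *r)
  {
      const struct ev *a = l, *b = r;

      if (a->cpu != b->cpu)
          return a->cpu - b->cpu;
      if (a->cgrp_id != b->cgrp_id)    /* no-cgroup (0) sorts first */
          return a->cgrp_id - b->cgrp_id;
      return a->group_index - b->group_index;
  }

  int main(void)
  {
      struct ev evs[] = {
          { 0, 7, 3 }, { 0, 0, 1 }, { 1, 7, 5 }, { 0, 7, 2 }, { 0, 4, 4 },
      };
      int n = sizeof(evs) / sizeof(evs[0]);

      qsort(evs, n, sizeof(evs[0]), cmp);
      /*
       * Events of one cgroup on one CPU are now adjacent, in
       * group_index (insertion) order.
       */
      for (int i = 0; i < n; i++)
          printf("cpu=%d cgrp=%d idx=%d\n",
                 evs[i].cpu, evs[i].cgrp_id, evs[i].group_index);
      return 0;
  }
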
---
 kernel/events/core.c | 97 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 87 insertions(+), 10 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8817c645bef9..f26871ef352a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1576,6 +1576,30 @@ perf_event_groups_less(struct perf_event *left, struct perf_event *right)
 	if (left->cpu > right->cpu)
 		return false;
 
+#ifdef CONFIG_CGROUP_PERF
+	if (left->cgrp != right->cgrp) {
+		if (!left->cgrp || !left->cgrp->css.cgroup) {
+			/*
+			 * Left has no cgroup but right does, no cgroups come
+			 * first.
+			 */
+			return true;
+		}
+		if (!right->cgrp || !right->cgrp->css.cgroup) {
+			/*
+			 * Right has no cgroup but left does, no cgroups come
+			 * first.
+			 */
+			return false;
+		}
+		/* Two dissimilar cgroups, order by id. */
+		if (left->cgrp->css.cgroup->id < right->cgrp->css.cgroup->id)
+			return true;
+
+		return false;
+	}
+#endif
+
 	if (left->group_index < right->group_index)
 		return true;
 	if (left->group_index > right->group_index)
@@ -1655,25 +1679,48 @@ del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
 }
 
 /*
- * Get the leftmost event in the @cpu subtree.
+ * Get the leftmost event in the cpu/cgroup subtree.
  */
 static struct perf_event *
-perf_event_groups_first(struct perf_event_groups *groups, int cpu)
+perf_event_groups_first(struct perf_event_groups *groups, int cpu,
+			struct cgroup *cgrp)
 {
 	struct perf_event *node_event = NULL, *match = NULL;
 	struct rb_node *node = groups->tree.rb_node;
+#ifdef CONFIG_CGROUP_PERF
+	int node_cgrp_id, cgrp_id = 0;
+
+	if (cgrp)
+		cgrp_id = cgrp->id;
+#endif
 
 	while (node) {
 		node_event = container_of(node, struct perf_event, group_node);
 
 		if (cpu < node_event->cpu) {
 			node = node->rb_left;
-		} else if (cpu > node_event->cpu) {
+			continue;
+		}
+		if (cpu > node_event->cpu) {
 			node = node->rb_right;
-		} else {
-			match = node_event;
+			continue;
+		}
+#ifdef CONFIG_CGROUP_PERF
+		node_cgrp_id = 0;
+		if (node_event->cgrp && node_event->cgrp->css.cgroup)
+			node_cgrp_id = node_event->cgrp->css.cgroup->id;
+
+		if (cgrp_id < node_cgrp_id) {
 			node = node->rb_left;
+			continue;
+		}
+		if (cgrp_id > node_cgrp_id) {
+			node = node->rb_right;
+			continue;
 		}
+#endif
+		match = node_event;
+		node = node->rb_left;
 	}
 
 	return match;
@@ -1686,12 +1733,26 @@ static struct perf_event *
 perf_event_groups_next(struct perf_event *event)
 {
 	struct perf_event *next;
+#ifdef CONFIG_CGROUP_PERF
+	int curr_cgrp_id = 0;
+	int next_cgrp_id = 0;
+#endif
 
 	next = rb_entry_safe(rb_next(&event->group_node), typeof(*event), group_node);
-	if (next && next->cpu == event->cpu)
-		return next;
+	if (next == NULL || next->cpu != event->cpu)
+		return NULL;
 
-	return NULL;
+#ifdef CONFIG_CGROUP_PERF
+	if (event->cgrp && event->cgrp->css.cgroup)
+		curr_cgrp_id = event->cgrp->css.cgroup->id;
+
+	if (next->cgrp && next->cgrp->css.cgroup)
+		next_cgrp_id = next->cgrp->css.cgroup->id;
+
+	if (curr_cgrp_id != next_cgrp_id)
+		return NULL;
+#endif
+	return next;
 }
 
 /*
@@ -3453,6 +3514,9 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 				int (*func)(struct perf_event *, void *),
 				void *data)
 {
+#ifdef CONFIG_CGROUP_PERF
+	struct cgroup_subsys_state *css = NULL;
+#endif
 	/* Space for per CPU and/or any CPU event iterators. */
 	struct perf_event *itrs[2];
 	struct min_max_heap event_heap;
@@ -3468,6 +3532,11 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 		};
 
 		lockdep_assert_held(&cpuctx->ctx.lock);
+
+#ifdef CONFIG_CGROUP_PERF
+		if (cpuctx->cgrp)
+			css = &cpuctx->cgrp->css;
+#endif
 	} else {
 		event_heap = (struct min_max_heap){
 			.data = itrs,
@@ -3475,11 +3544,19 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 			.cap = ARRAY_SIZE(itrs),
 		};
 		/* Events not within a CPU context may be on any CPU. */
-		__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+		__heap_add(&event_heap, perf_event_groups_first(groups, -1,
+									NULL));
 	}
 	evt = event_heap.data;
 
-	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));
+	__heap_add(&event_heap, perf_event_groups_first(groups, cpu, NULL));
+
+#ifdef CONFIG_CGROUP_PERF
+	for (; css; css = css->parent) {
+		__heap_add(&event_heap, perf_event_groups_first(groups, cpu,
+								css->cgroup));
+	}
+#endif
 
 	min_max_heapify_all(&event_heap, &perf_min_heap);
 
-- 
2.24.0.432.g9d3f5f5b63-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v4 07/10] perf: simplify and rename visit_groups_merge
  2019-11-16  1:18 ` [PATCH v4 " Ian Rogers
                     ` (5 preceding siblings ...)
  2019-11-16  1:18   ` [PATCH v4 06/10] perf/cgroup: Order events in RB tree by cgroup id Ian Rogers
@ 2019-11-16  1:18   ` Ian Rogers
  2019-11-16  1:18   ` [PATCH v4 08/10] perf: cache perf_event_groups_first for cgroups Ian Rogers
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-16  1:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

To enable a future caching optimization, pass in whether
visit_groups_merge is operating on pinned or flexible groups. The
is_pinned argument makes the func argument redundant, so rename the
function to ctx_groups_sched_in as it just schedules pinned or flexible
groups in. Compute the cpu and groups arguments locally to reduce the
argument list size. Remove sched_in_data as it repeats arguments already
passed in. Merge pinned_sched_in and flexible_sched_in and use the
pinned argument to determine the active list.

Signed-off-by: Ian Rogers <irogers@google.com>
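
A user-space sketch of the shape of the change only (made-up types):
the pinned/flexible difference reduces to which active list a newly
scheduled group joins and whether an unschedulable group is put in
error, so a single is_pinned flag replaces the two callbacks:

  #include <stdbool.h>
  #include <stdio.h>

  enum state { STATE_ERROR = -1, STATE_INACTIVE, STATE_ACTIVE };

  struct event { const char *name; enum state state; };

  struct ctx {
      int nr_pinned_active;       /* stands in for ctx->pinned_active */
      int nr_flexible_active;     /* stands in for ctx->flexible_active */
  };

  static void merge_sched_in(struct ctx *ctx, struct event *e,
                             bool is_pinned, int *can_add_hw)
  {
      if (*can_add_hw) {                  /* group_sched_in() succeeded */
          e->state = STATE_ACTIVE;
          if (is_pinned)
              ctx->nr_pinned_active++;
          else
              ctx->nr_flexible_active++;
          return;
      }
      /* Could not go on: pinned groups are an error, PMU is full. */
      if (is_pinned)
          e->state = STATE_ERROR;
      *can_add_hw = 0;
  }

  int main(void)
  {
      struct ctx ctx = { 0, 0 };
      struct event a = { "pinned-a",   STATE_INACTIVE };
      struct event b = { "flexible-b", STATE_INACTIVE };
      int can_add_hw = 1;

      merge_sched_in(&ctx, &a, true, &can_add_hw);
      can_add_hw = 0;                    /* pretend the PMU filled up */
      merge_sched_in(&ctx, &b, false, &can_add_hw);

      printf("pinned_active=%d flexible_active=%d b.state=%d\n",
             ctx.nr_pinned_active, ctx.nr_flexible_active, b.state);
      return 0;
  }
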
---
 kernel/events/core.c | 149 ++++++++++++++-----------------------------
 1 file changed, 49 insertions(+), 100 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f26871ef352a..948a66967eb5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2118,7 +2118,6 @@ static void perf_group_detach(struct perf_event *event)
 
 		if (!RB_EMPTY_NODE(&event->group_node)) {
 			add_event_to_groups(sibling, event->ctx);
-
 			if (sibling->state == PERF_EVENT_STATE_ACTIVE) {
 				struct list_head *list = sibling->attr.pinned ?
 					&ctx->pinned_active : &ctx->flexible_active;
@@ -2441,6 +2440,8 @@ event_sched_in(struct perf_event *event,
 {
 	int ret = 0;
 
+	WARN_ON_ONCE(event->ctx != ctx);
+
 	lockdep_assert_held(&ctx->lock);
 
 	if (event->state <= PERF_EVENT_STATE_OFF)
@@ -3509,10 +3510,42 @@ static void __heap_add(struct min_max_heap *heap, struct perf_event *event)
 	}
 }
 
-static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
-				struct perf_event_groups *groups, int cpu,
-				int (*func)(struct perf_event *, void *),
-				void *data)
+static int merge_sched_in(struct perf_event_context *ctx,
+			struct perf_cpu_context *cpuctx,
+			struct perf_event *event,
+			bool is_pinned,
+			int *can_add_hw)
+{
+	WARN_ON_ONCE(event->ctx != ctx);
+
+	if (event->state <= PERF_EVENT_STATE_OFF)
+		return 0;
+
+	if (!event_filter_match(event))
+		return 0;
+
+	if (group_can_go_on(event, cpuctx, 1)) {
+		if (!group_sched_in(event, cpuctx, ctx)) {
+			list_add_tail(&event->active_list, is_pinned
+				? &ctx->pinned_active
+				: &ctx->flexible_active);
+		}
+	}
+
+	if (event->state == PERF_EVENT_STATE_INACTIVE) {
+		if (is_pinned)
+			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
+
+		*can_add_hw = 0;
+		ctx->rotate_necessary = 1;
+	}
+
+	return 0;
+}
+
+static int ctx_groups_sched_in(struct perf_event_context *ctx,
+			struct perf_cpu_context *cpuctx,
+			bool is_pinned)
 {
 #ifdef CONFIG_CGROUP_PERF
 	struct cgroup_subsys_state *css = NULL;
@@ -3522,9 +3555,13 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 	struct min_max_heap event_heap;
 	struct perf_event **evt;
 	struct perf_event *next;
-	int ret;
+	int ret, can_add_hw = 1;
+	int cpu = smp_processor_id();
+	struct perf_event_groups *groups = is_pinned
+		? &ctx->pinned_groups
+		: &ctx->flexible_groups;
 
-	if (cpuctx) {
+	if (ctx == &cpuctx->ctx) {
 		event_heap = (struct min_max_heap){
 			.data = cpuctx->itr_storage,
 			.size = 0,
@@ -3561,7 +3598,8 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 	min_max_heapify_all(&event_heap, &perf_min_heap);
 
 	while (event_heap.size) {
-		ret = func(*evt, data);
+		ret = merge_sched_in(ctx, cpuctx, *evt, is_pinned, &can_add_hw);
+
 		if (ret)
 			return ret;
 
@@ -3576,96 +3614,6 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 	return 0;
 }
 
-struct sched_in_data {
-	struct perf_event_context *ctx;
-	struct perf_cpu_context *cpuctx;
-	int can_add_hw;
-};
-
-static int pinned_sched_in(struct perf_event *event, void *data)
-{
-	struct sched_in_data *sid = data;
-
-	if (event->state <= PERF_EVENT_STATE_OFF)
-		return 0;
-
-	if (!event_filter_match(event))
-		return 0;
-
-	if (group_can_go_on(event, sid->cpuctx, sid->can_add_hw)) {
-		if (!group_sched_in(event, sid->cpuctx, sid->ctx))
-			list_add_tail(&event->active_list, &sid->ctx->pinned_active);
-	}
-
-	/*
-	 * If this pinned group hasn't been scheduled,
-	 * put it in error state.
-	 */
-	if (event->state == PERF_EVENT_STATE_INACTIVE)
-		perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
-
-	return 0;
-}
-
-static int flexible_sched_in(struct perf_event *event, void *data)
-{
-	struct sched_in_data *sid = data;
-
-	if (event->state <= PERF_EVENT_STATE_OFF)
-		return 0;
-
-	if (!event_filter_match(event))
-		return 0;
-
-	if (group_can_go_on(event, sid->cpuctx, sid->can_add_hw)) {
-		int ret = group_sched_in(event, sid->cpuctx, sid->ctx);
-		if (ret) {
-			sid->can_add_hw = 0;
-			sid->ctx->rotate_necessary = 1;
-			return 0;
-		}
-		list_add_tail(&event->active_list, &sid->ctx->flexible_active);
-	}
-
-	return 0;
-}
-
-static void
-ctx_pinned_sched_in(struct perf_event_context *ctx,
-		    struct perf_cpu_context *cpuctx)
-{
-	struct sched_in_data sid = {
-		.ctx = ctx,
-		.cpuctx = cpuctx,
-		.can_add_hw = 1,
-	};
-
-	if (ctx != &cpuctx->ctx)
-		cpuctx = NULL;
-
-	visit_groups_merge(cpuctx, &ctx->pinned_groups,
-			   smp_processor_id(),
-			   pinned_sched_in, &sid);
-}
-
-static void
-ctx_flexible_sched_in(struct perf_event_context *ctx,
-		      struct perf_cpu_context *cpuctx)
-{
-	struct sched_in_data sid = {
-		.ctx = ctx,
-		.cpuctx = cpuctx,
-		.can_add_hw = 1,
-	};
-
-	if (ctx != &cpuctx->ctx)
-		cpuctx = NULL;
-
-	visit_groups_merge(cpuctx, &ctx->flexible_groups,
-			   smp_processor_id(),
-			   flexible_sched_in, &sid);
-}
-
 static void
 ctx_sched_in(struct perf_event_context *ctx,
 	     struct perf_cpu_context *cpuctx,
@@ -3702,11 +3650,12 @@ ctx_sched_in(struct perf_event_context *ctx,
 	 * in order to give them the best chance of going on.
 	 */
 	if (is_active & EVENT_PINNED)
-		ctx_pinned_sched_in(ctx, cpuctx);
+		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/true);
+
 
 	/* Then walk through the lower prio flexible groups */
 	if (is_active & EVENT_FLEXIBLE)
-		ctx_flexible_sched_in(ctx, cpuctx);
+		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/false);
 }
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
-- 
2.24.0.432.g9d3f5f5b63-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v4 08/10] perf: cache perf_event_groups_first for cgroups
  2019-11-16  1:18 ` [PATCH v4 " Ian Rogers
                     ` (6 preceding siblings ...)
  2019-11-16  1:18   ` [PATCH v4 07/10] perf: simplify and rename visit_groups_merge Ian Rogers
@ 2019-11-16  1:18   ` Ian Rogers
  2019-11-16  1:18   ` [PATCH v4 09/10] perf: optimize event_filter_match during sched_in Ian Rogers
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-16  1:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Add a per-CPU cache of the pinned and flexible perf_event_groups_first
value for a cgroup, avoiding O(log(#perf events)) searches during
sched_in.

Based-on-work-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ian Rogers <irogers@google.com>
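
A user-space sketch of the caching idea (hypothetical names; a flat
array and a linear scan stand in for the rbtree and
perf_event_groups_next()):

  #include <stdio.h>

  struct event { int group_index; int live; };

  #define NEVENTS 4
  static struct event events[NEVENTS] = {
      { 30, 1 }, { 10, 1 }, { 40, 1 }, { 20, 1 },
  };
  static struct event *first_cached;   /* mirrors cgrp->pinned_event */

  /* Insert: remember the smallest group_index event seen so far. */
  static void cache_insert(struct event *e)
  {
      if (!first_cached || e->group_index < first_cached->group_index)
          first_cached = e;
  }

  /* Stand-in for walking to the next event in the same group list. */
  static struct event *next_lowest(void)
  {
      struct event *best = NULL;

      for (int i = 0; i < NEVENTS; i++)
          if (events[i].live &&
              (!best || events[i].group_index < best->group_index))
              best = &events[i];
      return best;
  }

  /* Delete: only re-resolve when the cached event itself goes away. */
  static void cache_delete(struct event *e)
  {
      e->live = 0;
      if (first_cached == e)
          first_cached = next_lowest();
  }

  int main(void)
  {
      for (int i = 0; i < NEVENTS; i++)
          cache_insert(&events[i]);
      printf("first: %d\n", first_cached->group_index);  /* 10 */

      cache_delete(&events[1]);                          /* remove 10 */
      printf("first: %d\n", first_cached->group_index);  /* 20 */
      return 0;
  }

With the cache in place, sched_in reads the per-CPU pointer instead of
descending the tree, which is what avoids the search.
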
---
 include/linux/perf_event.h |  7 ++++
 kernel/events/core.c       | 84 ++++++++++++++++++++++++++++++++++----
 2 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b3580afbf358..be3ca69b3f69 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -877,6 +877,13 @@ struct perf_cgroup_info {
 struct perf_cgroup {
 	struct cgroup_subsys_state	css;
 	struct perf_cgroup_info	__percpu *info;
+	/*
+	 * A per-CPU cache of the first (lowest group_index) event in the
+	 * perf_cpu_context's pinned_groups or flexible_groups for this
+	 * cgroup. Avoids an rbtree search during sched_in.
+	 */
+	struct perf_event * __percpu    *pinned_event;
+	struct perf_event * __percpu    *flexible_event;
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 948a66967eb5..37abfca18bd3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1638,6 +1638,27 @@ perf_event_groups_insert(struct perf_event_groups *groups,
 
 	rb_link_node(&event->group_node, parent, node);
 	rb_insert_color(&event->group_node, &groups->tree);
+#ifdef CONFIG_CGROUP_PERF
+	if (is_cgroup_event(event)) {
+		struct perf_event **cgrp_event;
+
+		if (event->attr.pinned) {
+			cgrp_event = per_cpu_ptr(event->cgrp->pinned_event,
+						event->cpu);
+		} else {
+			cgrp_event = per_cpu_ptr(event->cgrp->flexible_event,
+						event->cpu);
+		}
+		/*
+		 * Remember smallest, left-most, group index event. The
+		 * less-than condition is only possible if the group_index
+		 * overflows.
+		 */
+		if (!*cgrp_event ||
+			event->group_index < (*cgrp_event)->group_index)
+			*cgrp_event = event;
+	}
+#endif
 }
 
 /*
@@ -1652,6 +1673,9 @@ add_event_to_groups(struct perf_event *event, struct perf_event_context *ctx)
 	perf_event_groups_insert(groups, event);
 }
 
+static struct perf_event *
+perf_event_groups_next(struct perf_event *event);
+
 /*
  * Delete a group from a tree.
  */
@@ -1662,6 +1686,22 @@ perf_event_groups_delete(struct perf_event_groups *groups,
 	WARN_ON_ONCE(RB_EMPTY_NODE(&event->group_node) ||
 		     RB_EMPTY_ROOT(&groups->tree));
 
+#ifdef CONFIG_CGROUP_PERF
+	if (is_cgroup_event(event)) {
+		struct perf_event **cgrp_event;
+
+		if (event->attr.pinned) {
+			cgrp_event = per_cpu_ptr(event->cgrp->pinned_event,
+						event->cpu);
+		} else {
+			cgrp_event = per_cpu_ptr(event->cgrp->flexible_event,
+						event->cpu);
+		}
+		if (*cgrp_event == event)
+			*cgrp_event = perf_event_groups_next(event);
+	}
+#endif
+
 	rb_erase(&event->group_node, &groups->tree);
 	init_event_group(event);
 }
@@ -1679,7 +1719,8 @@ del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
 }
 
 /*
- * Get the leftmost event in the cpu/cgroup subtree.
+ * Get the leftmost event in the cpu subtree without a cgroup (ie task or
+ * system-wide).
  */
 static struct perf_event *
 perf_event_groups_first(struct perf_event_groups *groups, int cpu,
@@ -3581,8 +3622,8 @@ static int ctx_groups_sched_in(struct perf_event_context *ctx,
 			.cap = ARRAY_SIZE(itrs),
 		};
 		/* Events not within a CPU context may be on any CPU. */
-		__heap_add(&event_heap, perf_event_groups_first(groups, -1,
-									NULL));
+		__heap_add(&event_heap,
+			perf_event_groups_first(groups, -1, NULL));
 	}
 	evt = event_heap.data;
 
@@ -3590,8 +3631,16 @@ static int ctx_groups_sched_in(struct perf_event_context *ctx,
 
 #ifdef CONFIG_CGROUP_PERF
 	for (; css; css = css->parent) {
-		__heap_add(&event_heap, perf_event_groups_first(groups, cpu,
-								css->cgroup));
+		struct perf_cgroup *cgrp;
+
+		/* root cgroup doesn't have events */
+		if (css->id == 1)
+			break;
+
+		cgrp = container_of(css, struct perf_cgroup, css);
+		__heap_add(&event_heap, is_pinned
+			? *per_cpu_ptr(cgrp->pinned_event, cpu)
+			: *per_cpu_ptr(cgrp->flexible_event, cpu));
 	}
 #endif
 
@@ -12493,18 +12542,35 @@ perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return ERR_PTR(-ENOMEM);
 
 	jc->info = alloc_percpu(struct perf_cgroup_info);
-	if (!jc->info) {
-		kfree(jc);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!jc->info)
+		goto free_jc;
+
+	jc->pinned_event = alloc_percpu(struct perf_event *);
+	if (!jc->pinned_event)
+		goto free_jc_info;
+
+	jc->flexible_event = alloc_percpu(struct perf_event *);
+	if (!jc->flexible_event)
+		goto free_jc_pinned;
 
 	return &jc->css;
+
+free_jc_pinned:
+	free_percpu(jc->pinned_event);
+free_jc_info:
+	free_percpu(jc->info);
+free_jc:
+	kfree(jc);
+
+	return ERR_PTR(-ENOMEM);
 }
 
 static void perf_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct perf_cgroup *jc = container_of(css, struct perf_cgroup, css);
 
+	free_percpu(jc->pinned_event);
+	free_percpu(jc->flexible_event);
 	free_percpu(jc->info);
 	kfree(jc);
 }
-- 
2.24.0.432.g9d3f5f5b63-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v4 09/10] perf: optimize event_filter_match during sched_in
  2019-11-16  1:18 ` [PATCH v4 " Ian Rogers
                     ` (7 preceding siblings ...)
  2019-11-16  1:18   ` [PATCH v4 08/10] perf: cache perf_event_groups_first for cgroups Ian Rogers
@ 2019-11-16  1:18   ` Ian Rogers
  2019-11-16  1:18   ` [PATCH v4 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch Ian Rogers
  2019-12-06 23:15   ` [PATCH v5 00/10] Optimize cgroup context switch Ian Rogers
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-16  1:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

The caller has already verified the CPU and cgroup, so call
pmu_filter_match() directly.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 37abfca18bd3..6427b16c95d0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2212,8 +2212,11 @@ static inline int pmu_filter_match(struct perf_event *event)
 static inline int
 event_filter_match(struct perf_event *event)
 {
-	return (event->cpu == -1 || event->cpu == smp_processor_id()) &&
-	       perf_cgroup_match(event) && pmu_filter_match(event);
+	if (event->cpu != -1 && event->cpu != smp_processor_id())
+		return 0;
+	if (!perf_cgroup_match(event))
+		return 0;
+	return pmu_filter_match(event);
 }
 
 static void
@@ -3562,7 +3565,11 @@ static int merge_sched_in(struct perf_event_context *ctx,
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
 
-	if (!event_filter_match(event))
+	/*
+	 * Avoid full event_filter_match as the caller verified the CPU and
+	 * cgroup before calling.
+	 */
+	if (!pmu_filter_match(event))
 		return 0;
 
 	if (group_can_go_on(event, cpuctx, 1)) {
-- 
2.24.0.432.g9d3f5f5b63-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v4 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch
  2019-11-16  1:18 ` [PATCH v4 " Ian Rogers
                     ` (8 preceding siblings ...)
  2019-11-16  1:18   ` [PATCH v4 09/10] perf: optimize event_filter_match during sched_in Ian Rogers
@ 2019-11-16  1:18   ` Ian Rogers
  2019-12-06 23:15   ` [PATCH v5 00/10] Optimize cgroup context switch Ian Rogers
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-16  1:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

From: Kan Liang <kan.liang@linux.intel.com>

When counting system-wide events and cgroup events simultaneously, the
system-wide events are always scheduled out then back in during cgroup
switches, bringing extra overhead and possibly missing events. Switching
out system wide flexible events may be necessary if the scheduled in
task's cgroups have pinned events that need to be scheduled in at a higher
priority than the system wide flexible events.

Here is test with 6 child cgroups (sibling cgroups), 1 parent cgroup
and system-wide events.
A specjbb benchmark is running in each child cgroup.
The perf command is as below.
   perf stat -e cycles,instructions -e cycles,instructions
   -e cycles,instructions -e cycles,instructions
   -e cycles,instructions -e cycles,instructions
   -e cycles,instructions -e cycles,instructions
   -G cgroup1,cgroup1,cgroup2,cgroup2,cgroup3,cgroup3
   -G cgroup4,cgroup4,cgroup5,cgroup5,cgroup6,cgroup6
   -G cgroup_parent,cgroup_parent
   -a -e cycles,instructions -I 1000

The average RT (Response Time) reported from specjbb is
used as key performance metrics. (The lower the better)
                                        RT(us)              Overhead
Baseline (no perf stat):                4286.9
Use cgroup perf, no patches:            4537.1                5.84%
Use cgroup perf, apply the patch:       4440.7                3.59%

Fixes: e5d1367f17ba ("perf: Add cgroup support")
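
A user-space sketch of the new filtering only (the EVENT_CGROUP_* flag
names and values are taken from the patch below, everything else is
simplified): events without a cgroup are left alone when a cgroup
switch asks for cgroup events only:

  #include <stdbool.h>
  #include <stdio.h>

  enum {
      EVENT_CGROUP_FLEXIBLE_ONLY = 0x10,
      EVENT_CGROUP_PINNED_ONLY   = 0x20,
      EVENT_CGROUP_ALL_ONLY      = EVENT_CGROUP_FLEXIBLE_ONLY |
                                   EVENT_CGROUP_PINNED_ONLY,
  };

  struct event { const char *name; bool has_cgrp; bool pinned; };

  /* Same decision as perf_cgroup_skip_switch() in the patch. */
  static bool skip_switch(int event_type, const struct event *e)
  {
      if (e->has_cgrp)
          return false;               /* cgroup events always switch */
      if (e->pinned)
          return event_type & EVENT_CGROUP_PINNED_ONLY;
      return event_type & EVENT_CGROUP_FLEXIBLE_ONLY;
  }

  int main(void)
  {
      const struct event evs[] = {
          { "system-wide flexible", false, false },
          { "cgroup flexible",      true,  false },
          { "system-wide pinned",   false, true  },
      };
      int type = EVENT_CGROUP_ALL_ONLY;   /* a plain cgroup switch */

      for (int i = 0; i < 3; i++)
          printf("%-22s -> %s\n", evs[i].name,
                 skip_switch(type, &evs[i]) ? "leave running" : "switch");
      return 0;
  }
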
---
This patch was rebased based on: https://lkml.org/lkml/2019/8/7/771
with some minor changes to comments made by: Ian Rogers
<irogers@google.com>

Signed-off-by: Ian Rogers <irogers@google.com>
---
 include/linux/perf_event.h |   1 +
 kernel/events/core.c       | 133 ++++++++++++++++++++++++++++++++++---
 2 files changed, 123 insertions(+), 11 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index be3ca69b3f69..887abf387b54 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -877,6 +877,7 @@ struct perf_cgroup_info {
 struct perf_cgroup {
 	struct cgroup_subsys_state	css;
 	struct perf_cgroup_info	__percpu *info;
+	unsigned int			nr_pinned_event;
 	/*
 	 * A cache of the first event with the perf_cpu_context's
 	 * perf_event_context for the first event in pinned_groups or
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6427b16c95d0..d9c3d3280ad9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -362,8 +362,18 @@ enum event_type_t {
 	/* see ctx_resched() for details */
 	EVENT_CPU = 0x8,
 	EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
+
+	/* see perf_cgroup_switch() for details */
+	EVENT_CGROUP_FLEXIBLE_ONLY = 0x10,
+	EVENT_CGROUP_PINNED_ONLY = 0x20,
+	EVENT_CGROUP_ALL_ONLY = EVENT_CGROUP_FLEXIBLE_ONLY |
+				EVENT_CGROUP_PINNED_ONLY,
+
 };
 
+#define CGROUP_PINNED(type)	(type & EVENT_CGROUP_PINNED_ONLY)
+#define CGROUP_FLEXIBLE(type)	(type & EVENT_CGROUP_FLEXIBLE_ONLY)
+
 /*
  * perf_sched_events : >0 events exist
  * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
@@ -668,6 +678,20 @@ perf_event_set_state(struct perf_event *event, enum perf_event_state state)
 
 #ifdef CONFIG_CGROUP_PERF
 
+/* Skip system-wide CPU events if only cgroup events are required. */
+static inline bool
+perf_cgroup_skip_switch(enum event_type_t event_type,
+			struct perf_event *event,
+			bool is_pinned)
+{
+	if (event->cgrp)
+		return 0;
+	if (is_pinned)
+		return !!CGROUP_PINNED(event_type);
+	else
+		return !!CGROUP_FLEXIBLE(event_type);
+}
+
 static inline bool
 perf_cgroup_match(struct perf_event *event)
 {
@@ -694,6 +718,8 @@ perf_cgroup_match(struct perf_event *event)
 
 static inline void perf_detach_cgroup(struct perf_event *event)
 {
+	if (event->attr.pinned)
+		event->cgrp->nr_pinned_event--;
 	css_put(&event->cgrp->css);
 	event->cgrp = NULL;
 }
@@ -781,6 +807,22 @@ perf_cgroup_set_timestamp(struct task_struct *task,
 	}
 }
 
+/* Check if cgroup and its ancestor have pinned events attached */
+static bool
+cgroup_has_pinned_events(struct perf_cgroup *cgrp)
+{
+	struct cgroup_subsys_state *css;
+	struct perf_cgroup *tmp_cgrp;
+
+	for (css = &cgrp->css; css; css = css->parent) {
+		tmp_cgrp = container_of(css, struct perf_cgroup, css);
+		if (tmp_cgrp->nr_pinned_event > 0)
+			return true;
+	}
+
+	return false;
+}
+
 static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
 
 #define PERF_CGROUP_SWOUT	0x1 /* cgroup switch out every event */
@@ -812,7 +854,22 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 		perf_pmu_disable(cpuctx->ctx.pmu);
 
 		if (mode & PERF_CGROUP_SWOUT) {
-			cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+			/*
+			 * The system-wide events and cgroup events share the
+			 * same cpuctx groups. Decide which events to be
+			 * scheduled out based on the types of events:
+			 * - EVENT_FLEXIBLE | EVENT_CGROUP_FLEXIBLE_ONLY:
+			 *   Only switch cgroup events from EVENT_FLEXIBLE
+			 *   groups.
+			 * - EVENT_PINNED | EVENT_CGROUP_PINNED_ONLY:
+			 *   Only switch cgroup events from EVENT_PINNED
+			 *   groups.
+			 * - EVENT_ALL | EVENT_CGROUP_ALL_ONLY:
+			 *   Only switch cgroup events from both EVENT_FLEXIBLE
+			 *   and EVENT_PINNED groups.
+			 */
+			cpu_ctx_sched_out(cpuctx,
+					EVENT_ALL | EVENT_CGROUP_ALL_ONLY);
 			/*
 			 * must not be done before ctxswout due
 			 * to event_filter_match() in event_sched_out()
@@ -831,7 +888,23 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 			 */
 			cpuctx->cgrp = perf_cgroup_from_task(task,
 							     &cpuctx->ctx);
-			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
+
+			/*
+			 * To keep the priority order of cpu pinned then cpu
+			 * flexible, if the new cgroup has pinned events then
+			 * sched out all system-wide flexible events before
+			 * sched in all events.
+			 */
+			if (cgroup_has_pinned_events(cpuctx->cgrp)) {
+				cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+				cpu_ctx_sched_in(cpuctx,
+					EVENT_ALL | EVENT_CGROUP_PINNED_ONLY,
+					task);
+			} else {
+				cpu_ctx_sched_in(cpuctx,
+					EVENT_ALL | EVENT_CGROUP_ALL_ONLY,
+					task);
+			}
 		}
 		perf_pmu_enable(cpuctx->ctx.pmu);
 		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -959,6 +1032,9 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 	cgrp = container_of(css, struct perf_cgroup, css);
 	event->cgrp = cgrp;
 
+	if (event->attr.pinned)
+		cgrp->nr_pinned_event++;
+
 	/*
 	 * all events in a group must monitor
 	 * the same cgroup because a task belongs
@@ -1032,6 +1108,14 @@ list_update_cgroup_event(struct perf_event *event,
 
 #else /* !CONFIG_CGROUP_PERF */
 
+static inline bool
+perf_cgroup_skip_switch(enum event_type_t event_type,
+			struct perf_event *event,
+			bool pinned)
+{
+	return false;
+}
+
 static inline bool
 perf_cgroup_match(struct perf_event *event)
 {
@@ -3221,13 +3305,25 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 
 	perf_pmu_disable(ctx->pmu);
 	if (is_active & EVENT_PINNED) {
-		list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
+		list_for_each_entry_safe(event, tmp, &ctx->pinned_active,
+					active_list) {
+			if (perf_cgroup_skip_switch(event_type, event, true)) {
+				ctx->is_active |= EVENT_PINNED;
+				continue;
+			}
 			group_sched_out(event, cpuctx, ctx);
+		}
 	}
 
 	if (is_active & EVENT_FLEXIBLE) {
-		list_for_each_entry_safe(event, tmp, &ctx->flexible_active, active_list)
+		list_for_each_entry_safe(event, tmp, &ctx->flexible_active,
+					active_list) {
+			if (perf_cgroup_skip_switch(event_type, event, false)) {
+				ctx->is_active |= EVENT_FLEXIBLE;
+				continue;
+			}
 			group_sched_out(event, cpuctx, ctx);
+		}
 	}
 	perf_pmu_enable(ctx->pmu);
 }
@@ -3558,6 +3654,7 @@ static int merge_sched_in(struct perf_event_context *ctx,
 			struct perf_cpu_context *cpuctx,
 			struct perf_event *event,
 			bool is_pinned,
+			enum event_type_t event_type,
 			int *can_add_hw)
 {
 	WARN_ON_ONCE(event->ctx != ctx);
@@ -3565,6 +3662,9 @@ static int merge_sched_in(struct perf_event_context *ctx,
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
 
+	if (perf_cgroup_skip_switch(event_type, event, is_pinned))
+		return 0;
+
 	/*
 	 * Avoid full event_filter_match as the caller verified the CPU and
 	 * cgroup before calling.
@@ -3593,7 +3693,8 @@ static int merge_sched_in(struct perf_event_context *ctx,
 
 static int ctx_groups_sched_in(struct perf_event_context *ctx,
 			struct perf_cpu_context *cpuctx,
-			bool is_pinned)
+			bool is_pinned,
+			enum event_type_t event_type)
 {
 #ifdef CONFIG_CGROUP_PERF
 	struct cgroup_subsys_state *css = NULL;
@@ -3654,7 +3755,8 @@ static int ctx_groups_sched_in(struct perf_event_context *ctx,
 	min_max_heapify_all(&event_heap, &perf_min_heap);
 
 	while (event_heap.size) {
-		ret = merge_sched_in(ctx, cpuctx, *evt, is_pinned, &can_add_hw);
+		ret = merge_sched_in(ctx, cpuctx, *evt, is_pinned, event_type,
+				&can_add_hw);
 
 		if (ret)
 			return ret;
@@ -3676,6 +3778,7 @@ ctx_sched_in(struct perf_event_context *ctx,
 	     enum event_type_t event_type,
 	     struct task_struct *task)
 {
+	enum event_type_t ctx_event_type = event_type & EVENT_ALL;
 	int is_active = ctx->is_active;
 	u64 now;
 
@@ -3684,7 +3787,7 @@ ctx_sched_in(struct perf_event_context *ctx,
 	if (likely(!ctx->nr_events))
 		return;
 
-	ctx->is_active |= (event_type | EVENT_TIME);
+	ctx->is_active |= (ctx_event_type | EVENT_TIME);
 	if (ctx->task) {
 		if (!is_active)
 			cpuctx->task_ctx = ctx;
@@ -3704,14 +3807,22 @@ ctx_sched_in(struct perf_event_context *ctx,
 	/*
 	 * First go through the list and put on any pinned groups
 	 * in order to give them the best chance of going on.
+	 *
+	 * System-wide events may not have been scheduled out for a cgroup
+	 * switch.  Unconditionally call sched_in() for cgroup events and
+	 * it will filter the events.
 	 */
-	if (is_active & EVENT_PINNED)
-		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/true);
+	if ((is_active & EVENT_PINNED) || CGROUP_PINNED(event_type)) {
+		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/true,
+				CGROUP_PINNED(event_type));
+	}
 
 
 	/* Then walk through the lower prio flexible groups */
-	if (is_active & EVENT_FLEXIBLE)
-		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/false);
+	if ((is_active & EVENT_FLEXIBLE) || CGROUP_FLEXIBLE(event_type)) {
+		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/false,
+				CGROUP_FLEXIBLE(event_type));
+	}
 }
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
-- 
2.24.0.432.g9d3f5f5b63-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 04/10] perf: Add per perf_cpu_context min_heap storage
  2019-11-14  9:51   ` Peter Zijlstra
@ 2019-11-16  1:19     ` Ian Rogers
  0 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-16  1:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang, LKML,
	Stephane Eranian, Andi Kleen

On Thu, Nov 14, 2019 at 1:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Nov 13, 2019 at 04:30:36PM -0800, Ian Rogers wrote:
> > +     if (cpuctx) {
> > +             event_heap = (struct min_max_heap){
> > +                     .data = cpuctx->itr_storage,
> > +                     .size = 0,
>
> C guarantees that unnamed fields get to be 0

Agreed, this is kept here to aid readability. Do you feel strongly
about not having this? It appears to be kept elsewhere for clarity
too:
$ grep -r "\..*= 0," arch/ kernel/ tools/|wc -l
2528
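
To make the point concrete, a tiny stand-alone example (the struct is a
stand-in for illustration, not the kernel type):

struct heap_like { void *data; int size; int cap; };

void example(void)
{
	int buf[8];
	/*
	 * Fields not named in a designated initializer are zero-initialized,
	 * so h.size is 0 here even without an explicit .size = 0; spelling
	 * the field out just documents the intent.
	 */
	struct heap_like h = { .data = buf, .cap = 8 };

	(void)h;
}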

> > +                     .cap = cpuctx->itr_storage_cap,
> > +             };
> > +     } else {
> > +             event_heap = (struct min_max_heap){
> > +                     .data = itrs,
> > +                     .size = 0,
>
> idem.
>
> > +                     .cap = ARRAY_SIZE(itrs),
> > +             };
> > +             /* Events not within a CPU context may be on any CPU. */
> > +             __heap_add(&event_heap, perf_event_groups_first(groups, -1));
> > +
>
> suprious whitespace

Done.



> > +     }
> > +     evt = event_heap.data;
> > +
> >       __heap_add(&event_heap, perf_event_groups_first(groups, cpu));

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 07/10] perf: simplify and rename visit_groups_merge
  2019-11-14 10:03   ` Peter Zijlstra
@ 2019-11-16  1:20     ` Ian Rogers
  0 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-11-16  1:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang, LKML,
	Stephane Eranian, Andi Kleen

On Thu, Nov 14, 2019 at 2:03 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Nov 13, 2019 at 04:30:39PM -0800, Ian Rogers wrote:
> > To enable a future caching optimization, pass in whether
> > visit_groups_merge is operating on pinned or flexible groups. The
> > is_pinned argument makes the func argument redundant, rename the
> > function to ctx_groups_sched_in as it just schedules pinned or flexible
> > groups in. Compute the cpu and groups arguments locally to reduce the
> > argument list size. Remove sched_in_data as it repeats arguments already
> > passed in. Remove the unused data argument to pinned_sched_in.
>
> Where did my first two patches go? Why aren't
> {pinned,flexible}_sched_in() merged?

I've merged these 2 patches except for the helper function which is
now trivial with the pinned boolean.

Thanks,
Ian

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 08/10] perf: cache perf_event_groups_first for cgroups
  2019-11-14 10:25   ` Peter Zijlstra
@ 2019-11-16  1:20     ` Ian Rogers
  2019-11-18  8:37       ` Peter Zijlstra
  0 siblings, 1 reply; 80+ messages in thread
From: Ian Rogers @ 2019-11-16  1:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang, LKML,
	Stephane Eranian, Andi Kleen

On Thu, Nov 14, 2019 at 2:25 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Nov 13, 2019 at 04:30:40PM -0800, Ian Rogers wrote:
> > Add a per-CPU cache of the pinned and flexible perf_event_groups_first
> > value for a cgroup avoiding an O(log(#perf events)) searches during
> > sched_in.
> >
> > Based-on-work-by: Kan Liang <kan.liang@linux.intel.com>
> > Signed-off-by: Ian Rogers <irogers@google.com>
> > ---
> >  include/linux/perf_event.h |  6 +++
> >  kernel/events/core.c       | 79 +++++++++++++++++++++++++++-----------
> >  2 files changed, 62 insertions(+), 23 deletions(-)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index b3580afbf358..cfd0b320418c 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -877,6 +877,12 @@ struct perf_cgroup_info {
> >  struct perf_cgroup {
> >       struct cgroup_subsys_state      css;
> >       struct perf_cgroup_info __percpu *info;
> > +     /* A cache of the first event with the perf_cpu_context's
> > +      * perf_event_context for the first event in pinned_groups or
> > +      * flexible_groups. Avoids an rbtree search during sched_in.
> > +      */
>
> Broken comment style.

Done.

> > +     struct perf_event * __percpu    *pinned_event;
> > +     struct perf_event * __percpu    *flexible_event;
>
> Where is the actual storage allocated? There is a conspicuous lack of
> alloc_percpu() in this patch, see for example perf_cgroup_css_alloc()
> which fills out the above @info field.

Apologies, missed from Kan's original patch but was in v2. Added again.

> >  };
> >
> >  /*
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 11594d8bbb2e..9f0febf51d97 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -1638,6 +1638,25 @@ perf_event_groups_insert(struct perf_event_groups *groups,
> >
> >       rb_link_node(&event->group_node, parent, node);
> >       rb_insert_color(&event->group_node, &groups->tree);
> > +#ifdef CONFIG_CGROUP_PERF
> > +     if (is_cgroup_event(event)) {
> > +             struct perf_event **cgrp_event;
> > +
> > +             if (event->attr.pinned)
> > +                     cgrp_event = per_cpu_ptr(event->cgrp->pinned_event,
> > +                                             event->cpu);
> > +             else
> > +                     cgrp_event = per_cpu_ptr(event->cgrp->flexible_event,
> > +                                             event->cpu);
>
> Codingstyle requires { } here (or just bust the line length a little).

Done.

> > +             /*
> > +              * Cgroup events for the same cgroup on the same CPU will
> > +              * always be inserted at the right because of bigger
> > +              * @groups->index. Only need to set *cgrp_event when it's NULL.
> > +              */
> > +             if (!*cgrp_event)
> > +                     *cgrp_event = event;
>
> I would feel much better if you had some actual leftmost logic in the
> insertion iteration.

Done. Also altered the comment to address the possibility of overflow.
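
To spell out the leftmost logic, here is a stand-alone sketch of the idea
(toy node type and names, not the actual perf rbtree code):

#include <stdbool.h>
#include <stddef.h>

struct node {
	int key;
	struct node *left, *right;
};

/*
 * Walk down as for a normal BST insert, remembering whether we ever took
 * a right turn; only a node that stayed leftmost may replace the cached
 * "first" pointer.
 */
static void insert_cached(struct node **root, struct node *new,
			  struct node **cached_first)
{
	struct node **link = root;
	bool leftmost = true;

	while (*link) {
		if (new->key < (*link)->key) {
			link = &(*link)->left;
		} else {
			link = &(*link)->right;
			leftmost = false;
		}
	}
	new->left = NULL;
	new->right = NULL;
	*link = new;
	if (leftmost)
		*cached_first = new;
}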

> > +     }
> > +#endif
> >  }
> >
> >  /*
> > @@ -1652,6 +1671,9 @@ add_event_to_groups(struct perf_event *event, struct perf_event_context *ctx)
> >       perf_event_groups_insert(groups, event);
> >  }
> >
> > +static struct perf_event *
> > +perf_event_groups_next(struct perf_event *event);
> > +
> >  /*
> >   * Delete a group from a tree.
> >   */
> > @@ -1662,6 +1684,22 @@ perf_event_groups_delete(struct perf_event_groups *groups,
> >       WARN_ON_ONCE(RB_EMPTY_NODE(&event->group_node) ||
> >                    RB_EMPTY_ROOT(&groups->tree));
> >
> > +#ifdef CONFIG_CGROUP_PERF
> > +     if (is_cgroup_event(event)) {
> > +             struct perf_event **cgrp_event;
> > +
> > +             if (event->attr.pinned)
> > +                     cgrp_event = per_cpu_ptr(event->cgrp->pinned_event,
> > +                                             event->cpu);
> > +             else
> > +                     cgrp_event = per_cpu_ptr(event->cgrp->flexible_event,
> > +                                             event->cpu);
>
> Codingstyle again.

Done.

> > +
> > +             if (*cgrp_event == event)
> > +                     *cgrp_event = perf_event_groups_next(event);
> > +     }
> > +#endif
> > +
> >       rb_erase(&event->group_node, &groups->tree);
> >       init_event_group(event);
> >  }
> > @@ -1679,20 +1717,14 @@ del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
> >  }
> >
> >  /*
> > - * Get the leftmost event in the cpu/cgroup subtree.
> > + * Get the leftmost event in the cpu subtree without a cgroup (ie task or
> > + * system-wide).
> >   */
> >  static struct perf_event *
> > -perf_event_groups_first(struct perf_event_groups *groups, int cpu,
> > -                     struct cgroup *cgrp)
> > +perf_event_groups_first_no_cgroup(struct perf_event_groups *groups, int cpu)
>
> I'm going to impose a function name length limit soon :/ That's insane
> (again).

Done, with the argument added back in.

> >  {
> >       struct perf_event *node_event = NULL, *match = NULL;
> >       struct rb_node *node = groups->tree.rb_node;
> > -#ifdef CONFIG_CGROUP_PERF
> > -     int node_cgrp_id, cgrp_id = 0;
> > -
> > -     if (cgrp)
> > -             cgrp_id = cgrp->id;
> > -#endif
> >
> >       while (node) {
> >               node_event = container_of(node, struct perf_event, group_node);
> > @@ -1706,18 +1738,10 @@ perf_event_groups_first(struct perf_event_groups *groups, int cpu,
> >                       continue;
> >               }
> >  #ifdef CONFIG_CGROUP_PERF
> > -             node_cgrp_id = 0;
> > -             if (node_event->cgrp && node_event->cgrp->css.cgroup)
> > -                     node_cgrp_id = node_event->cgrp->css.cgroup->id;
> > -
> > -             if (cgrp_id < node_cgrp_id) {
> > +             if (node_event->cgrp) {
> >                       node = node->rb_left;
> >                       continue;
> >               }
> > -             if (cgrp_id > node_cgrp_id) {
> > -                     node = node->rb_right;
> > -                     continue;
> > -             }
> >  #endif
> >               match = node_event;
> >               node = node->rb_left;
>
> Also, just leave that in and let callers have: .cgrp = NULL. Then you
> can forgo that monstrous name.

Done. It is a shame that there is some extra logic for the task/no-cgroup case.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 02/10] lib: introduce generic min max heap
  2019-11-14  0:30 ` [PATCH v3 02/10] lib: introduce generic min max heap Ian Rogers
  2019-11-14  9:32   ` Peter Zijlstra
  2019-11-14  9:35   ` Peter Zijlstra
@ 2019-11-17 18:28   ` Joe Perches
  2019-11-18  8:40     ` Peter Zijlstra
  2 siblings, 1 reply; 80+ messages in thread
From: Joe Perches @ 2019-11-17 18:28 UTC (permalink / raw)
  To: Ian Rogers, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andrew Morton, Masahiro Yamada,
	Kees Cook, Catalin Marinas, Petr Mladek, Mauro Carvalho Chehab,
	Qian Cai, Joe Lawrence, Tetsuo Handa, Sri Krishna chowdary,
	Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen

On Wed, 2019-11-13 at 16:30 -0800, Ian Rogers wrote:
> Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Perhaps some functions are a bit large for inline
and perhaps the function names are too generic?

> diff --git a/include/linux/min_max_heap.h b/include/linux/min_max_heap.h
[]
> +/* Sift the element at pos down the heap. */
> +static inline void heapify(struct min_max_heap *heap, int pos,
> +			const struct min_max_heap_callbacks *func) {
> +	void *left_child, *right_child, *parent, *large_or_smallest;
> +	char *data = (char *)heap->data;

The kernel already depends on void * arithmetic so it
seems char *data could just as well be void *data and
it might be more readable without the temporary at all.
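
A minimal sketch of that suggestion (this assumes the kernel's GNU C
dialect, where arithmetic on void * steps in bytes; heap_elem is a
made-up helper for illustration, not proposed API):

#include <stddef.h>

static void *heap_elem(void *data, int pos, size_t elem_size)
{
	/* GNU extension: void * arithmetic behaves like char * here. */
	return data + (size_t)pos * elem_size;
}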

> +
> +	for (;;) {
> +		if (pos * 2 + 1 >= heap->size)
> +			break;
> +
> +		left_child = data + ((pos * 2 + 1) * func->elem_size);
> +		parent = data + (pos * func->elem_size);
> +		large_or_smallest = parent;
> +		if (func->cmp(left_child, large_or_smallest))
> +			large_or_smallest = left_child;
> +
> +		if (pos * 2 + 2 < heap->size) {
> +			right_child = data + ((pos * 2 + 2) * func->elem_size);
> +			if (func->cmp(right_child, large_or_smallest))
> +				large_or_smallest = right_child;
> +		}
> +		if (large_or_smallest == parent)
> +			break;
> +		func->swp(large_or_smallest, parent);
> +		if (large_or_smallest == left_child)
> +			pos = (pos * 2) + 1;
> +		else
> +			pos = (pos * 2) + 2;
> +	}
> +}

[]

> +static void heap_pop_push(struct min_max_heap *heap,
> +			const void *element,
> +			const struct min_max_heap_callbacks *func)
> +{
> +	char *data = (char *)heap->data;
> +
> +	memcpy(data, element, func->elem_size);
> +	heapify(heap, 0, func);
> +}

missing inline.

> +
> +/* Push an element on to the heap, O(log2(size)). */
> +static inline void
> +heap_push(struct min_max_heap *heap, const void *element,
> +	const struct min_max_heap_callbacks *func)
> +{
> +	void *child, *parent;
> +	int pos;
> +	char *data = (char *)heap->data;

Same comment about char * vs void * and unnecessary temporary.

> +
> +	if (WARN_ONCE(heap->size >= heap->cap, "Pushing on a full heap"))
> +		return;
> +
> +	/* Place at the end of data. */
> +	pos = heap->size;
> +	memcpy(data + (pos * func->elem_size), element, func->elem_size);
> +	heap->size++;
> +
> +	/* Sift up. */
> +	for (; pos > 0; pos = (pos - 1) / 2) {
> +		child = data + (pos * func->elem_size);
> +		parent = data + ((pos - 1) / 2) * func->elem_size;
> +		if (func->cmp(parent, child))
> +			break;
> +		func->swp(parent, child);
> +	}
> +}
> +
> +#endif /* _LINUX_MIN_MAX_HEAP_H */



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 08/10] perf: cache perf_event_groups_first for cgroups
  2019-11-16  1:20     ` Ian Rogers
@ 2019-11-18  8:37       ` Peter Zijlstra
  0 siblings, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2019-11-18  8:37 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang, LKML,
	Stephane Eranian, Andi Kleen

On Fri, Nov 15, 2019 at 05:20:52PM -0800, Ian Rogers wrote:
> On Thu, Nov 14, 2019 at 2:25 AM Peter Zijlstra <peterz@infradead.org> wrote:

> > > @@ -1706,18 +1738,10 @@ perf_event_groups_first(struct perf_event_groups *groups, int cpu,
> > >                       continue;
> > >               }
> > >  #ifdef CONFIG_CGROUP_PERF
> > > -             node_cgrp_id = 0;
> > > -             if (node_event->cgrp && node_event->cgrp->css.cgroup)
> > > -                     node_cgrp_id = node_event->cgrp->css.cgroup->id;
> > > -
> > > -             if (cgrp_id < node_cgrp_id) {
> > > +             if (node_event->cgrp) {
> > >                       node = node->rb_left;
> > >                       continue;
> > >               }
> > > -             if (cgrp_id > node_cgrp_id) {
> > > -                     node = node->rb_right;
> > > -                     continue;
> > > -             }
> > >  #endif
> > >               match = node_event;
> > >               node = node->rb_left;
> >
> > Also, just leave that in and let callers have: .cgrp = NULL. Then you
> > can forgo that monstrous name.
> 
> Done. It is a shame that there is some extra logic for the task/no-cgroup case.

Yes, OTOH the primitive is consistent and more generic and possibly the
compiler will notice and fix it for us, it is a static function after
all, so it can be more aggressive.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 02/10] lib: introduce generic min max heap
  2019-11-17 18:28   ` Joe Perches
@ 2019-11-18  8:40     ` Peter Zijlstra
  2019-11-18 11:50       ` Joe Perches
  0 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2019-11-18  8:40 UTC (permalink / raw)
  To: Joe Perches
  Cc: Ian Rogers, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen

On Sun, Nov 17, 2019 at 10:28:09AM -0800, Joe Perches wrote:
> On Wed, 2019-11-13 at 16:30 -0800, Ian Rogers wrote:
> > Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> Perhaps some functions are a bit large for inline

It all hard relies on always inline to have the indirect function
pointers constant folded and inlined too.

See for example also: include/linux/rbtree_augmented.h

Yes, it's a bit crud, but performance mandates it.
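
As a stand-alone illustration of the pattern (names are made up;
__always_inline is defined here only so the snippet builds outside the
kernel):

#include <stdbool.h>

#ifndef __always_inline
#define __always_inline inline __attribute__((__always_inline__))
#endif

struct ops {
	bool (*cmp)(const void *lhs, const void *rhs);
};

/*
 * Because the helper is always inlined and the ops table at the call
 * site is a compile-time constant, the compiler can replace the
 * indirect ops->cmp call with a direct call and inline that as well.
 */
static __always_inline bool first_wins(const struct ops *ops,
				       const void *lhs, const void *rhs)
{
	return ops->cmp(lhs, rhs);
}

static bool int_less(const void *lhs, const void *rhs)
{
	return *(const int *)lhs < *(const int *)rhs;
}

static const struct ops int_ops = { .cmp = int_less };

bool example(const int *a, const int *b)
{
	return first_wins(&int_ops, a, b);	/* can fold to *a < *b at -O2 */
}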

> and perhaps the function names are too generic?

Yeah, noted that already.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 02/10] lib: introduce generic min max heap
  2019-11-18  8:40     ` Peter Zijlstra
@ 2019-11-18 11:50       ` Joe Perches
  2019-11-18 12:21         ` Peter Zijlstra
  0 siblings, 1 reply; 80+ messages in thread
From: Joe Perches @ 2019-11-18 11:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ian Rogers, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen

On Mon, 2019-11-18 at 09:40 +0100, Peter Zijlstra wrote:
> On Sun, Nov 17, 2019 at 10:28:09AM -0800, Joe Perches wrote:
> > On Wed, 2019-11-13 at 16:30 -0800, Ian Rogers wrote:
> > > Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > 
> > Perhaps some functions are a bit large for inline
> 
> It all hard relies on always inline to have the indirect function
> pointers constant folded and inlined too.

Then perhaps __always_inline is more appropriate.



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 02/10] lib: introduce generic min max heap
  2019-11-18 11:50       ` Joe Perches
@ 2019-11-18 12:21         ` Peter Zijlstra
  0 siblings, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2019-11-18 12:21 UTC (permalink / raw)
  To: Joe Perches
  Cc: Ian Rogers, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Sri Krishna chowdary, Uladzislau Rezki (Sony),
	Andy Shevchenko, Changbin Du, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen

On Mon, Nov 18, 2019 at 03:50:33AM -0800, Joe Perches wrote:
> On Mon, 2019-11-18 at 09:40 +0100, Peter Zijlstra wrote:
> > On Sun, Nov 17, 2019 at 10:28:09AM -0800, Joe Perches wrote:
> > > On Wed, 2019-11-13 at 16:30 -0800, Ian Rogers wrote:
> > > > Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > 
> > > Perhaps some functions are a bit large for inline
> > 
> > It all hard relies on always inline to have the indirect function
> > pointers constant folded and inlined too.
> 
> Then perhaps __always_inline is more appropriate.

It is.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v4 02/10] lib: introduce generic min max heap
  2019-11-16  1:18   ` [PATCH v4 02/10] lib: introduce generic min max heap Ian Rogers
@ 2019-11-21 11:11     ` Joe Perches
  0 siblings, 0 replies; 80+ messages in thread
From: Joe Perches @ 2019-11-21 11:11 UTC (permalink / raw)
  To: Ian Rogers, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andrew Morton, Masahiro Yamada,
	Kees Cook, Catalin Marinas, Petr Mladek, Mauro Carvalho Chehab,
	Qian Cai, Joe Lawrence, Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen

On Fri, 2019-11-15 at 17:18 -0800, Ian Rogers wrote:
> Supports push, pop and converting an array into a heap.
> Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[]
> diff --git a/include/linux/min_max_heap.h b/include/linux/min_max_heap.h
[]
> +/* Sift the element at pos down the heap. */
> +static inline void min_max_heapify(struct min_max_heap *heap, int pos,
> +				const struct min_max_heap_callbacks *func)

s/inline/__always_inline/g

> +static void min_max_heap_pop_push(struct min_max_heap *heap,
> +				const void *element,
> +				const struct min_max_heap_callbacks *func)

And this still misses the inline attribute



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v5 00/10] Optimize cgroup context switch
  2019-11-16  1:18 ` [PATCH v4 " Ian Rogers
                     ` (9 preceding siblings ...)
  2019-11-16  1:18   ` [PATCH v4 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch Ian Rogers
@ 2019-12-06 23:15   ` Ian Rogers
  2019-12-06 23:15     ` [PATCH v5 01/10] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
                       ` (10 more replies)
  10 siblings, 11 replies; 80+ messages in thread
From: Ian Rogers @ 2019-12-06 23:15 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Avoid iterating over all per-CPU events during cgroup changing context
switches by organizing events by cgroup.

To make an efficient set of iterators, introduce a min max heap
utility with test.

The v5 patch set renames min_max_heap to min_heap as suggested by
Peter Zijlstra, it also addresses comments around preferring
__always_inline over inline.

The v4 patch set addresses review comments on the v3 patch set by
Peter Zijlstra.

These patches include a caching algorithm to improve the search for
the first event in a group by Kan Liang <kan.liang@linux.intel.com> as
well as rebasing his "optimize event_filter_match during sched_in"
from https://lkml.org/lkml/2019/8/7/771.

The v2 patch set was modified by Peter Zijlstra in his perf/cgroup
branch:
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git

These patches follow Peter's reorganization and his fixes to the
perf_cpu_context min_heap storage code.

Ian Rogers (8):
  lib: introduce generic min-heap
  perf: Use min_max_heap in visit_groups_merge
  perf: Add per perf_cpu_context min_heap storage
  perf/cgroup: Grow per perf_cpu_context heap storage
  perf/cgroup: Order events in RB tree by cgroup id
  perf: simplify and rename visit_groups_merge
  perf: cache perf_event_groups_first for cgroups
  perf: optimize event_filter_match during sched_in

Kan Liang (1):
  perf/cgroup: Do not switch system-wide events in cgroup switch

Peter Zijlstra (1):
  perf/cgroup: Reorder perf_cgroup_connect()

 include/linux/min_heap.h   | 135 +++++++++
 include/linux/perf_event.h |  15 +
 kernel/events/core.c       | 542 +++++++++++++++++++++++++++++--------
 lib/Kconfig.debug          |  10 +
 lib/Makefile               |   1 +
 lib/test_min_heap.c        | 194 +++++++++++++
 6 files changed, 783 insertions(+), 114 deletions(-)
 create mode 100644 include/linux/min_heap.h
 create mode 100644 lib/test_min_heap.c

-- 
2.24.0.393.g34dc348eaf-goog


^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v5 01/10] perf/cgroup: Reorder perf_cgroup_connect()
  2019-12-06 23:15   ` [PATCH v5 00/10] Optimize cgroup context switch Ian Rogers
@ 2019-12-06 23:15     ` Ian Rogers
  2019-12-06 23:15     ` [PATCH v5 02/10] lib: introduce generic min-heap Ian Rogers
                       ` (9 subsequent siblings)
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-12-06 23:15 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

From: Peter Zijlstra <peterz@infradead.org>

Move perf_cgroup_connect() after perf_event_alloc(), such that we can
find/use the PMU's cpu context.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 522438887206..9f055ca0651d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10762,12 +10762,6 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (!has_branch_stack(event))
 		event->attr.branch_sample_type = 0;
 
-	if (cgroup_fd != -1) {
-		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
-		if (err)
-			goto err_ns;
-	}
-
 	pmu = perf_init_event(event);
 	if (IS_ERR(pmu)) {
 		err = PTR_ERR(pmu);
@@ -10789,6 +10783,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		goto err_pmu;
 	}
 
+	if (cgroup_fd != -1) {
+		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
+		if (err)
+			goto err_pmu;
+	}
+
 	err = exclusive_event_init(event);
 	if (err)
 		goto err_pmu;
@@ -10849,12 +10849,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	exclusive_event_destroy(event);
 
 err_pmu:
+	if (is_cgroup_event(event))
+		perf_detach_cgroup(event);
 	if (event->destroy)
 		event->destroy(event);
 	module_put(pmu->module);
 err_ns:
-	if (is_cgroup_event(event))
-		perf_detach_cgroup(event);
 	if (event->ns)
 		put_pid_ns(event->ns);
 	if (event->hw.target)
-- 
2.24.0.393.g34dc348eaf-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v5 02/10] lib: introduce generic min-heap
  2019-12-06 23:15   ` [PATCH v5 00/10] Optimize cgroup context switch Ian Rogers
  2019-12-06 23:15     ` [PATCH v5 01/10] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
@ 2019-12-06 23:15     ` Ian Rogers
  2019-12-06 23:15     ` [PATCH v5 03/10] perf: Use min_max_heap in visit_groups_merge Ian Rogers
                       ` (8 subsequent siblings)
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-12-06 23:15 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Supports push, pop and converting an array into a heap. If the sense of
the compare function is inverted then it can provide a max-heap.
Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Signed-off-by: Ian Rogers <irogers@google.com>
---
 include/linux/min_heap.h | 135 +++++++++++++++++++++++++++
 lib/Kconfig.debug        |  10 ++
 lib/Makefile             |   1 +
 lib/test_min_heap.c      | 194 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 340 insertions(+)
 create mode 100644 include/linux/min_heap.h
 create mode 100644 lib/test_min_heap.c

diff --git a/include/linux/min_heap.h b/include/linux/min_heap.h
new file mode 100644
index 000000000000..0f04f49c0779
--- /dev/null
+++ b/include/linux/min_heap.h
@@ -0,0 +1,135 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MIN_HEAP_H
+#define _LINUX_MIN_HEAP_H
+
+#include <linux/bug.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+/**
+ * struct min_heap - Data structure to hold a min-heap.
+ * @data: Start of array holding the heap elements.
+ * @size: Number of elements currently in the heap.
+ * @cap: Maximum number of elements that can be held in current storage.
+ */
+struct min_heap {
+	void *data;
+	int size;
+	int cap;
+};
+
+/**
+ * struct min_heap_callbacks - Data/functions to customise the min_heap.
+ * @elem_size: The size of each element in bytes.
+ * @cmp: Partial order function for this heap 'less'/'<' for min-heap,
+ *       'greater'/'>' for max-heap.
+ * @swp: Swap elements function.
+ */
+struct min_heap_callbacks {
+	int elem_size;
+	bool (*cmp)(const void *lhs, const void *rhs);
+	void (*swp)(void *lhs, void *rhs);
+};
+
+/* Sift the element at pos down the heap. */
+static __always_inline
+void min_heapify(struct min_heap *heap, int pos,
+		const struct min_heap_callbacks *func)
+{
+	void *left_child, *right_child, *parent, *large_or_smallest;
+	u8 *data = (u8 *)heap->data;
+
+	for (;;) {
+		if (pos * 2 + 1 >= heap->size)
+			break;
+
+		left_child = data + ((pos * 2 + 1) * func->elem_size);
+		parent = data + (pos * func->elem_size);
+		large_or_smallest = parent;
+		if (func->cmp(left_child, large_or_smallest))
+			large_or_smallest = left_child;
+
+		if (pos * 2 + 2 < heap->size) {
+			right_child = data + ((pos * 2 + 2) * func->elem_size);
+			if (func->cmp(right_child, large_or_smallest))
+				large_or_smallest = right_child;
+		}
+		if (large_or_smallest == parent)
+			break;
+		func->swp(large_or_smallest, parent);
+		if (large_or_smallest == left_child)
+			pos = (pos * 2) + 1;
+		else
+			pos = (pos * 2) + 2;
+	}
+}
+
+/* Floyd's approach to heapification that is O(size). */
+static __always_inline
+void min_heapify_all(struct min_heap *heap,
+		const struct min_heap_callbacks *func)
+{
+	int i;
+
+	for (i = heap->size / 2; i >= 0; i--)
+		min_heapify(heap, i, func);
+}
+
+/* Remove minimum element from the heap, O(log2(size)). */
+static __always_inline
+void min_heap_pop(struct min_heap *heap,
+		const struct min_heap_callbacks *func)
+{
+	u8 *data = (u8 *)heap->data;
+
+	if (WARN_ONCE(heap->size <= 0, "Popping an empty heap"))
+		return;
+
+	/* Place last element at the root (position 0) and then sift down. */
+	heap->size--;
+	memcpy(data, data + (heap->size * func->elem_size), func->elem_size);
+	min_heapify(heap, 0, func);
+}
+
+/*
+ * Remove the minimum element and then push the given element. The
+ * implementation performs 1 sift (O(log2(size))) and is therefore more
+ * efficient than a pop followed by a push that does 2.
+ */
+static __always_inline
+void min_heap_pop_push(struct min_heap *heap,
+		const void *element,
+		const struct min_heap_callbacks *func)
+{
+	memcpy(heap->data, element, func->elem_size);
+	min_heapify(heap, 0, func);
+}
+
+/* Push an element on to the heap, O(log2(size)). */
+static __always_inline
+void min_heap_push(struct min_heap *heap, const void *element,
+		const struct min_heap_callbacks *func)
+{
+	void *child, *parent;
+	int pos;
+	u8 *data = (u8 *)heap->data;
+
+	if (WARN_ONCE(heap->size >= heap->cap, "Pushing on a full heap"))
+		return;
+
+	/* Place at the end of data. */
+	pos = heap->size;
+	memcpy(data + (pos * func->elem_size), element, func->elem_size);
+	heap->size++;
+
+	/* Sift child at pos up. */
+	for (; pos > 0; pos = (pos - 1) / 2) {
+		child = data + (pos * func->elem_size);
+		parent = data + ((pos - 1) / 2) * func->elem_size;
+		if (func->cmp(parent, child))
+			break;
+		func->swp(parent, child);
+	}
+}
+
+#endif /* _LINUX_MIN_HEAP_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 35accd1d93de..aedad6a89745 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1693,6 +1693,16 @@ config TEST_LIST_SORT
 
 	  If unsure, say N.
 
+config TEST_MIN_HEAP
+	tristate "Min heap test"
+	depends on DEBUG_KERNEL || m
+	help
+	  Enable this to turn on min heap function tests. This test is
+	  executed only once during system boot (so affects only boot time),
+	  or at module load time.
+
+	  If unsure, say N.
+
 config TEST_SORT
 	tristate "Array-based sort test"
 	depends on DEBUG_KERNEL || m
diff --git a/lib/Makefile b/lib/Makefile
index 778ab704e3ad..c30ee12f8a9c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -70,6 +70,7 @@ CFLAGS_test_ubsan.o += $(call cc-disable-warning, vla)
 UBSAN_SANITIZE_test_ubsan.o := y
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
 obj-$(CONFIG_TEST_LIST_SORT) += test_list_sort.o
+obj-$(CONFIG_TEST_MIN_HEAP) += test_min_heap.o
 obj-$(CONFIG_TEST_LKM) += test_module.o
 obj-$(CONFIG_TEST_VMALLOC) += test_vmalloc.o
 obj-$(CONFIG_TEST_OVERFLOW) += test_overflow.o
diff --git a/lib/test_min_heap.c b/lib/test_min_heap.c
new file mode 100644
index 000000000000..0f06d1f757b5
--- /dev/null
+++ b/lib/test_min_heap.c
@@ -0,0 +1,194 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define pr_fmt(fmt) "min_heap_test: " fmt
+
+/*
+ * Test cases for the min max heap.
+ */
+
+#include <linux/log2.h>
+#include <linux/min_heap.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/random.h>
+
+static __init bool less_than(const void *lhs, const void *rhs)
+{
+	return *(int *)lhs < *(int *)rhs;
+}
+
+static __init bool greater_than(const void *lhs, const void *rhs)
+{
+	return *(int *)lhs > *(int *)rhs;
+}
+
+static __init void swap_ints(void *lhs, void *rhs)
+{
+	int temp = *(int *)lhs;
+
+	*(int *)lhs = *(int *)rhs;
+	*(int *)rhs = temp;
+}
+
+static __init int pop_verify_heap(bool min_heap,
+				struct min_heap *heap,
+				const struct min_heap_callbacks *funcs)
+{
+	int last;
+	int *values = (int *)heap->data;
+	int err = 0;
+
+	last = values[0];
+	min_heap_pop(heap, funcs);
+	while (heap->size > 0) {
+		if (min_heap) {
+			if (last > values[0]) {
+				pr_err("error: expected %d <= %d\n", last,
+					values[0]);
+				err++;
+			}
+		} else {
+			if (last < values[0]) {
+				pr_err("error: expected %d >= %d\n", last,
+					values[0]);
+				err++;
+			}
+		}
+		last = values[0];
+		min_heap_pop(heap, funcs);
+	}
+	return err;
+}
+
+static __init int test_heapify_all(bool min_heap)
+{
+	int values[] = { 3, 1, 2, 4, 0x8000000, 0x7FFFFFF, 0,
+			 -3, -1, -2, -4, 0x8000000, 0x7FFFFFF };
+	struct min_heap heap = {
+		.data = values,
+		.size = ARRAY_SIZE(values),
+		.cap =  ARRAY_SIZE(values),
+	};
+	struct min_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.cmp = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, err;
+
+	/* Test with known set of values. */
+	min_heapify_all(&heap, &funcs);
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+
+	/* Test with randomly generated values. */
+	heap.size = ARRAY_SIZE(values);
+	for (i = 0; i < heap.size; i++)
+		values[i] = get_random_int();
+
+	min_heapify_all(&heap, &funcs);
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static __init int test_heap_push(bool min_heap)
+{
+	const int data[] = { 3, 1, 2, 4, 0x80000000, 0x7FFFFFFF, 0,
+			     -3, -1, -2, -4, 0x80000000, 0x7FFFFFFF };
+	int values[ARRAY_SIZE(data)];
+	struct min_heap heap = {
+		.data = values,
+		.size = 0,
+		.cap =  ARRAY_SIZE(values),
+	};
+	struct min_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.cmp = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, temp, err;
+
+	/* Test with known set of values copied from data. */
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_heap_push(&heap, &data[i], &funcs);
+
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+	/* Test with randomly generated values. */
+	while (heap.size < heap.cap) {
+		temp = get_random_int();
+		min_heap_push(&heap, &temp, &funcs);
+	}
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static __init int test_heap_pop_push(bool min_heap)
+{
+	const int data[] = { 3, 1, 2, 4, 0x80000000, 0x7FFFFFFF, 0,
+			     -3, -1, -2, -4, 0x80000000, 0x7FFFFFFF };
+	int values[ARRAY_SIZE(data)];
+	struct min_heap heap = {
+		.data = values,
+		.size = 0,
+		.cap =  ARRAY_SIZE(values),
+	};
+	struct min_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.cmp = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, temp, err;
+
+	/* Fill values with data to pop and replace. */
+	temp = min_heap ? 0x80000000 : 0x7FFFFFFF;
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_heap_push(&heap, &temp, &funcs);
+
+	/* Test with known set of values copied from data. */
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_heap_pop_push(&heap, &data[i], &funcs);
+
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+	heap.size = 0;
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_heap_push(&heap, &temp, &funcs);
+
+	/* Test with randomly generated values. */
+	for (i = 0; i < ARRAY_SIZE(data); i++) {
+		temp = get_random_int();
+		min_heap_pop_push(&heap, &temp, &funcs);
+	}
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static int __init test_min_heap_init(void)
+{
+	int err = 0;
+
+	err += test_heapify_all(true);
+	err += test_heapify_all(false);
+	err += test_heap_push(true);
+	err += test_heap_push(false);
+	err += test_heap_pop_push(true);
+	err += test_heap_pop_push(false);
+	if (err) {
+		pr_err("test failed with %d errors\n", err);
+		return -EINVAL;
+	}
+	pr_info("test passed\n");
+	return 0;
+}
+module_init(test_min_heap_init);
+
+static void __exit test_min_heap_exit(void)
+{
+	/* do nothing */
+}
+module_exit(test_min_heap_exit);
+
+MODULE_LICENSE("GPL");
-- 
2.24.0.393.g34dc348eaf-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v5 03/10] perf: Use min_max_heap in visit_groups_merge
  2019-12-06 23:15   ` [PATCH v5 00/10] Optimize cgroup context switch Ian Rogers
  2019-12-06 23:15     ` [PATCH v5 01/10] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
  2019-12-06 23:15     ` [PATCH v5 02/10] lib: introduce generic min-heap Ian Rogers
@ 2019-12-06 23:15     ` Ian Rogers
  2019-12-08  7:10       ` kbuild test robot
  2019-12-06 23:15     ` [PATCH v5 04/10] perf: Add per perf_cpu_context min_heap storage Ian Rogers
                       ` (7 subsequent siblings)
  10 siblings, 1 reply; 80+ messages in thread
From: Ian Rogers @ 2019-12-06 23:15 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

visit_groups_merge will pick the next event based on when it was
inserted into the context (perf_event group_index). Events may be per CPU
or for any CPU, but in the future we'd also like to have per cgroup events
to avoid searching all events for the events to schedule for a cgroup.
Introduce a min heap for the events that maintains a property that the
earliest inserted event is always at the 0th element. Initialize the heap
with per-CPU and any-CPU events for the context.
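
As an aside, the merge idea in miniature (plain ints stand in for
group_index and plain arrays for the rbtree iterators; this is only an
illustration, not the kernel API):

#include <stdio.h>

struct iter { const int *pos, *end; };

static void sift_down(struct iter *heap, int size)
{
	int pos = 0;

	while (2 * pos + 1 < size) {
		int child = 2 * pos + 1;
		struct iter tmp;

		if (child + 1 < size && *heap[child + 1].pos < *heap[child].pos)
			child++;
		if (*heap[pos].pos <= *heap[child].pos)
			break;
		tmp = heap[pos];
		heap[pos] = heap[child];
		heap[child] = tmp;
		pos = child;
	}
}

int main(void)
{
	const int a[] = { 1, 4, 7 }, b[] = { 2, 3, 9 };
	struct iter heap[2] = { { a, a + 3 }, { b, b + 3 } };
	int size = 2;

	sift_down(heap, size);			/* heapify the iterators */
	while (size) {
		printf("%d ", *heap[0].pos);	/* visit smallest (0th element) */
		if (++heap[0].pos == heap[0].end)
			heap[0] = heap[--size];	/* iterator exhausted: pop */
		sift_down(heap, size);		/* pop_push: re-sift the root */
	}
	printf("\n");				/* prints: 1 2 3 4 7 9 */
	return 0;
}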
Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 72 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 54 insertions(+), 18 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9f055ca0651d..e0cc1c833408 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,6 +49,7 @@
 #include <linux/sched/mm.h>
 #include <linux/proc_ns.h>
 #include <linux/mount.h>
+#include <linux/min_max_heap.h>
 
 #include "internal.h"
 
@@ -3387,32 +3388,67 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
 	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
 }
 
-static int visit_groups_merge(struct perf_event_groups *groups, int cpu,
-			      int (*func)(struct perf_event *, void *), void *data)
+static bool perf_cmp_group_idx(const void *l, const void *r)
 {
-	struct perf_event **evt, *evt1, *evt2;
+	const struct perf_event *le = l, *re = r;
+
+	return le->group_index < re->group_index;
+}
+
+static void swap_ptr(void *l, void *r)
+{
+	void **lp = l, **rp = r;
+
+	swap(*lp, *rp);
+}
+
+static const struct min_max_heap_callbacks perf_min_heap = {
+	.elem_size = sizeof(struct perf_event *),
+	.cmp = perf_cmp_group_idx,
+	.swp = swap_ptr,
+};
+
+static void __heap_add(struct min_max_heap *heap, struct perf_event *event)
+{
+	struct perf_event **itrs = heap->data;
+
+	if (event) {
+		itrs[heap->size] = event;
+		heap->size++;
+	}
+}
+
+static noinline int visit_groups_merge(struct perf_event_groups *groups,
+				int cpu,
+				int (*func)(struct perf_event *, void *),
+				void *data)
+{
+	/* Space for per CPU and/or any CPU event iterators. */
+	struct perf_event *itrs[2];
+	struct min_max_heap event_heap = {
+		.data = itrs,
+		.size = 0,
+		.cap = ARRAY_SIZE(itrs),
+	};
+	struct perf_event *next;
 	int ret;
 
-	evt1 = perf_event_groups_first(groups, -1);
-	evt2 = perf_event_groups_first(groups, cpu);
+	__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));
 
-	while (evt1 || evt2) {
-		if (evt1 && evt2) {
-			if (evt1->group_index < evt2->group_index)
-				evt = &evt1;
-			else
-				evt = &evt2;
-		} else if (evt1) {
-			evt = &evt1;
-		} else {
-			evt = &evt2;
-		}
+	min_max_heapify_all(&event_heap, &perf_min_heap);
 
-		ret = func(*evt, data);
+	while (event_heap.size) {
+		ret = func(itrs[0], data);
 		if (ret)
 			return ret;
 
-		*evt = perf_event_groups_next(*evt);
+		next = perf_event_groups_next(itrs[0]);
+		if (next) {
+			min_max_heap_pop_push(&event_heap, &next,
+					&perf_min_heap);
+		} else
+			min_max_heap_pop(&event_heap, &perf_min_heap);
 	}
 
 	return 0;
-- 
2.24.0.393.g34dc348eaf-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v5 04/10] perf: Add per perf_cpu_context min_heap storage
  2019-12-06 23:15   ` [PATCH v5 00/10] Optimize cgroup context switch Ian Rogers
                       ` (2 preceding siblings ...)
  2019-12-06 23:15     ` [PATCH v5 03/10] perf: Use min_max_heap in visit_groups_merge Ian Rogers
@ 2019-12-06 23:15     ` Ian Rogers
  2019-12-06 23:15     ` [PATCH v5 05/10] perf/cgroup: Grow per perf_cpu_context heap storage Ian Rogers
                       ` (6 subsequent siblings)
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-12-06 23:15 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

The storage required for visit_groups_merge's min heap needs to vary in
order to support more iterators, such as when multiple nested cgroups'
events are being visited. This change allows for 2 iterators and doesn't
support growth.
Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Signed-off-by: Ian Rogers <irogers@google.com>
---
 include/linux/perf_event.h |  7 +++++
 kernel/events/core.c       | 63 +++++++++++++++++++++++++-------------
 2 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 34c7c6910026..cd7d3b624655 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -850,6 +850,13 @@ struct perf_cpu_context {
 	int				sched_cb_usage;
 
 	int				online;
+	/*
+	 * Per-CPU storage for iterators used in visit_groups_merge. The default
+	 * storage is of size 2 to hold the CPU and any CPU event iterators.
+	 */
+	int				itr_storage_cap;
+	struct perf_event		**itr_storage;
+	struct perf_event		*itr_default[2];
 };
 
 struct perf_output_handle {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e0cc1c833408..259eca137563 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,7 +49,7 @@
 #include <linux/sched/mm.h>
 #include <linux/proc_ns.h>
 #include <linux/mount.h>
-#include <linux/min_max_heap.h>
+#include <linux/min_heap.h>
 
 #include "internal.h"
 
@@ -3402,13 +3402,13 @@ static void swap_ptr(void *l, void *r)
 	swap(*lp, *rp);
 }
 
-static const struct min_max_heap_callbacks perf_min_heap = {
+static const struct min_heap_callbacks perf_min_heap = {
 	.elem_size = sizeof(struct perf_event *),
 	.cmp = perf_cmp_group_idx,
 	.swp = swap_ptr,
 };
 
-static void __heap_add(struct min_max_heap *heap, struct perf_event *event)
+static void __heap_add(struct min_heap *heap, struct perf_event *event)
 {
 	struct perf_event **itrs = heap->data;
 
@@ -3418,37 +3418,49 @@ static void __heap_add(struct min_max_heap *heap, struct perf_event *event)
 	}
 }
 
-static noinline int visit_groups_merge(struct perf_event_groups *groups,
-				int cpu,
+static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
+				struct perf_event_groups *groups, int cpu,
 				int (*func)(struct perf_event *, void *),
 				void *data)
 {
 	/* Space for per CPU and/or any CPU event iterators. */
 	struct perf_event *itrs[2];
-	struct min_max_heap event_heap = {
-		.data = itrs,
-		.size = 0,
-		.cap = ARRAY_SIZE(itrs),
-	};
+	struct min_heap event_heap;
+	struct perf_event **evt;
 	struct perf_event *next;
 	int ret;
 
-	__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+	if (cpuctx) {
+		event_heap = (struct min_heap){
+			.data = cpuctx->itr_storage,
+			.size = 0,
+			.cap = cpuctx->itr_storage_cap,
+		};
+	} else {
+		event_heap = (struct min_heap){
+			.data = itrs,
+			.size = 0,
+			.cap = ARRAY_SIZE(itrs),
+		};
+		/* Events not within a CPU context may be on any CPU. */
+		__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+	}
+	evt = event_heap.data;
+
 	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));
 
-	min_max_heapify_all(&event_heap, &perf_min_heap);
+	min_heapify_all(&event_heap, &perf_min_heap);
 
 	while (event_heap.size) {
-		ret = func(itrs[0], data);
+		ret = func(*evt, data);
 		if (ret)
 			return ret;
 
-		next = perf_event_groups_next(itrs[0]);
-		if (next) {
-			min_max_heap_pop_push(&event_heap, &next,
-					&perf_min_heap);
-		} else
-			min_max_heap_pop(&event_heap, &perf_min_heap);
+		next = perf_event_groups_next(*evt);
+		if (next)
+			min_heap_pop_push(&event_heap, &next, &perf_min_heap);
+		else
+			min_heap_pop(&event_heap, &perf_min_heap);
 	}
 
 	return 0;
@@ -3518,7 +3530,10 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
 		.can_add_hw = 1,
 	};
 
-	visit_groups_merge(&ctx->pinned_groups,
+	if (ctx != &cpuctx->ctx)
+		cpuctx = NULL;
+
+	visit_groups_merge(cpuctx, &ctx->pinned_groups,
 			   smp_processor_id(),
 			   pinned_sched_in, &sid);
 }
@@ -3533,7 +3548,10 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
 		.can_add_hw = 1,
 	};
 
-	visit_groups_merge(&ctx->flexible_groups,
+	if (ctx != &cpuctx->ctx)
+		cpuctx = NULL;
+
+	visit_groups_merge(cpuctx, &ctx->flexible_groups,
 			   smp_processor_id(),
 			   flexible_sched_in, &sid);
 }
@@ -10350,6 +10368,9 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 		cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
 
 		__perf_mux_hrtimer_init(cpuctx, cpu);
+
+		cpuctx->itr_storage_cap = ARRAY_SIZE(cpuctx->itr_default);
+		cpuctx->itr_storage = cpuctx->itr_default;
 	}
 
 got_cpu_context:
-- 
2.24.0.393.g34dc348eaf-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v5 05/10] perf/cgroup: Grow per perf_cpu_context heap storage
  2019-12-06 23:15   ` [PATCH v5 00/10] Optimize cgroup context switch Ian Rogers
                       ` (3 preceding siblings ...)
  2019-12-06 23:15     ` [PATCH v5 04/10] perf: Add per perf_cpu_context min_heap storage Ian Rogers
@ 2019-12-06 23:15     ` Ian Rogers
  2019-12-06 23:15     ` [PATCH v5 06/10] perf/cgroup: Order events in RB tree by cgroup id Ian Rogers
                       ` (5 subsequent siblings)
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-12-06 23:15 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Allow the per-CPU min heap storage to have sufficient space for per-cgroup
iterators.
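
For illustration only (not part of the patch itself), the sizing rule is one
iterator slot per level of cgroup nesting plus one slot for events that have
no cgroup; a minimal sketch of that calculation, mirroring the loop added
below, is:

   /* Sketch of the iterator-capacity rule; css walks up the hierarchy. */
   int itr_cap = 1;                    /* events with no cgroup */

   for (; css; css = css->parent)
           itr_cap++;                  /* one slot per cgroup level */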

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 259eca137563..1e484ffff572 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -892,6 +892,47 @@ static inline void perf_cgroup_sched_in(struct task_struct *prev,
 	rcu_read_unlock();
 }
 
+static int perf_cgroup_ensure_storage(struct perf_event *event,
+				struct cgroup_subsys_state *css)
+{
+	struct perf_cpu_context *cpuctx;
+	struct perf_event **storage;
+	int cpu, itr_cap, ret = 0;
+
+	/*
+	 * Allow storage to have sufficient space for an iterator for each
+	 * possibly nested cgroup plus an iterator for events with no cgroup.
+	 */
+	for (itr_cap = 1; css; css = css->parent)
+		itr_cap++;
+
+	for_each_possible_cpu(cpu) {
+		cpuctx = per_cpu_ptr(event->pmu->pmu_cpu_context, cpu);
+		if (itr_cap <= cpuctx->itr_storage_cap)
+			continue;
+
+		storage = kmalloc_node(itr_cap * sizeof(struct perf_event *),
+				       GFP_KERNEL, cpu_to_node(cpu));
+		if (!storage) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		raw_spin_lock_irq(&cpuctx->ctx.lock);
+		if (cpuctx->itr_storage_cap < itr_cap) {
+			swap(cpuctx->itr_storage, storage);
+			if (storage == cpuctx->itr_default)
+				storage = NULL;
+			cpuctx->itr_storage_cap = itr_cap;
+		}
+		raw_spin_unlock_irq(&cpuctx->ctx.lock);
+
+		kfree(storage);
+	}
+
+	return ret;
+}
+
 static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 				      struct perf_event_attr *attr,
 				      struct perf_event *group_leader)
@@ -911,6 +952,10 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 		goto out;
 	}
 
+	ret = perf_cgroup_ensure_storage(event, css);
+	if (ret)
+		goto out;
+
 	cgrp = container_of(css, struct perf_cgroup, css);
 	event->cgrp = cgrp;
 
@@ -3436,6 +3481,8 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 			.size = 0,
 			.cap = cpuctx->itr_storage_cap,
 		};
+
+		lockdep_assert_held(&cpuctx->ctx.lock);
 	} else {
 		event_heap = (struct min_heap){
 			.data = itrs,
-- 
2.24.0.393.g34dc348eaf-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v5 06/10] perf/cgroup: Order events in RB tree by cgroup id
  2019-12-06 23:15   ` [PATCH v5 00/10] Optimize cgroup context switch Ian Rogers
                       ` (4 preceding siblings ...)
  2019-12-06 23:15     ` [PATCH v5 05/10] perf/cgroup: Grow per perf_cpu_context heap storage Ian Rogers
@ 2019-12-06 23:15     ` Ian Rogers
  2019-12-06 23:15     ` [PATCH v5 07/10] perf: simplify and rename visit_groups_merge Ian Rogers
                       ` (4 subsequent siblings)
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-12-06 23:15 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

If one is monitoring 6 events on 20 cgroups the per-CPU RB tree will
hold 120 events. The scheduling in of the events currently iterates
over all events looking to see which events match the task's cgroup or
its cgroup hierarchy. If a task is in 1 cgroup with 6 events, then 114
events are considered unnecessarily.

This change orders events in the RB tree by cgroup id if it is present.
This means scheduling in may go directly to events associated with the
task's cgroup if one is present. The per-CPU iterator storage in
visit_groups_merge is sized sufficiently for an iterator per cgroup depth,
where different iterators are needed for the task's cgroup and parent
cgroups. By considering the set of iterators when visiting, the lowest
group_index event may be selected and the insertion order group_index
property is maintained. This also allows event rotation to function
correctly, as although events are grouped into a cgroup, rotation always
selects the lowest group_index event to rotate (delete/insert into the
tree) and the min heap of iterators makes it so that the group_index order
is maintained.
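
Conceptually the tree is now sorted by a three-part key; the sketch below is
illustrative only (event_cgroup_id() is a made-up helper standing in for the
cgrp->css.cgroup->id lookup done in the patch), but it shows the ordering:

   /* Illustrative ordering: (cpu, cgroup id, group_index). */
   static bool group_less(struct perf_event *l, struct perf_event *r)
   {
           if (l->cpu != r->cpu)
                   return l->cpu < r->cpu;
           /* Events without a cgroup sort first (treated as id 0). */
           if (event_cgroup_id(l) != event_cgroup_id(r))
                   return event_cgroup_id(l) < event_cgroup_id(r);
           return l->group_index < r->group_index;
   }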

Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/20190724223746.153620-3-irogers@google.com
---
 kernel/events/core.c | 97 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 87 insertions(+), 10 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1e484ffff572..20e08d0c1cb9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1576,6 +1576,30 @@ perf_event_groups_less(struct perf_event *left, struct perf_event *right)
 	if (left->cpu > right->cpu)
 		return false;
 
+#ifdef CONFIG_CGROUP_PERF
+	if (left->cgrp != right->cgrp) {
+		if (!left->cgrp || !left->cgrp->css.cgroup) {
+			/*
+			 * Left has no cgroup but right does, no cgroups come
+			 * first.
+			 */
+			return true;
+		}
+		if (!right->cgrp || !right->cgrp->css.cgroup) {
+			/*
+			 * Right has no cgroup but left does, no cgroups come
+			 * first.
+			 */
+			return false;
+		}
+		/* Two dissimilar cgroups, order by id. */
+		if (left->cgrp->css.cgroup->id < right->cgrp->css.cgroup->id)
+			return true;
+
+		return false;
+	}
+#endif
+
 	if (left->group_index < right->group_index)
 		return true;
 	if (left->group_index > right->group_index)
@@ -1655,25 +1679,48 @@ del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
 }
 
 /*
- * Get the leftmost event in the @cpu subtree.
+ * Get the leftmost event in the cpu/cgroup subtree.
  */
 static struct perf_event *
-perf_event_groups_first(struct perf_event_groups *groups, int cpu)
+perf_event_groups_first(struct perf_event_groups *groups, int cpu,
+			struct cgroup *cgrp)
 {
 	struct perf_event *node_event = NULL, *match = NULL;
 	struct rb_node *node = groups->tree.rb_node;
+#ifdef CONFIG_CGROUP_PERF
+	int node_cgrp_id, cgrp_id = 0;
+
+	if (cgrp)
+		cgrp_id = cgrp->id;
+#endif
 
 	while (node) {
 		node_event = container_of(node, struct perf_event, group_node);
 
 		if (cpu < node_event->cpu) {
 			node = node->rb_left;
-		} else if (cpu > node_event->cpu) {
+			continue;
+		}
+		if (cpu > node_event->cpu) {
 			node = node->rb_right;
-		} else {
-			match = node_event;
+			continue;
+		}
+#ifdef CONFIG_CGROUP_PERF
+		node_cgrp_id = 0;
+		if (node_event->cgrp && node_event->cgrp->css.cgroup)
+			node_cgrp_id = node_event->cgrp->css.cgroup->id;
+
+		if (cgrp_id < node_cgrp_id) {
 			node = node->rb_left;
+			continue;
+		}
+		if (cgrp_id > node_cgrp_id) {
+			node = node->rb_right;
+			continue;
 		}
+#endif
+		match = node_event;
+		node = node->rb_left;
 	}
 
 	return match;
@@ -1686,12 +1733,26 @@ static struct perf_event *
 perf_event_groups_next(struct perf_event *event)
 {
 	struct perf_event *next;
+#ifdef CONFIG_CGROUP_PERF
+	int curr_cgrp_id = 0;
+	int next_cgrp_id = 0;
+#endif
 
 	next = rb_entry_safe(rb_next(&event->group_node), typeof(*event), group_node);
-	if (next && next->cpu == event->cpu)
-		return next;
+	if (next == NULL || next->cpu != event->cpu)
+		return NULL;
 
-	return NULL;
+#ifdef CONFIG_CGROUP_PERF
+	if (event->cgrp && event->cgrp->css.cgroup)
+		curr_cgrp_id = event->cgrp->css.cgroup->id;
+
+	if (next->cgrp && next->cgrp->css.cgroup)
+		next_cgrp_id = next->cgrp->css.cgroup->id;
+
+	if (curr_cgrp_id != next_cgrp_id)
+		return NULL;
+#endif
+	return next;
 }
 
 /*
@@ -3468,6 +3529,9 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 				int (*func)(struct perf_event *, void *),
 				void *data)
 {
+#ifdef CONFIG_CGROUP_PERF
+	struct cgroup_subsys_state *css = NULL;
+#endif
 	/* Space for per CPU and/or any CPU event iterators. */
 	struct perf_event *itrs[2];
 	struct min_heap event_heap;
@@ -3483,6 +3547,11 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 		};
 
 		lockdep_assert_held(&cpuctx->ctx.lock);
+
+#ifdef CONFIG_CGROUP_PERF
+		if (cpuctx->cgrp)
+			css = &cpuctx->cgrp->css;
+#endif
 	} else {
 		event_heap = (struct min_heap){
 			.data = itrs,
@@ -3490,11 +3559,19 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 			.cap = ARRAY_SIZE(itrs),
 		};
 		/* Events not within a CPU context may be on any CPU. */
-		__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+		__heap_add(&event_heap, perf_event_groups_first(groups, -1,
+									NULL));
 	}
 	evt = event_heap.data;
 
-	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));
+	__heap_add(&event_heap, perf_event_groups_first(groups, cpu, NULL));
+
+#ifdef CONFIG_CGROUP_PERF
+	for (; css; css = css->parent) {
+		__heap_add(&event_heap, perf_event_groups_first(groups, cpu,
+								css->cgroup));
+	}
+#endif
 
 	min_heapify_all(&event_heap, &perf_min_heap);
 
-- 
2.24.0.393.g34dc348eaf-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v5 07/10] perf: simplify and rename visit_groups_merge
  2019-12-06 23:15   ` [PATCH v5 00/10] Optimize cgroup context switch Ian Rogers
                       ` (5 preceding siblings ...)
  2019-12-06 23:15     ` [PATCH v5 06/10] perf/cgroup: Order events in RB tree by cgroup id Ian Rogers
@ 2019-12-06 23:15     ` Ian Rogers
  2019-12-06 23:15     ` [PATCH v5 08/10] perf: cache perf_event_groups_first for cgroups Ian Rogers
                       ` (3 subsequent siblings)
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-12-06 23:15 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

To enable a future caching optimization, pass in whether
visit_groups_merge is operating on pinned or flexible groups. The
is_pinned argument makes the func argument redundant; rename the
function to ctx_groups_sched_in as it just schedules pinned or flexible
groups in. Compute the cpu and groups arguments locally to reduce the
argument list size. Remove sched_in_data as it repeats arguments already
passed in. Merge pinned_sched_in and flexible_sched_in and use the
is_pinned argument to determine the active list.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 149 ++++++++++++++-----------------------------
 1 file changed, 49 insertions(+), 100 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 20e08d0c1cb9..3da9cc1ebc2d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2133,7 +2133,6 @@ static void perf_group_detach(struct perf_event *event)
 
 		if (!RB_EMPTY_NODE(&event->group_node)) {
 			add_event_to_groups(sibling, event->ctx);
-
 			if (sibling->state == PERF_EVENT_STATE_ACTIVE) {
 				struct list_head *list = sibling->attr.pinned ?
 					&ctx->pinned_active : &ctx->flexible_active;
@@ -2456,6 +2455,8 @@ event_sched_in(struct perf_event *event,
 {
 	int ret = 0;
 
+	WARN_ON_ONCE(event->ctx != ctx);
+
 	lockdep_assert_held(&ctx->lock);
 
 	if (event->state <= PERF_EVENT_STATE_OFF)
@@ -3524,10 +3525,42 @@ static void __heap_add(struct min_heap *heap, struct perf_event *event)
 	}
 }
 
-static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
-				struct perf_event_groups *groups, int cpu,
-				int (*func)(struct perf_event *, void *),
-				void *data)
+static int merge_sched_in(struct perf_event_context *ctx,
+			struct perf_cpu_context *cpuctx,
+			struct perf_event *event,
+			bool is_pinned,
+			int *can_add_hw)
+{
+	WARN_ON_ONCE(event->ctx != ctx);
+
+	if (event->state <= PERF_EVENT_STATE_OFF)
+		return 0;
+
+	if (!event_filter_match(event))
+		return 0;
+
+	if (group_can_go_on(event, cpuctx, 1)) {
+		if (!group_sched_in(event, cpuctx, ctx)) {
+			list_add_tail(&event->active_list, is_pinned
+				? &ctx->pinned_active
+				: &ctx->flexible_active);
+		}
+	}
+
+	if (event->state == PERF_EVENT_STATE_INACTIVE) {
+		if (is_pinned)
+			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
+
+		*can_add_hw = 0;
+		ctx->rotate_necessary = 1;
+	}
+
+	return 0;
+}
+
+static int ctx_groups_sched_in(struct perf_event_context *ctx,
+			struct perf_cpu_context *cpuctx,
+			bool is_pinned)
 {
 #ifdef CONFIG_CGROUP_PERF
 	struct cgroup_subsys_state *css = NULL;
@@ -3537,9 +3570,13 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 	struct min_heap event_heap;
 	struct perf_event **evt;
 	struct perf_event *next;
-	int ret;
+	int ret, can_add_hw = 1;
+	int cpu = smp_processor_id();
+	struct perf_event_groups *groups = is_pinned
+		? &ctx->pinned_groups
+		: &ctx->flexible_groups;
 
-	if (cpuctx) {
+	if (ctx == &cpuctx->ctx) {
 		event_heap = (struct min_heap){
 			.data = cpuctx->itr_storage,
 			.size = 0,
@@ -3576,7 +3613,8 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 	min_heapify_all(&event_heap, &perf_min_heap);
 
 	while (event_heap.size) {
-		ret = func(*evt, data);
+		ret = merge_sched_in(ctx, cpuctx, *evt, is_pinned, &can_add_hw);
+
 		if (ret)
 			return ret;
 
@@ -3590,96 +3628,6 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 	return 0;
 }
 
-struct sched_in_data {
-	struct perf_event_context *ctx;
-	struct perf_cpu_context *cpuctx;
-	int can_add_hw;
-};
-
-static int pinned_sched_in(struct perf_event *event, void *data)
-{
-	struct sched_in_data *sid = data;
-
-	if (event->state <= PERF_EVENT_STATE_OFF)
-		return 0;
-
-	if (!event_filter_match(event))
-		return 0;
-
-	if (group_can_go_on(event, sid->cpuctx, sid->can_add_hw)) {
-		if (!group_sched_in(event, sid->cpuctx, sid->ctx))
-			list_add_tail(&event->active_list, &sid->ctx->pinned_active);
-	}
-
-	/*
-	 * If this pinned group hasn't been scheduled,
-	 * put it in error state.
-	 */
-	if (event->state == PERF_EVENT_STATE_INACTIVE)
-		perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
-
-	return 0;
-}
-
-static int flexible_sched_in(struct perf_event *event, void *data)
-{
-	struct sched_in_data *sid = data;
-
-	if (event->state <= PERF_EVENT_STATE_OFF)
-		return 0;
-
-	if (!event_filter_match(event))
-		return 0;
-
-	if (group_can_go_on(event, sid->cpuctx, sid->can_add_hw)) {
-		int ret = group_sched_in(event, sid->cpuctx, sid->ctx);
-		if (ret) {
-			sid->can_add_hw = 0;
-			sid->ctx->rotate_necessary = 1;
-			return 0;
-		}
-		list_add_tail(&event->active_list, &sid->ctx->flexible_active);
-	}
-
-	return 0;
-}
-
-static void
-ctx_pinned_sched_in(struct perf_event_context *ctx,
-		    struct perf_cpu_context *cpuctx)
-{
-	struct sched_in_data sid = {
-		.ctx = ctx,
-		.cpuctx = cpuctx,
-		.can_add_hw = 1,
-	};
-
-	if (ctx != &cpuctx->ctx)
-		cpuctx = NULL;
-
-	visit_groups_merge(cpuctx, &ctx->pinned_groups,
-			   smp_processor_id(),
-			   pinned_sched_in, &sid);
-}
-
-static void
-ctx_flexible_sched_in(struct perf_event_context *ctx,
-		      struct perf_cpu_context *cpuctx)
-{
-	struct sched_in_data sid = {
-		.ctx = ctx,
-		.cpuctx = cpuctx,
-		.can_add_hw = 1,
-	};
-
-	if (ctx != &cpuctx->ctx)
-		cpuctx = NULL;
-
-	visit_groups_merge(cpuctx, &ctx->flexible_groups,
-			   smp_processor_id(),
-			   flexible_sched_in, &sid);
-}
-
 static void
 ctx_sched_in(struct perf_event_context *ctx,
 	     struct perf_cpu_context *cpuctx,
@@ -3716,11 +3664,12 @@ ctx_sched_in(struct perf_event_context *ctx,
 	 * in order to give them the best chance of going on.
 	 */
 	if (is_active & EVENT_PINNED)
-		ctx_pinned_sched_in(ctx, cpuctx);
+		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/true);
+
 
 	/* Then walk through the lower prio flexible groups */
 	if (is_active & EVENT_FLEXIBLE)
-		ctx_flexible_sched_in(ctx, cpuctx);
+		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/false);
 }
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
-- 
2.24.0.393.g34dc348eaf-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v5 08/10] perf: cache perf_event_groups_first for cgroups
  2019-12-06 23:15   ` [PATCH v5 00/10] Optimize cgroup context switch Ian Rogers
                       ` (6 preceding siblings ...)
  2019-12-06 23:15     ` [PATCH v5 07/10] perf: simplify and rename visit_groups_merge Ian Rogers
@ 2019-12-06 23:15     ` Ian Rogers
  2019-12-06 23:15     ` [PATCH v5 09/10] perf: optimize event_filter_match during sched_in Ian Rogers
                       ` (2 subsequent siblings)
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-12-06 23:15 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Add a per-CPU cache of the pinned and flexible perf_event_groups_first
values for a cgroup, avoiding O(log(#perf events)) searches during
sched_in.
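
For illustration only, the cached pointer per (cgroup, CPU) always refers to
that cgroup's lowest group_index event in the chosen tree; a sketch of the
maintenance done on insert and delete (mirroring the hunks below) is:

   /* On insert: remember the smallest group_index event for this cgroup. */
   if (!*cgrp_event || event->group_index < (*cgrp_event)->group_index)
           *cgrp_event = event;

   /* On delete: advance the cache to the next event of the same cpu/cgroup. */
   if (*cgrp_event == event)
           *cgrp_event = perf_event_groups_next(event);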

Based-on-work-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 include/linux/perf_event.h |  7 ++++
 kernel/events/core.c       | 84 ++++++++++++++++++++++++++++++++++----
 2 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index cd7d3b624655..a29a38df909e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -892,6 +892,13 @@ struct perf_cgroup_info {
 struct perf_cgroup {
 	struct cgroup_subsys_state	css;
 	struct perf_cgroup_info	__percpu *info;
+	/*
+	 * Per-CPU caches of this cgroup's first (lowest group_index) event
+	 * in the perf_cpu_context's pinned_groups and flexible_groups trees.
+	 * Avoids an rbtree search during sched_in.
+	 */
+	struct perf_event * __percpu    *pinned_event;
+	struct perf_event * __percpu    *flexible_event;
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3da9cc1ebc2d..5935d2474050 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1638,6 +1638,27 @@ perf_event_groups_insert(struct perf_event_groups *groups,
 
 	rb_link_node(&event->group_node, parent, node);
 	rb_insert_color(&event->group_node, &groups->tree);
+#ifdef CONFIG_CGROUP_PERF
+	if (is_cgroup_event(event)) {
+		struct perf_event **cgrp_event;
+
+		if (event->attr.pinned) {
+			cgrp_event = per_cpu_ptr(event->cgrp->pinned_event,
+						event->cpu);
+		} else {
+			cgrp_event = per_cpu_ptr(event->cgrp->flexible_event,
+						event->cpu);
+		}
+		/*
+		 * Remember smallest, left-most, group index event. The
+		 * less-than condition is only possible if the group_index
+		 * overflows.
+		 */
+		if (!*cgrp_event ||
+			event->group_index < (*cgrp_event)->group_index)
+			*cgrp_event = event;
+	}
+#endif
 }
 
 /*
@@ -1652,6 +1673,9 @@ add_event_to_groups(struct perf_event *event, struct perf_event_context *ctx)
 	perf_event_groups_insert(groups, event);
 }
 
+static struct perf_event *
+perf_event_groups_next(struct perf_event *event);
+
 /*
  * Delete a group from a tree.
  */
@@ -1662,6 +1686,22 @@ perf_event_groups_delete(struct perf_event_groups *groups,
 	WARN_ON_ONCE(RB_EMPTY_NODE(&event->group_node) ||
 		     RB_EMPTY_ROOT(&groups->tree));
 
+#ifdef CONFIG_CGROUP_PERF
+	if (is_cgroup_event(event)) {
+		struct perf_event **cgrp_event;
+
+		if (event->attr.pinned) {
+			cgrp_event = per_cpu_ptr(event->cgrp->pinned_event,
+						event->cpu);
+		} else {
+			cgrp_event = per_cpu_ptr(event->cgrp->flexible_event,
+						event->cpu);
+		}
+		if (*cgrp_event == event)
+			*cgrp_event = perf_event_groups_next(event);
+	}
+#endif
+
 	rb_erase(&event->group_node, &groups->tree);
 	init_event_group(event);
 }
@@ -1679,7 +1719,8 @@ del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
 }
 
 /*
- * Get the leftmost event in the cpu/cgroup subtree.
+ * Get the leftmost event in the cpu subtree without a cgroup (ie task or
+ * system-wide).
  */
 static struct perf_event *
 perf_event_groups_first(struct perf_event_groups *groups, int cpu,
@@ -3596,8 +3637,8 @@ static int ctx_groups_sched_in(struct perf_event_context *ctx,
 			.cap = ARRAY_SIZE(itrs),
 		};
 		/* Events not within a CPU context may be on any CPU. */
-		__heap_add(&event_heap, perf_event_groups_first(groups, -1,
-									NULL));
+		__heap_add(&event_heap,
+			perf_event_groups_first(groups, -1, NULL));
 	}
 	evt = event_heap.data;
 
@@ -3605,8 +3646,16 @@ static int ctx_groups_sched_in(struct perf_event_context *ctx,
 
 #ifdef CONFIG_CGROUP_PERF
 	for (; css; css = css->parent) {
-		__heap_add(&event_heap, perf_event_groups_first(groups, cpu,
-								css->cgroup));
+		struct perf_cgroup *cgrp;
+
+		/* root cgroup doesn't have events */
+		if (css->id == 1)
+			break;
+
+		cgrp = container_of(css, struct perf_cgroup, css);
+		__heap_add(&event_heap, is_pinned
+			? *per_cpu_ptr(cgrp->pinned_event, cpu)
+			: *per_cpu_ptr(cgrp->flexible_event, cpu));
 	}
 #endif
 
@@ -12672,18 +12721,35 @@ perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return ERR_PTR(-ENOMEM);
 
 	jc->info = alloc_percpu(struct perf_cgroup_info);
-	if (!jc->info) {
-		kfree(jc);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!jc->info)
+		goto free_jc;
+
+	jc->pinned_event = alloc_percpu(struct perf_event *);
+	if (!jc->pinned_event)
+		goto free_jc_info;
+
+	jc->flexible_event = alloc_percpu(struct perf_event *);
+	if (!jc->flexible_event)
+		goto free_jc_pinned;
 
 	return &jc->css;
+
+free_jc_pinned:
+	free_percpu(jc->pinned_event);
+free_jc_info:
+	free_percpu(jc->info);
+free_jc:
+	kfree(jc);
+
+	return ERR_PTR(-ENOMEM);
 }
 
 static void perf_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct perf_cgroup *jc = container_of(css, struct perf_cgroup, css);
 
+	free_percpu(jc->pinned_event);
+	free_percpu(jc->flexible_event);
 	free_percpu(jc->info);
 	kfree(jc);
 }
-- 
2.24.0.393.g34dc348eaf-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v5 09/10] perf: optimize event_filter_match during sched_in
  2019-12-06 23:15   ` [PATCH v5 00/10] Optimize cgroup context switch Ian Rogers
                       ` (7 preceding siblings ...)
  2019-12-06 23:15     ` [PATCH v5 08/10] perf: cache perf_event_groups_first for cgroups Ian Rogers
@ 2019-12-06 23:15     ` Ian Rogers
  2019-12-06 23:15     ` [PATCH v5 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch Ian Rogers
  2020-02-14  7:51     ` [PATCH v6 0/6] Optimize cgroup context switch Ian Rogers
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-12-06 23:15 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

The caller verified the CPU and cgroup so directly call
pmu_filter_match.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5935d2474050..bcaf100d8167 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2227,8 +2227,11 @@ static inline int pmu_filter_match(struct perf_event *event)
 static inline int
 event_filter_match(struct perf_event *event)
 {
-	return (event->cpu == -1 || event->cpu == smp_processor_id()) &&
-	       perf_cgroup_match(event) && pmu_filter_match(event);
+	if (event->cpu != -1 && event->cpu != smp_processor_id())
+		return 0;
+	if (!perf_cgroup_match(event))
+		return 0;
+	return pmu_filter_match(event);
 }
 
 static void
@@ -3577,7 +3580,11 @@ static int merge_sched_in(struct perf_event_context *ctx,
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
 
-	if (!event_filter_match(event))
+	/*
+	 * Avoid full event_filter_match as the caller verified the CPU and
+	 * cgroup before calling.
+	 */
+	if (!pmu_filter_match(event))
 		return 0;
 
 	if (group_can_go_on(event, cpuctx, 1)) {
-- 
2.24.0.393.g34dc348eaf-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v5 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch
  2019-12-06 23:15   ` [PATCH v5 00/10] Optimize cgroup context switch Ian Rogers
                       ` (8 preceding siblings ...)
  2019-12-06 23:15     ` [PATCH v5 09/10] perf: optimize event_filter_match during sched_in Ian Rogers
@ 2019-12-06 23:15     ` Ian Rogers
  2020-02-14  7:51     ` [PATCH v6 0/6] Optimize cgroup context switch Ian Rogers
  10 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-12-06 23:15 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Masahiro Yamada, Kees Cook, Catalin Marinas,
	Petr Mladek, Mauro Carvalho Chehab, Qian Cai, Joe Lawrence,
	Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

From: Kan Liang <kan.liang@linux.intel.com>

When counting system-wide events and cgroup events simultaneously, the
system-wide events are always scheduled out then back in during cgroup
switches, bringing extra overhead and possibly missing events. Switching
out system-wide flexible events may still be necessary if the incoming
task's cgroups have pinned events that need to be scheduled in at a higher
priority than the system-wide flexible events.
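
Schematically (a sketch only, mirroring the perf_cgroup_switch() change
below), the sched-in side of a cgroup switch becomes:

   if (cgroup_has_pinned_events(cpuctx->cgrp)) {
           /*
            * Pinned cgroup events may need counters currently held by
            * system-wide flexible events, so keep the pinned > flexible
            * priority by scheduling the flexible events out first.
            */
           cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
           cpu_ctx_sched_in(cpuctx, EVENT_ALL | EVENT_CGROUP_PINNED_ONLY, task);
   } else {
           /* Otherwise only cgroup events need to be switched in. */
           cpu_ctx_sched_in(cpuctx, EVENT_ALL | EVENT_CGROUP_ALL_ONLY, task);
   }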

Here is test with 6 child cgroups (sibling cgroups), 1 parent cgroup
and system-wide events.
A specjbb benchmark is running in each child cgroup.
The perf command is as below.
   perf stat -e cycles,instructions -e cycles,instructions
   -e cycles,instructions -e cycles,instructions
   -e cycles,instructions -e cycles,instructions
   -e cycles,instructions -e cycles,instructions
   -G cgroup1,cgroup1,cgroup2,cgroup2,cgroup3,cgroup3
   -G cgroup4,cgroup4,cgroup5,cgroup5,cgroup6,cgroup6
   -G cgroup_parent,cgroup_parent
   -a -e cycles,instructions -I 1000

The average RT (Response Time) reported from specjbb is
used as key performance metrics. (The lower the better)
                                        RT(us)              Overhead
Baseline (no perf stat):                4286.9
Use cgroup perf, no patches:            4537.1                5.84%
Use cgroup perf, apply the patch:       4440.7                3.59%

Fixes: e5d1367f17ba ("perf: Add cgroup support")
---
This patch was rebased based on: https://lkml.org/lkml/2019/8/7/771
with some minor changes to comments made by: Ian Rogers
<irogers@google.com>

Signed-off-by: Ian Rogers <irogers@google.com>
---
 include/linux/perf_event.h |   1 +
 kernel/events/core.c       | 133 ++++++++++++++++++++++++++++++++++---
 2 files changed, 123 insertions(+), 11 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a29a38df909e..7aa5df2a33eb 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -892,6 +892,7 @@ struct perf_cgroup_info {
 struct perf_cgroup {
 	struct cgroup_subsys_state	css;
 	struct perf_cgroup_info	__percpu *info;
+	unsigned int			nr_pinned_event;
 	/*
 	 * Per-CPU caches of this cgroup's first (lowest group_index) event
 	 * in the perf_cpu_context's pinned_groups and flexible_groups trees.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bcaf100d8167..fc7e9e4b8e3c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -362,8 +362,18 @@ enum event_type_t {
 	/* see ctx_resched() for details */
 	EVENT_CPU = 0x8,
 	EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
+
+	/* see perf_cgroup_switch() for details */
+	EVENT_CGROUP_FLEXIBLE_ONLY = 0x10,
+	EVENT_CGROUP_PINNED_ONLY = 0x20,
+	EVENT_CGROUP_ALL_ONLY = EVENT_CGROUP_FLEXIBLE_ONLY |
+				EVENT_CGROUP_PINNED_ONLY,
+
 };
 
+#define CGROUP_PINNED(type)	(type & EVENT_CGROUP_PINNED_ONLY)
+#define CGROUP_FLEXIBLE(type)	(type & EVENT_CGROUP_FLEXIBLE_ONLY)
+
 /*
  * perf_sched_events : >0 events exist
  * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
@@ -668,6 +678,20 @@ perf_event_set_state(struct perf_event *event, enum perf_event_state state)
 
 #ifdef CONFIG_CGROUP_PERF
 
+/* Skip system-wide CPU events if only cgroup events are required. */
+static inline bool
+perf_cgroup_skip_switch(enum event_type_t event_type,
+			struct perf_event *event,
+			bool is_pinned)
+{
+	if (event->cgrp)
+		return false;
+	if (is_pinned)
+		return !!CGROUP_PINNED(event_type);
+	else
+		return !!CGROUP_FLEXIBLE(event_type);
+}
+
 static inline bool
 perf_cgroup_match(struct perf_event *event)
 {
@@ -694,6 +718,8 @@ perf_cgroup_match(struct perf_event *event)
 
 static inline void perf_detach_cgroup(struct perf_event *event)
 {
+	if (event->attr.pinned)
+		event->cgrp->nr_pinned_event--;
 	css_put(&event->cgrp->css);
 	event->cgrp = NULL;
 }
@@ -781,6 +807,22 @@ perf_cgroup_set_timestamp(struct task_struct *task,
 	}
 }
 
+/* Check if cgroup and its ancestor have pinned events attached */
+static bool
+cgroup_has_pinned_events(struct perf_cgroup *cgrp)
+{
+	struct cgroup_subsys_state *css;
+	struct perf_cgroup *tmp_cgrp;
+
+	for (css = &cgrp->css; css; css = css->parent) {
+		tmp_cgrp = container_of(css, struct perf_cgroup, css);
+		if (tmp_cgrp->nr_pinned_event > 0)
+			return true;
+	}
+
+	return false;
+}
+
 static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
 
 #define PERF_CGROUP_SWOUT	0x1 /* cgroup switch out every event */
@@ -812,7 +854,22 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 		perf_pmu_disable(cpuctx->ctx.pmu);
 
 		if (mode & PERF_CGROUP_SWOUT) {
-			cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+			/*
+			 * The system-wide events and cgroup events share the
+			 * same cpuctx groups. Decide which events to be
+			 * scheduled out based on the types of events:
+			 * - EVENT_FLEXIBLE | EVENT_CGROUP_FLEXIBLE_ONLY:
+			 *   Only switch cgroup events from EVENT_FLEXIBLE
+			 *   groups.
+			 * - EVENT_PINNED | EVENT_CGROUP_PINNED_ONLY:
+			 *   Only switch cgroup events from EVENT_PINNED
+			 *   groups.
+			 * - EVENT_ALL | EVENT_CGROUP_ALL_ONLY:
+			 *   Only switch cgroup events from both EVENT_FLEXIBLE
+			 *   and EVENT_PINNED groups.
+			 */
+			cpu_ctx_sched_out(cpuctx,
+					EVENT_ALL | EVENT_CGROUP_ALL_ONLY);
 			/*
 			 * must not be done before ctxswout due
 			 * to event_filter_match() in event_sched_out()
@@ -831,7 +888,23 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 			 */
 			cpuctx->cgrp = perf_cgroup_from_task(task,
 							     &cpuctx->ctx);
-			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
+
+			/*
+			 * To keep the priority order of cpu pinned then cpu
+			 * flexible, if the new cgroup has pinned events then
+			 * sched out all system-wide flexible events before
+			 * sched in all events.
+			 */
+			if (cgroup_has_pinned_events(cpuctx->cgrp)) {
+				cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+				cpu_ctx_sched_in(cpuctx,
+					EVENT_ALL | EVENT_CGROUP_PINNED_ONLY,
+					task);
+			} else {
+				cpu_ctx_sched_in(cpuctx,
+					EVENT_ALL | EVENT_CGROUP_ALL_ONLY,
+					task);
+			}
 		}
 		perf_pmu_enable(cpuctx->ctx.pmu);
 		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -959,6 +1032,9 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 	cgrp = container_of(css, struct perf_cgroup, css);
 	event->cgrp = cgrp;
 
+	if (event->attr.pinned)
+		cgrp->nr_pinned_event++;
+
 	/*
 	 * all events in a group must monitor
 	 * the same cgroup because a task belongs
@@ -1032,6 +1108,14 @@ list_update_cgroup_event(struct perf_event *event,
 
 #else /* !CONFIG_CGROUP_PERF */
 
+static inline bool
+perf_cgroup_skip_switch(enum event_type_t event_type,
+			struct perf_event *event,
+			bool pinned)
+{
+	return false;
+}
+
 static inline bool
 perf_cgroup_match(struct perf_event *event)
 {
@@ -3236,13 +3320,25 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 
 	perf_pmu_disable(ctx->pmu);
 	if (is_active & EVENT_PINNED) {
-		list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
+		list_for_each_entry_safe(event, tmp, &ctx->pinned_active,
+					active_list) {
+			if (perf_cgroup_skip_switch(event_type, event, true)) {
+				ctx->is_active |= EVENT_PINNED;
+				continue;
+			}
 			group_sched_out(event, cpuctx, ctx);
+		}
 	}
 
 	if (is_active & EVENT_FLEXIBLE) {
-		list_for_each_entry_safe(event, tmp, &ctx->flexible_active, active_list)
+		list_for_each_entry_safe(event, tmp, &ctx->flexible_active,
+					active_list) {
+			if (perf_cgroup_skip_switch(event_type, event, false)) {
+				ctx->is_active |= EVENT_FLEXIBLE;
+				continue;
+			}
 			group_sched_out(event, cpuctx, ctx);
+		}
 	}
 	perf_pmu_enable(ctx->pmu);
 }
@@ -3573,6 +3669,7 @@ static int merge_sched_in(struct perf_event_context *ctx,
 			struct perf_cpu_context *cpuctx,
 			struct perf_event *event,
 			bool is_pinned,
+			enum event_type_t event_type,
 			int *can_add_hw)
 {
 	WARN_ON_ONCE(event->ctx != ctx);
@@ -3580,6 +3677,9 @@ static int merge_sched_in(struct perf_event_context *ctx,
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
 
+	if (perf_cgroup_skip_switch(event_type, event, is_pinned))
+		return 0;
+
 	/*
 	 * Avoid full event_filter_match as the caller verified the CPU and
 	 * cgroup before calling.
@@ -3608,7 +3708,8 @@ static int merge_sched_in(struct perf_event_context *ctx,
 
 static int ctx_groups_sched_in(struct perf_event_context *ctx,
 			struct perf_cpu_context *cpuctx,
-			bool is_pinned)
+			bool is_pinned,
+			enum event_type_t event_type)
 {
 #ifdef CONFIG_CGROUP_PERF
 	struct cgroup_subsys_state *css = NULL;
@@ -3669,7 +3770,8 @@ static int ctx_groups_sched_in(struct perf_event_context *ctx,
 	min_heapify_all(&event_heap, &perf_min_heap);
 
 	while (event_heap.size) {
-		ret = merge_sched_in(ctx, cpuctx, *evt, is_pinned, &can_add_hw);
+		ret = merge_sched_in(ctx, cpuctx, *evt, is_pinned, event_type,
+				&can_add_hw);
 
 		if (ret)
 			return ret;
@@ -3690,6 +3792,7 @@ ctx_sched_in(struct perf_event_context *ctx,
 	     enum event_type_t event_type,
 	     struct task_struct *task)
 {
+	enum event_type_t ctx_event_type = event_type & EVENT_ALL;
 	int is_active = ctx->is_active;
 	u64 now;
 
@@ -3698,7 +3801,7 @@ ctx_sched_in(struct perf_event_context *ctx,
 	if (likely(!ctx->nr_events))
 		return;
 
-	ctx->is_active |= (event_type | EVENT_TIME);
+	ctx->is_active |= (ctx_event_type | EVENT_TIME);
 	if (ctx->task) {
 		if (!is_active)
 			cpuctx->task_ctx = ctx;
@@ -3718,14 +3821,22 @@ ctx_sched_in(struct perf_event_context *ctx,
 	/*
 	 * First go through the list and put on any pinned groups
 	 * in order to give them the best chance of going on.
+	 *
+	 * System-wide events may not have been scheduled out for a cgroup
+	 * switch.  Unconditionally call sched_in() for cgroup events and
+	 * it will filter the events.
 	 */
-	if (is_active & EVENT_PINNED)
-		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/true);
+	if ((is_active & EVENT_PINNED) || CGROUP_PINNED(event_type)) {
+		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/true,
+				CGROUP_PINNED(event_type));
+	}
 
 
 	/* Then walk through the lower prio flexible groups */
-	if (is_active & EVENT_FLEXIBLE)
-		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/false);
+	if ((is_active & EVENT_FLEXIBLE) || CGROUP_FLEXIBLE(event_type)) {
+		ctx_groups_sched_in(ctx, cpuctx, /*is_pinned=*/false,
+				CGROUP_FLEXIBLE(event_type));
+	}
 }
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
-- 
2.24.0.393.g34dc348eaf-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH v3 00/10] Optimize cgroup context switch
  2019-11-14 18:17   ` Ian Rogers
@ 2019-12-06 23:16     ` Ian Rogers
  0 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2019-12-06 23:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Masahiro Yamada, Kees Cook, Catalin Marinas, Petr Mladek,
	Mauro Carvalho Chehab, Qian Cai, Joe Lawrence, Tetsuo Handa,
	Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang, LKML,
	Stephane Eranian, Andi Kleen

On Thu, Nov 14, 2019 at 10:17 AM Ian Rogers <irogers@google.com> wrote:
>
> On Thu, Nov 14, 2019 at 2:45 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Nov 13, 2019 at 04:30:32PM -0800, Ian Rogers wrote:
> > > Avoid iterating over all per-CPU events during cgroup changing context
> > > switches by organizing events by cgroup.
> >
> > When last we spoke (Plumbers in Lisbon) you mentioned that this
> > optimization was yielding far less than expected. You had graphs showing
> > how the use of cgroups impacted event scheduling time and how this patch
> > set only reduced that a little.
> >
> > Any update on all that? There seems to be a conspicuous lack of such
> > data here.
>
> I'm working on giving an update on the numbers but I suspect they are
> better than I'd measured ahead of LPC due to a bug in a script.
>
> Thanks,
> Ian

Apologies for the delay, I'm sending v5 that addresses review
comments. I'm still working on performance numbers.

Thanks,
Ian

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 03/10] perf: Use min_max_heap in visit_groups_merge
  2019-12-06 23:15     ` [PATCH v5 03/10] perf: Use min_max_heap in visit_groups_merge Ian Rogers
@ 2019-12-08  7:10       ` kbuild test robot
  0 siblings, 0 replies; 80+ messages in thread
From: kbuild test robot @ 2019-12-08  7:10 UTC (permalink / raw)
  To: Ian Rogers
  Cc: kbuild-all, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andrew Morton, Masahiro Yamada,
	Kees Cook, Catalin Marinas, Petr Mladek, Mauro Carvalho Chehab,
	Qian Cai, Joe Lawrence, Tetsuo Handa, Uladzislau Rezki (Sony),
	Andy Shevchenko, Ard Biesheuvel, David S. Miller,
	Kent Overstreet, Gary Hook, Arnd Bergmann, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen, Ian Rogers

[-- Attachment #1: Type: text/plain, Size: 1542 bytes --]

Hi Ian,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/perf/core]
[cannot apply to tip/auto-latest linus/master v5.4 next-20191206]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Ian-Rogers/Optimize-cgroup-context-switch/20191208-065350
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git ceb9e77324fa661b1001a0ae66f061b5fcb4e4e6
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.5.0-1) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

Note: the linux-review/Ian-Rogers/Optimize-cgroup-context-switch/20191208-065350 HEAD 90f4b194c2271d12b02888811e887a869d1e3817 builds fine.
      It only hurts bisectibility.

All errors (new ones prefixed by >>):

>> kernel/events/core.c:52:10: fatal error: linux/min_max_heap.h: No such file or directory
    #include <linux/min_max_heap.h>
             ^~~~~~~~~~~~~~~~~~~~~~
   compilation terminated.

vim +52 kernel/events/core.c

  > 52	#include <linux/min_max_heap.h>
    53	

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 7223 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v6 0/6] Optimize cgroup context switch
  2019-12-06 23:15   ` [PATCH v5 00/10] Optimize cgroup context switch Ian Rogers
                       ` (9 preceding siblings ...)
  2019-12-06 23:15     ` [PATCH v5 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch Ian Rogers
@ 2020-02-14  7:51     ` Ian Rogers
  2020-02-14  7:51       ` [PATCH v6 1/6] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
                         ` (7 more replies)
  10 siblings, 8 replies; 80+ messages in thread
From: Ian Rogers @ 2020-02-14  7:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Randy Dunlap, Masahiro Yamada, Shuah Khan,
	Krzysztof Kozlowski, Kees Cook, Paul E. McKenney,
	Masami Hiramatsu, Marco Elver, Kent Overstreet, Andy Shevchenko,
	Ard Biesheuvel, Gary Hook, Kan Liang, linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Avoid iterating over all per-CPU events during cgroup changing context
switches by organizing events by cgroup.

To make an efficient set of iterators, introduce a min-heap
utility with test.

The v6 patch reduces the patch set by 4 patches, it updates the cgroup
id and fixes part of the min_heap rename from v5.

The v5 patch set renames min_max_heap to min_heap as suggested by
Peter Zijlstra, it also addresses comments around preferring
__always_inline over inline.

The v4 patch set addresses review comments on the v3 patch set by
Peter Zijlstra.

These patches include a caching algorithm to improve the search for
the first event in a group by Kan Liang <kan.liang@linux.intel.com> as
well as rebasing his "optimize event_filter_match during sched_in"
from https://lkml.org/lkml/2019/8/7/771.

The v2 patch set was modified by Peter Zijlstra in his perf/cgroup
branch:
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git

These patches follow Peter's reorganization and his fixes to the
perf_cpu_context min_heap storage code.

Ian Rogers (5):
  lib: introduce generic min-heap
  perf: Use min_heap in visit_groups_merge
  perf: Add per perf_cpu_context min_heap storage
  perf/cgroup: Grow per perf_cpu_context heap storage
  perf/cgroup: Order events in RB tree by cgroup id

Peter Zijlstra (1):
  perf/cgroup: Reorder perf_cgroup_connect()

 include/linux/min_heap.h   | 135 ++++++++++++++++++++
 include/linux/perf_event.h |   7 ++
 kernel/events/core.c       | 251 +++++++++++++++++++++++++++++++------
 lib/Kconfig.debug          |  10 ++
 lib/Makefile               |   1 +
 lib/test_min_heap.c        | 194 ++++++++++++++++++++++++++++
 6 files changed, 563 insertions(+), 35 deletions(-)
 create mode 100644 include/linux/min_heap.h
 create mode 100644 lib/test_min_heap.c

-- 
2.25.0.265.gbab2e86ba0-goog


^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v6 1/6] perf/cgroup: Reorder perf_cgroup_connect()
  2020-02-14  7:51     ` [PATCH v6 0/6] Optimize cgroup context switch Ian Rogers
@ 2020-02-14  7:51       ` Ian Rogers
  2020-02-14 16:11         ` Shuah Khan
  2020-03-06 14:42         ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
  2020-02-14  7:51       ` [PATCH v6 2/6] lib: introduce generic min-heap Ian Rogers
                         ` (6 subsequent siblings)
  7 siblings, 2 replies; 80+ messages in thread
From: Ian Rogers @ 2020-02-14  7:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Randy Dunlap, Masahiro Yamada, Shuah Khan,
	Krzysztof Kozlowski, Kees Cook, Paul E. McKenney,
	Masami Hiramatsu, Marco Elver, Kent Overstreet, Andy Shevchenko,
	Ard Biesheuvel, Gary Hook, Kan Liang, linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

From: Peter Zijlstra <peterz@infradead.org>

Move perf_cgroup_connect() after perf_event_alloc(), such that we can
find/use the PMU's cpu context.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3f1f77de7247..9bd2af954c54 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10804,12 +10804,6 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (!has_branch_stack(event))
 		event->attr.branch_sample_type = 0;
 
-	if (cgroup_fd != -1) {
-		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
-		if (err)
-			goto err_ns;
-	}
-
 	pmu = perf_init_event(event);
 	if (IS_ERR(pmu)) {
 		err = PTR_ERR(pmu);
@@ -10831,6 +10825,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		goto err_pmu;
 	}
 
+	if (cgroup_fd != -1) {
+		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
+		if (err)
+			goto err_pmu;
+	}
+
 	err = exclusive_event_init(event);
 	if (err)
 		goto err_pmu;
@@ -10891,12 +10891,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	exclusive_event_destroy(event);
 
 err_pmu:
+	if (is_cgroup_event(event))
+		perf_detach_cgroup(event);
 	if (event->destroy)
 		event->destroy(event);
 	module_put(pmu->module);
 err_ns:
-	if (is_cgroup_event(event))
-		perf_detach_cgroup(event);
 	if (event->ns)
 		put_pid_ns(event->ns);
 	if (event->hw.target)
-- 
2.25.0.265.gbab2e86ba0-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v6 2/6] lib: introduce generic min-heap
  2020-02-14  7:51     ` [PATCH v6 0/6] Optimize cgroup context switch Ian Rogers
  2020-02-14  7:51       ` [PATCH v6 1/6] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
@ 2020-02-14  7:51       ` Ian Rogers
  2020-02-14 22:06         ` Randy Dunlap
                           ` (2 more replies)
  2020-02-14  7:51       ` [PATCH v6 3/6] perf: Use min_heap in visit_groups_merge Ian Rogers
                         ` (5 subsequent siblings)
  7 siblings, 3 replies; 80+ messages in thread
From: Ian Rogers @ 2020-02-14  7:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Randy Dunlap, Masahiro Yamada, Shuah Khan,
	Krzysztof Kozlowski, Kees Cook, Paul E. McKenney,
	Masami Hiramatsu, Marco Elver, Kent Overstreet, Andy Shevchenko,
	Ard Biesheuvel, Gary Hook, Kan Liang, linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Supports push, pop and converting an array into a heap. If the sense of
the compare function is inverted then it can provide a max-heap.
Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
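
As a usage illustration only (the function and variable names below are made
up, not part of the patch), a caller provides the element storage and the
comparison/swap callbacks and then drives the heap through the helpers:

   #include <linux/kernel.h>
   #include <linux/min_heap.h>

   static bool less_than(const void *lhs, const void *rhs)
   {
           return *(const int *)lhs < *(const int *)rhs;
   }

   static void swap_ints(void *lhs, void *rhs)
   {
           int tmp = *(int *)lhs;

           *(int *)lhs = *(int *)rhs;
           *(int *)rhs = tmp;
   }

   static void min_heap_example(void)
   {
           int storage[8];
           struct min_heap heap = {
                   .data = storage,
                   .size = 0,
                   .cap = ARRAY_SIZE(storage),
           };
           struct min_heap_callbacks funcs = {
                   .elem_size = sizeof(int),
                   .cmp = less_than,
                   .swp = swap_ints,
           };
           int v = 42;

           min_heap_push(&heap, &v, &funcs);   /* O(log2(size)) insert */
           /* storage[0] is now the smallest value pushed so far. */
           min_heap_pop(&heap, &funcs);        /* remove the minimum */
   }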

Signed-off-by: Ian Rogers <irogers@google.com>
---
 include/linux/min_heap.h | 135 +++++++++++++++++++++++++++
 lib/Kconfig.debug        |  10 ++
 lib/Makefile             |   1 +
 lib/test_min_heap.c      | 194 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 340 insertions(+)
 create mode 100644 include/linux/min_heap.h
 create mode 100644 lib/test_min_heap.c

diff --git a/include/linux/min_heap.h b/include/linux/min_heap.h
new file mode 100644
index 000000000000..0f04f49c0779
--- /dev/null
+++ b/include/linux/min_heap.h
@@ -0,0 +1,135 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MIN_HEAP_H
+#define _LINUX_MIN_HEAP_H
+
+#include <linux/bug.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+/**
+ * struct min_heap - Data structure to hold a min-heap.
+ * @data: Start of array holding the heap elements.
+ * @size: Number of elements currently in the heap.
+ * @cap: Maximum number of elements that can be held in current storage.
+ */
+struct min_heap {
+	void *data;
+	int size;
+	int cap;
+};
+
+/**
+ * struct min_heap_callbacks - Data/functions to customise the min_heap.
+ * @elem_size: The size of each element in bytes.
+ * @cmp: Partial order function for this heap 'less'/'<' for min-heap,
+ *       'greater'/'>' for max-heap.
+ * @swp: Swap elements function.
+ */
+struct min_heap_callbacks {
+	int elem_size;
+	bool (*cmp)(const void *lhs, const void *rhs);
+	void (*swp)(void *lhs, void *rhs);
+};
+
+/* Sift the element at pos down the heap. */
+static __always_inline
+void min_heapify(struct min_heap *heap, int pos,
+		const struct min_heap_callbacks *func)
+{
+	void *left_child, *right_child, *parent, *large_or_smallest;
+	u8 *data = (u8 *)heap->data;
+
+	for (;;) {
+		if (pos * 2 + 1 >= heap->size)
+			break;
+
+		left_child = data + ((pos * 2 + 1) * func->elem_size);
+		parent = data + (pos * func->elem_size);
+		large_or_smallest = parent;
+		if (func->cmp(left_child, large_or_smallest))
+			large_or_smallest = left_child;
+
+		if (pos * 2 + 2 < heap->size) {
+			right_child = data + ((pos * 2 + 2) * func->elem_size);
+			if (func->cmp(right_child, large_or_smallest))
+				large_or_smallest = right_child;
+		}
+		if (large_or_smallest == parent)
+			break;
+		func->swp(large_or_smallest, parent);
+		if (large_or_smallest == left_child)
+			pos = (pos * 2) + 1;
+		else
+			pos = (pos * 2) + 2;
+	}
+}
+
+/* Floyd's approach to heapification that is O(size). */
+static __always_inline
+void min_heapify_all(struct min_heap *heap,
+		const struct min_heap_callbacks *func)
+{
+	int i;
+
+	for (i = heap->size / 2; i >= 0; i--)
+		min_heapify(heap, i, func);
+}
+
+/* Remove minimum element from the heap, O(log2(size)). */
+static __always_inline
+void min_heap_pop(struct min_heap *heap,
+		const struct min_heap_callbacks *func)
+{
+	u8 *data = (u8 *)heap->data;
+
+	if (WARN_ONCE(heap->size <= 0, "Popping an empty heap"))
+		return;
+
+	/* Place last element at the root (position 0) and then sift down. */
+	heap->size--;
+	memcpy(data, data + (heap->size * func->elem_size), func->elem_size);
+	min_heapify(heap, 0, func);
+}
+
+/*
+ * Remove the minimum element and then push the given element. The
+ * implementation performs 1 sift (O(log2(size))) and is therefore more
+ * efficient than a pop followed by a push that does 2.
+ */
+static __always_inline
+void min_heap_pop_push(struct min_heap *heap,
+		const void *element,
+		const struct min_heap_callbacks *func)
+{
+	memcpy(heap->data, element, func->elem_size);
+	min_heapify(heap, 0, func);
+}
+
+/* Push an element on to the heap, O(log2(size)). */
+static __always_inline
+void min_heap_push(struct min_heap *heap, const void *element,
+		const struct min_heap_callbacks *func)
+{
+	void *child, *parent;
+	int pos;
+	u8 *data = (u8 *)heap->data;
+
+	if (WARN_ONCE(heap->size >= heap->cap, "Pushing on a full heap"))
+		return;
+
+	/* Place at the end of data. */
+	pos = heap->size;
+	memcpy(data + (pos * func->elem_size), element, func->elem_size);
+	heap->size++;
+
+	/* Sift child at pos up. */
+	for (; pos > 0; pos = (pos - 1) / 2) {
+		child = data + (pos * func->elem_size);
+		parent = data + ((pos - 1) / 2) * func->elem_size;
+		if (func->cmp(parent, child))
+			break;
+		func->swp(parent, child);
+	}
+}
+
+#endif /* _LINUX_MIN_HEAP_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 1458505192cd..e61e7fee9364 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1771,6 +1771,16 @@ config TEST_LIST_SORT
 
 	  If unsure, say N.
 
+config TEST_MIN_HEAP
+	tristate "Min heap test"
+	depends on DEBUG_KERNEL || m
+	help
+	  Enable this to turn on min heap function tests. This test is
+	  executed only once during system boot (so affects only boot time),
+	  or at module load time.
+
+	  If unsure, say N.
+
 config TEST_SORT
 	tristate "Array-based sort test"
 	depends on DEBUG_KERNEL || m
diff --git a/lib/Makefile b/lib/Makefile
index f19b85c87fda..171a6d7874a9 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -70,6 +70,7 @@ CFLAGS_test_ubsan.o += $(call cc-disable-warning, vla)
 UBSAN_SANITIZE_test_ubsan.o := y
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
 obj-$(CONFIG_TEST_LIST_SORT) += test_list_sort.o
+obj-$(CONFIG_TEST_MIN_HEAP) += test_min_heap.o
 obj-$(CONFIG_TEST_LKM) += test_module.o
 obj-$(CONFIG_TEST_VMALLOC) += test_vmalloc.o
 obj-$(CONFIG_TEST_OVERFLOW) += test_overflow.o
diff --git a/lib/test_min_heap.c b/lib/test_min_heap.c
new file mode 100644
index 000000000000..0f06d1f757b5
--- /dev/null
+++ b/lib/test_min_heap.c
@@ -0,0 +1,194 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define pr_fmt(fmt) "min_heap_test: " fmt
+
+/*
+ * Test cases for the min max heap.
+ */
+
+#include <linux/log2.h>
+#include <linux/min_heap.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/random.h>
+
+static __init bool less_than(const void *lhs, const void *rhs)
+{
+	return *(int *)lhs < *(int *)rhs;
+}
+
+static __init bool greater_than(const void *lhs, const void *rhs)
+{
+	return *(int *)lhs > *(int *)rhs;
+}
+
+static __init void swap_ints(void *lhs, void *rhs)
+{
+	int temp = *(int *)lhs;
+
+	*(int *)lhs = *(int *)rhs;
+	*(int *)rhs = temp;
+}
+
+static __init int pop_verify_heap(bool min_heap,
+				struct min_heap *heap,
+				const struct min_heap_callbacks *funcs)
+{
+	int last;
+	int *values = (int *)heap->data;
+	int err = 0;
+
+	last = values[0];
+	min_heap_pop(heap, funcs);
+	while (heap->size > 0) {
+		if (min_heap) {
+			if (last > values[0]) {
+				pr_err("error: expected %d <= %d\n", last,
+					values[0]);
+				err++;
+			}
+		} else {
+			if (last < values[0]) {
+				pr_err("error: expected %d >= %d\n", last,
+					values[0]);
+				err++;
+			}
+		}
+		last = values[0];
+		min_heap_pop(heap, funcs);
+	}
+	return err;
+}
+
+static __init int test_heapify_all(bool min_heap)
+{
+	int values[] = { 3, 1, 2, 4, 0x8000000, 0x7FFFFFF, 0,
+			 -3, -1, -2, -4, 0x8000000, 0x7FFFFFF };
+	struct min_heap heap = {
+		.data = values,
+		.size = ARRAY_SIZE(values),
+		.cap =  ARRAY_SIZE(values),
+	};
+	struct min_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.cmp = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, err;
+
+	/* Test with known set of values. */
+	min_heapify_all(&heap, &funcs);
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+
+	/* Test with randomly generated values. */
+	heap.size = ARRAY_SIZE(values);
+	for (i = 0; i < heap.size; i++)
+		values[i] = get_random_int();
+
+	min_heapify_all(&heap, &funcs);
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static __init int test_heap_push(bool min_heap)
+{
+	const int data[] = { 3, 1, 2, 4, 0x80000000, 0x7FFFFFFF, 0,
+			     -3, -1, -2, -4, 0x80000000, 0x7FFFFFFF };
+	int values[ARRAY_SIZE(data)];
+	struct min_heap heap = {
+		.data = values,
+		.size = 0,
+		.cap =  ARRAY_SIZE(values),
+	};
+	struct min_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.cmp = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, temp, err;
+
+	/* Test with known set of values copied from data. */
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_heap_push(&heap, &data[i], &funcs);
+
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+	/* Test with randomly generated values. */
+	while (heap.size < heap.cap) {
+		temp = get_random_int();
+		min_heap_push(&heap, &temp, &funcs);
+	}
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static __init int test_heap_pop_push(bool min_heap)
+{
+	const int data[] = { 3, 1, 2, 4, 0x80000000, 0x7FFFFFFF, 0,
+			     -3, -1, -2, -4, 0x80000000, 0x7FFFFFFF };
+	int values[ARRAY_SIZE(data)];
+	struct min_heap heap = {
+		.data = values,
+		.size = 0,
+		.cap =  ARRAY_SIZE(values),
+	};
+	struct min_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.cmp = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, temp, err;
+
+	/* Fill values with data to pop and replace. */
+	temp = min_heap ? 0x80000000 : 0x7FFFFFFF;
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_heap_push(&heap, &temp, &funcs);
+
+	/* Test with known set of values copied from data. */
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_heap_pop_push(&heap, &data[i], &funcs);
+
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+	heap.size = 0;
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_heap_push(&heap, &temp, &funcs);
+
+	/* Test with randomly generated values. */
+	for (i = 0; i < ARRAY_SIZE(data); i++) {
+		temp = get_random_int();
+		min_heap_pop_push(&heap, &temp, &funcs);
+	}
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static int __init test_min_heap_init(void)
+{
+	int err = 0;
+
+	err += test_heapify_all(true);
+	err += test_heapify_all(false);
+	err += test_heap_push(true);
+	err += test_heap_push(false);
+	err += test_heap_pop_push(true);
+	err += test_heap_pop_push(false);
+	if (err) {
+		pr_err("test failed with %d errors\n", err);
+		return -EINVAL;
+	}
+	pr_info("test passed\n");
+	return 0;
+}
+module_init(test_min_heap_init);
+
+static void __exit test_min_heap_exit(void)
+{
+	/* do nothing */
+}
+module_exit(test_min_heap_exit);
+
+MODULE_LICENSE("GPL");
-- 
2.25.0.265.gbab2e86ba0-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v6 3/6] perf: Use min_heap in visit_groups_merge
  2020-02-14  7:51     ` [PATCH v6 0/6] Optimize cgroup context switch Ian Rogers
  2020-02-14  7:51       ` [PATCH v6 1/6] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
  2020-02-14  7:51       ` [PATCH v6 2/6] lib: introduce generic min-heap Ian Rogers
@ 2020-02-14  7:51       ` Ian Rogers
  2020-02-17 17:23         ` Peter Zijlstra
  2020-03-06 14:42         ` [tip: perf/core] perf/core: Use min_heap in visit_groups_merge() tip-bot2 for Ian Rogers
  2020-02-14  7:51       ` [PATCH v6 4/6] perf: Add per perf_cpu_context min_heap storage Ian Rogers
                         ` (4 subsequent siblings)
  7 siblings, 2 replies; 80+ messages in thread
From: Ian Rogers @ 2020-02-14  7:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Randy Dunlap, Masahiro Yamada, Shuah Khan,
	Krzysztof Kozlowski, Kees Cook, Paul E. McKenney,
	Masami Hiramatsu, Marco Elver, Kent Overstreet, Andy Shevchenko,
	Ard Biesheuvel, Gary Hook, Kan Liang, linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

visit_groups_merge will pick the next event based on when it was
inserted into the context (perf_event group_index). Events may be per CPU
or for any CPU, but in the future we'd also like to have per cgroup events
to avoid searching all events just to find those to schedule for a cgroup.
Introduce a min heap for the event iterators that maintains the property
that the earliest inserted event is always at element 0. Initialize the
heap with the per-CPU and any-CPU events for the context.
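
For illustration only (not part of this patch): a minimal userspace
sketch of the merge pattern, assuming a simplified event struct and two
lists already ordered by group_index. The names below are made up for
the example and don't exist in the kernel.

/*
 * The two iterators (any-CPU and this-CPU lists) sit in a small array
 * maintained as a min-heap on group_index, so the earliest inserted
 * event is always at index 0.
 */
#include <stdio.h>

struct evt { int group_index; struct evt *next; };

static void sift_down(struct evt **heap, int nr)
{
	int pos = 0;

	while (2 * pos + 1 < nr) {
		int child = 2 * pos + 1;
		struct evt *tmp;

		if (child + 1 < nr &&
		    heap[child + 1]->group_index < heap[child]->group_index)
			child++;
		if (heap[pos]->group_index <= heap[child]->group_index)
			break;
		tmp = heap[pos];
		heap[pos] = heap[child];
		heap[child] = tmp;
		pos = child;
	}
}

static void visit_merge(struct evt *any_cpu, struct evt *this_cpu)
{
	struct evt *heap[2];
	int nr = 0;

	if (any_cpu)
		heap[nr++] = any_cpu;
	if (this_cpu)
		heap[nr++] = this_cpu;
	sift_down(heap, nr);			/* order the iterators */

	while (nr) {
		printf("visit group_index %d\n", heap[0]->group_index);
		heap[0] = heap[0]->next;	/* like perf_event_groups_next() */
		if (!heap[0])
			heap[0] = heap[--nr];	/* pop: last element to the root */
		if (nr)
			sift_down(heap, nr);	/* restore the heap property */
	}
}

int main(void)
{
	struct evt a2 = { 5, NULL }, a1 = { 2, &a2 };	/* any-CPU list  */
	struct evt c2 = { 4, NULL }, c1 = { 1, &c2 };	/* this-CPU list */

	visit_merge(&a1, &c1);			/* prints 1, 2, 4, 5 */
	return 0;
}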

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 72 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 54 insertions(+), 18 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9bd2af954c54..832e2a56a663 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,6 +49,7 @@
 #include <linux/sched/mm.h>
 #include <linux/proc_ns.h>
 #include <linux/mount.h>
+#include <linux/min_heap.h>
 
 #include "internal.h"
 
@@ -3388,32 +3389,67 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
 	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
 }
 
-static int visit_groups_merge(struct perf_event_groups *groups, int cpu,
-			      int (*func)(struct perf_event *, void *), void *data)
+static bool perf_cmp_group_idx(const void *l, const void *r)
 {
-	struct perf_event **evt, *evt1, *evt2;
+	const struct perf_event *le = l, *re = r;
+
+	return le->group_index < re->group_index;
+}
+
+static void swap_ptr(void *l, void *r)
+{
+	void **lp = l, **rp = r;
+
+	swap(*lp, *rp);
+}
+
+static const struct min_heap_callbacks perf_min_heap = {
+	.elem_size = sizeof(struct perf_event *),
+	.cmp = perf_cmp_group_idx,
+	.swp = swap_ptr,
+};
+
+static void __heap_add(struct min_heap *heap, struct perf_event *event)
+{
+	struct perf_event **itrs = heap->data;
+
+	if (event) {
+		itrs[heap->size] = event;
+		heap->size++;
+	}
+}
+
+static noinline int visit_groups_merge(struct perf_event_groups *groups,
+				int cpu,
+				int (*func)(struct perf_event *, void *),
+				void *data)
+{
+	/* Space for per CPU and/or any CPU event iterators. */
+	struct perf_event *itrs[2];
+	struct min_heap event_heap = {
+		.data = itrs,
+		.size = 0,
+		.cap = ARRAY_SIZE(itrs),
+	};
+	struct perf_event *next;
 	int ret;
 
-	evt1 = perf_event_groups_first(groups, -1);
-	evt2 = perf_event_groups_first(groups, cpu);
+	__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));
 
-	while (evt1 || evt2) {
-		if (evt1 && evt2) {
-			if (evt1->group_index < evt2->group_index)
-				evt = &evt1;
-			else
-				evt = &evt2;
-		} else if (evt1) {
-			evt = &evt1;
-		} else {
-			evt = &evt2;
-		}
+	min_heapify_all(&event_heap, &perf_min_heap);
 
-		ret = func(*evt, data);
+	while (event_heap.size) {
+		ret = func(itrs[0], data);
 		if (ret)
 			return ret;
 
-		*evt = perf_event_groups_next(*evt);
+		next = perf_event_groups_next(itrs[0]);
+		if (next) {
+			min_heap_pop_push(&event_heap, &next,
+					&perf_min_heap);
+		} else
+			min_heap_pop(&event_heap, &perf_min_heap);
 	}
 
 	return 0;
-- 
2.25.0.265.gbab2e86ba0-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v6 4/6] perf: Add per perf_cpu_context min_heap storage
  2020-02-14  7:51     ` [PATCH v6 0/6] Optimize cgroup context switch Ian Rogers
                         ` (2 preceding siblings ...)
  2020-02-14  7:51       ` [PATCH v6 3/6] perf: Use min_heap in visit_groups_merge Ian Rogers
@ 2020-02-14  7:51       ` Ian Rogers
  2020-03-06 14:42         ` [tip: perf/core] perf/core: " tip-bot2 for Ian Rogers
  2020-02-14  7:51       ` [PATCH v6 5/6] perf/cgroup: Grow per perf_cpu_context heap storage Ian Rogers
                         ` (3 subsequent siblings)
  7 siblings, 1 reply; 80+ messages in thread
From: Ian Rogers @ 2020-02-14  7:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Randy Dunlap, Masahiro Yamada, Shuah Khan,
	Krzysztof Kozlowski, Kees Cook, Paul E. McKenney,
	Masami Hiramatsu, Marco Elver, Kent Overstreet, Andy Shevchenko,
	Ard Biesheuvel, Gary Hook, Kan Liang, linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

The storage required for visit_groups_merge's min heap needs to vary in
order to support more iterators, such as when multiple nested cgroups'
events are being visited. This change allows for 2 iterators and doesn't
support growth.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 include/linux/perf_event.h |  7 +++++
 kernel/events/core.c       | 53 ++++++++++++++++++++++++++------------
 2 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 68e21e828893..5060e31b32cc 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -862,6 +862,13 @@ struct perf_cpu_context {
 	int				sched_cb_usage;
 
 	int				online;
+	/*
+	 * Per-CPU storage for iterators used in visit_groups_merge. The default
+	 * storage is of size 2 to hold the CPU and any CPU event iterators.
+	 */
+	int				itr_storage_cap;
+	struct perf_event		**itr_storage;
+	struct perf_event		*itr_default[2];
 };
 
 struct perf_output_handle {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 832e2a56a663..18e4bb871d85 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3419,36 +3419,48 @@ static void __heap_add(struct min_heap *heap, struct perf_event *event)
 	}
 }
 
-static noinline int visit_groups_merge(struct perf_event_groups *groups,
-				int cpu,
+static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
+				struct perf_event_groups *groups, int cpu,
 				int (*func)(struct perf_event *, void *),
 				void *data)
 {
 	/* Space for per CPU and/or any CPU event iterators. */
 	struct perf_event *itrs[2];
-	struct min_heap event_heap = {
-		.data = itrs,
-		.size = 0,
-		.cap = ARRAY_SIZE(itrs),
-	};
+	struct min_heap event_heap;
+	struct perf_event **evt;
 	struct perf_event *next;
 	int ret;
 
-	__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+	if (cpuctx) {
+		event_heap = (struct min_heap){
+			.data = cpuctx->itr_storage,
+			.size = 0,
+			.cap = cpuctx->itr_storage_cap,
+		};
+	} else {
+		event_heap = (struct min_heap){
+			.data = itrs,
+			.size = 0,
+			.cap = ARRAY_SIZE(itrs),
+		};
+		/* Events not within a CPU context may be on any CPU. */
+		__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+	}
+	evt = event_heap.data;
+
 	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));
 
 	min_heapify_all(&event_heap, &perf_min_heap);
 
 	while (event_heap.size) {
-		ret = func(itrs[0], data);
+		ret = func(*evt, data);
 		if (ret)
 			return ret;
 
-		next = perf_event_groups_next(itrs[0]);
-		if (next) {
-			min_heap_pop_push(&event_heap, &next,
-					&perf_min_heap);
-		} else
+		next = perf_event_groups_next(*evt);
+		if (next)
+			min_heap_pop_push(&event_heap, &next, &perf_min_heap);
+		else
 			min_heap_pop(&event_heap, &perf_min_heap);
 	}
 
@@ -3519,7 +3531,10 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
 		.can_add_hw = 1,
 	};
 
-	visit_groups_merge(&ctx->pinned_groups,
+	if (ctx != &cpuctx->ctx)
+		cpuctx = NULL;
+
+	visit_groups_merge(cpuctx, &ctx->pinned_groups,
 			   smp_processor_id(),
 			   pinned_sched_in, &sid);
 }
@@ -3534,7 +3549,10 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
 		.can_add_hw = 1,
 	};
 
-	visit_groups_merge(&ctx->flexible_groups,
+	if (ctx != &cpuctx->ctx)
+		cpuctx = NULL;
+
+	visit_groups_merge(cpuctx, &ctx->flexible_groups,
 			   smp_processor_id(),
 			   flexible_sched_in, &sid);
 }
@@ -10395,6 +10413,9 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 		cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
 
 		__perf_mux_hrtimer_init(cpuctx, cpu);
+
+		cpuctx->itr_storage_cap = ARRAY_SIZE(cpuctx->itr_default);
+		cpuctx->itr_storage = cpuctx->itr_default;
 	}
 
 got_cpu_context:
-- 
2.25.0.265.gbab2e86ba0-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v6 5/6] perf/cgroup: Grow per perf_cpu_context heap storage
  2020-02-14  7:51     ` [PATCH v6 0/6] Optimize cgroup context switch Ian Rogers
                         ` (3 preceding siblings ...)
  2020-02-14  7:51       ` [PATCH v6 4/6] perf: Add per perf_cpu_context min_heap storage Ian Rogers
@ 2020-02-14  7:51       ` Ian Rogers
  2020-03-06 14:42         ` [tip: perf/core] " tip-bot2 for Ian Rogers
  2020-02-14  7:51       ` [PATCH v6 6/6] perf/cgroup: Order events in RB tree by cgroup id Ian Rogers
                         ` (2 subsequent siblings)
  7 siblings, 1 reply; 80+ messages in thread
From: Ian Rogers @ 2020-02-14  7:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Randy Dunlap, Masahiro Yamada, Shuah Khan,
	Krzysztof Kozlowski, Kees Cook, Paul E. McKenney,
	Masami Hiramatsu, Marco Elver, Kent Overstreet, Andy Shevchenko,
	Ard Biesheuvel, Gary Hook, Kan Liang, linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

Allow the per-CPU min heap storage to have sufficient space for per-cgroup
iterators.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 kernel/events/core.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 18e4bb871d85..4eb4e67463bf 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -892,6 +892,47 @@ static inline void perf_cgroup_sched_in(struct task_struct *prev,
 	rcu_read_unlock();
 }
 
+static int perf_cgroup_ensure_storage(struct perf_event *event,
+				struct cgroup_subsys_state *css)
+{
+	struct perf_cpu_context *cpuctx;
+	struct perf_event **storage;
+	int cpu, itr_cap, ret = 0;
+
+	/*
+	 * Allow storage to have sufficient space for an iterator for each
+	 * possibly nested cgroup plus an iterator for events with no cgroup.
+	 */
+	for (itr_cap = 1; css; css = css->parent)
+		itr_cap++;
+
+	for_each_possible_cpu(cpu) {
+		cpuctx = per_cpu_ptr(event->pmu->pmu_cpu_context, cpu);
+		if (itr_cap <= cpuctx->itr_storage_cap)
+			continue;
+
+		storage = kmalloc_node(itr_cap * sizeof(struct perf_event *),
+				       GFP_KERNEL, cpu_to_node(cpu));
+		if (!storage) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		raw_spin_lock_irq(&cpuctx->ctx.lock);
+		if (cpuctx->itr_storage_cap < itr_cap) {
+			swap(cpuctx->itr_storage, storage);
+			if (storage == cpuctx->itr_default)
+				storage = NULL;
+			cpuctx->itr_storage_cap = itr_cap;
+		}
+		raw_spin_unlock_irq(&cpuctx->ctx.lock);
+
+		kfree(storage);
+	}
+
+	return ret;
+}
+
 static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 				      struct perf_event_attr *attr,
 				      struct perf_event *group_leader)
@@ -911,6 +952,10 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 		goto out;
 	}
 
+	ret = perf_cgroup_ensure_storage(event, css);
+	if (ret)
+		goto out;
+
 	cgrp = container_of(css, struct perf_cgroup, css);
 	event->cgrp = cgrp;
 
@@ -3437,6 +3482,8 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 			.size = 0,
 			.cap = cpuctx->itr_storage_cap,
 		};
+
+		lockdep_assert_held(&cpuctx->ctx.lock);
 	} else {
 		event_heap = (struct min_heap){
 			.data = itrs,
-- 
2.25.0.265.gbab2e86ba0-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v6 6/6] perf/cgroup: Order events in RB tree by cgroup id
  2020-02-14  7:51     ` [PATCH v6 0/6] Optimize cgroup context switch Ian Rogers
                         ` (4 preceding siblings ...)
  2020-02-14  7:51       ` [PATCH v6 5/6] perf/cgroup: Grow per perf_cpu_context heap storage Ian Rogers
@ 2020-02-14  7:51       ` Ian Rogers
  2020-02-14 19:32       ` [PATCH v6 0/6] Optimize cgroup context switch Ian Rogers
  2020-02-17 16:18       ` Peter Zijlstra
  7 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2020-02-14  7:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Randy Dunlap, Masahiro Yamada, Shuah Khan,
	Krzysztof Kozlowski, Kees Cook, Paul E. McKenney,
	Masami Hiramatsu, Marco Elver, Kent Overstreet, Andy Shevchenko,
	Ard Biesheuvel, Gary Hook, Kan Liang, linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Ian Rogers

If one is monitoring 6 events on 20 cgroups the per-CPU RB tree will
hold 120 events. The scheduling in of the events currently iterates
over all events looking to see which events match the task's cgroup or
its cgroup hierarchy. If a task is in 1 cgroup with 6 events, then 114
events are considered unnecessarily.

This change orders events in the RB tree by cgroup id if one is present,
so scheduling in may go directly to the events associated with the
task's cgroup. The per-CPU iterator storage in visit_groups_merge is
sized sufficiently for an iterator per cgroup depth, as separate
iterators are needed for the task's cgroup and each of its parent
cgroups. By considering the whole set of iterators when visiting, the
lowest group_index event can be selected and the insertion-order
group_index property is maintained. This also keeps event rotation
working correctly: although events are grouped by cgroup, rotation
always selects the lowest group_index event to rotate (delete/insert
into the tree), and the min heap of iterators ensures that the
group_index order is maintained.
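
For illustration only (not part of the patch), the resulting ordering
can be read as a plain tuple compare; the struct and field names below
are simplified stand-ins for the perf_event fields actually used:

#include <stdbool.h>

struct sort_key {
	int cpu;
	unsigned long long cgrp_id;	/* 0 when the event has no cgroup */
	unsigned long long group_index;
};

static bool sort_key_less(const struct sort_key *l, const struct sort_key *r)
{
	if (l->cpu != r->cpu)
		return l->cpu < r->cpu;
	if (l->cgrp_id != r->cgrp_id)	/* events with no cgroup sort first */
		return l->cgrp_id < r->cgrp_id;
	return l->group_index < r->group_index;
}

With this key, perf_event_groups_first() can descend the RB tree on the
(cpu, cgroup id) prefix and visit only the subtree for the task's
cgroup rather than every event on the CPU.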

Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/20190724223746.153620-3-irogers@google.com
---
 kernel/events/core.c | 97 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 87 insertions(+), 10 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4eb4e67463bf..2eb17c2be5fc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1577,6 +1577,30 @@ perf_event_groups_less(struct perf_event *left, struct perf_event *right)
 	if (left->cpu > right->cpu)
 		return false;
 
+#ifdef CONFIG_CGROUP_PERF
+	if (left->cgrp != right->cgrp) {
+		if (!left->cgrp || !left->cgrp->css.cgroup) {
+			/*
+			 * Left has no cgroup but right does, no cgroups come
+			 * first.
+			 */
+			return true;
+		}
+		if (!right->cgrp || !right->cgrp->css.cgroup) {
+			/*
+			 * Right has no cgroup but left does, no cgroups come
+			 * first.
+			 */
+			return false;
+		}
+		/* Two dissimilar cgroups, order by id. */
+		if (left->cgrp->css.cgroup->kn->id < right->cgrp->css.cgroup->kn->id)
+			return true;
+
+		return false;
+	}
+#endif
+
 	if (left->group_index < right->group_index)
 		return true;
 	if (left->group_index > right->group_index)
@@ -1656,25 +1680,48 @@ del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
 }
 
 /*
- * Get the leftmost event in the @cpu subtree.
+ * Get the leftmost event in the cpu/cgroup subtree.
  */
 static struct perf_event *
-perf_event_groups_first(struct perf_event_groups *groups, int cpu)
+perf_event_groups_first(struct perf_event_groups *groups, int cpu,
+			struct cgroup *cgrp)
 {
 	struct perf_event *node_event = NULL, *match = NULL;
 	struct rb_node *node = groups->tree.rb_node;
+#ifdef CONFIG_CGROUP_PERF
+	u64 node_cgrp_id, cgrp_id = 0;
+
+	if (cgrp)
+		cgrp_id = cgrp->kn->id;
+#endif
 
 	while (node) {
 		node_event = container_of(node, struct perf_event, group_node);
 
 		if (cpu < node_event->cpu) {
 			node = node->rb_left;
-		} else if (cpu > node_event->cpu) {
+			continue;
+		}
+		if (cpu > node_event->cpu) {
 			node = node->rb_right;
-		} else {
-			match = node_event;
+			continue;
+		}
+#ifdef CONFIG_CGROUP_PERF
+		node_cgrp_id = 0;
+		if (node_event->cgrp && node_event->cgrp->css.cgroup)
+			node_cgrp_id = node_event->cgrp->css.cgroup->kn->id;
+
+		if (cgrp_id < node_cgrp_id) {
 			node = node->rb_left;
+			continue;
+		}
+		if (cgrp_id > node_cgrp_id) {
+			node = node->rb_right;
+			continue;
 		}
+#endif
+		match = node_event;
+		node = node->rb_left;
 	}
 
 	return match;
@@ -1687,12 +1734,26 @@ static struct perf_event *
 perf_event_groups_next(struct perf_event *event)
 {
 	struct perf_event *next;
+#ifdef CONFIG_CGROUP_PERF
+	u64 curr_cgrp_id = 0;
+	u64 next_cgrp_id = 0;
+#endif
 
 	next = rb_entry_safe(rb_next(&event->group_node), typeof(*event), group_node);
-	if (next && next->cpu == event->cpu)
-		return next;
+	if (next == NULL || next->cpu != event->cpu)
+		return NULL;
 
-	return NULL;
+#ifdef CONFIG_CGROUP_PERF
+	if (event->cgrp && event->cgrp->css.cgroup)
+		curr_cgrp_id = event->cgrp->css.cgroup->kn->id;
+
+	if (next->cgrp && next->cgrp->css.cgroup)
+		next_cgrp_id = next->cgrp->css.cgroup->kn->id;
+
+	if (curr_cgrp_id != next_cgrp_id)
+		return NULL;
+#endif
+	return next;
 }
 
 /*
@@ -3469,6 +3530,9 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 				int (*func)(struct perf_event *, void *),
 				void *data)
 {
+#ifdef CONFIG_CGROUP_PERF
+	struct cgroup_subsys_state *css = NULL;
+#endif
 	/* Space for per CPU and/or any CPU event iterators. */
 	struct perf_event *itrs[2];
 	struct min_heap event_heap;
@@ -3484,6 +3548,11 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 		};
 
 		lockdep_assert_held(&cpuctx->ctx.lock);
+
+#ifdef CONFIG_CGROUP_PERF
+		if (cpuctx->cgrp)
+			css = &cpuctx->cgrp->css;
+#endif
 	} else {
 		event_heap = (struct min_heap){
 			.data = itrs,
@@ -3491,11 +3560,19 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 			.cap = ARRAY_SIZE(itrs),
 		};
 		/* Events not within a CPU context may be on any CPU. */
-		__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+		__heap_add(&event_heap, perf_event_groups_first(groups, -1,
+									NULL));
 	}
 	evt = event_heap.data;
 
-	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));
+	__heap_add(&event_heap, perf_event_groups_first(groups, cpu, NULL));
+
+#ifdef CONFIG_CGROUP_PERF
+	for (; css; css = css->parent) {
+		__heap_add(&event_heap, perf_event_groups_first(groups, cpu,
+								css->cgroup));
+	}
+#endif
 
 	min_heapify_all(&event_heap, &perf_min_heap);
 
-- 
2.25.0.265.gbab2e86ba0-goog


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH v6 1/6] perf/cgroup: Reorder perf_cgroup_connect()
  2020-02-14  7:51       ` [PATCH v6 1/6] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
@ 2020-02-14 16:11         ` Shuah Khan
  2020-02-14 17:37           ` Peter Zijlstra
  2020-03-06 14:42         ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
  1 sibling, 1 reply; 80+ messages in thread
From: Shuah Khan @ 2020-02-14 16:11 UTC (permalink / raw)
  To: Ian Rogers, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andrew Morton, Randy Dunlap,
	Masahiro Yamada, Krzysztof Kozlowski, Kees Cook,
	Paul E. McKenney, Masami Hiramatsu, Marco Elver, Kent Overstreet,
	Andy Shevchenko, Ard Biesheuvel, Gary Hook, Kan Liang,
	linux-kernel
  Cc: Stephane Eranian, Andi Kleen, Shuah Khan

On 2/14/20 12:51 AM, Ian Rogers wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Move perf_cgroup_connect() after perf_event_alloc(), such that we can
> find/use the PMU's cpu context.

Can you elaborate on this usage? It would be helpful to know how
this is used and what we get from it. What were we missing
with the way it was done before?


> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>   kernel/events/core.c | 16 ++++++++--------
>   1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 3f1f77de7247..9bd2af954c54 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -10804,12 +10804,6 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>   	if (!has_branch_stack(event))
>   		event->attr.branch_sample_type = 0;
>   
> -	if (cgroup_fd != -1) {
> -		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
> -		if (err)
> -			goto err_ns;
> -	}
> -
>   	pmu = perf_init_event(event);
>   	if (IS_ERR(pmu)) {
>   		err = PTR_ERR(pmu);
> @@ -10831,6 +10825,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>   		goto err_pmu;

Is this patch based on linux-next or linux 5.6-rc1? I am finding the
code paths to be different in those. Also in
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git

goto err_ns makes more sense if perf_init_event() doesn't return a
valid pmu, especially since err_pmu tries to do a put on pmu->module.

Something doesn't look right.


>   	}


>   
> +	if (cgroup_fd != -1) {
> +		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
> +		if (err)
> +			goto err_pmu;
> +	}
> +
>   	err = exclusive_event_init(event);
>   	if (err)
>   		goto err_pmu;
> @@ -10891,12 +10891,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>   	exclusive_event_destroy(event);
>   
>   err_pmu:
> +	if (is_cgroup_event(event))
> +		perf_detach_cgroup(event);
>   	if (event->destroy)
>   		event->destroy(event);
>   	module_put(pmu->module);
>   err_ns:
> -	if (is_cgroup_event(event))
> -		perf_detach_cgroup(event);
>   	if (event->ns)
>   		put_pid_ns(event->ns);
>   	if (event->hw.target)
> 

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v6 1/6] perf/cgroup: Reorder perf_cgroup_connect()
  2020-02-14 16:11         ` Shuah Khan
@ 2020-02-14 17:37           ` Peter Zijlstra
  0 siblings, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2020-02-14 17:37 UTC (permalink / raw)
  To: Shuah Khan
  Cc: Ian Rogers, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Randy Dunlap, Masahiro Yamada, Krzysztof Kozlowski, Kees Cook,
	Paul E. McKenney, Masami Hiramatsu, Marco Elver, Kent Overstreet,
	Andy Shevchenko, Ard Biesheuvel, Gary Hook, Kan Liang,
	linux-kernel, Stephane Eranian, Andi Kleen

On Fri, Feb 14, 2020 at 09:11:24AM -0700, Shuah Khan wrote:
> On 2/14/20 12:51 AM, Ian Rogers wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > Move perf_cgroup_connect() after perf_event_alloc(), such that we can
> > find/use the PMU's cpu context.
> 
> Can you elaborate on this usage? It will helpful to know how
> this is used and what do we get from it. What were we missing
> with the way it was done before?

It says so right there. We need it after perf_event_alloc() so that we
can find/use the event's PMU cpu context.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v6 0/6] Optimize cgroup context switch
  2020-02-14  7:51     ` [PATCH v6 0/6] Optimize cgroup context switch Ian Rogers
                         ` (5 preceding siblings ...)
  2020-02-14  7:51       ` [PATCH v6 6/6] perf/cgroup: Order events in RB tree by cgroup id Ian Rogers
@ 2020-02-14 19:32       ` Ian Rogers
  2020-02-17 16:18       ` Peter Zijlstra
  7 siblings, 0 replies; 80+ messages in thread
From: Ian Rogers @ 2020-02-14 19:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andrew Morton, Randy Dunlap, Masahiro Yamada, Shuah Khan,
	Krzysztof Kozlowski, Kees Cook, Paul E. McKenney,
	Masami Hiramatsu, Marco Elver, Kent Overstreet, Andy Shevchenko,
	Ard Biesheuvel, Kan Liang, LKML

On a thread related to these patches Peter had previously asked what
the performance numbers looked like. I've tested on Westmere and
Cascade Lake platforms. The benchmark is a set of processes in
different cgroups reading/writing to a file descriptor, where the read
blocks and forces a context switch. To force the context switches all
the processes are pinned to a particular CPU, and the benchmark checks
that the expected number of context switches matches the number
performed. The benchmark scales up the number of perf events and
cgroups; it also looks at the effect of monitoring just 1 cgroup out of
an increasing set of cgroups.
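
This isn't the exact benchmark used above, just a minimal sketch of the
idea, assuming two processes pinned to one CPU that ping-pong a byte
over a pair of pipes so that every read blocks and forces a context
switch (for the cgroup cases each process would sit in its own cgroup
and perf would monitor the cgroups):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");
}

int main(void)
{
	int ab[2], ba[2], i, iters = 100000;
	char c = 0;

	if (pipe(ab) || pipe(ba))
		return 1;

	if (fork() == 0) {			/* child */
		pin_to_cpu(0);
		for (i = 0; i < iters; i++) {
			read(ab[0], &c, 1);
			write(ba[1], &c, 1);
		}
		_exit(0);
	}

	pin_to_cpu(0);				/* same CPU forces the switch */
	for (i = 0; i < iters; i++) {
		write(ab[1], &c, 1);
		read(ba[0], &c, 1);
	}
	return 0;
}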

Before the patches, on Westmere, if we do system-wide profiling of 10
events and then increase the cgroups to 208 and monitor just one, the
context switch times go from 4.6us to 15.3us. If we monitor each
cgroup then the context switch times are 172.5us. With the patches,
the time for monitoring 1 cgroup goes from 4.6us to 14.9us, but when
monitoring all cgroups the context switch times are 14.1us. The small
speed up when monitoring 1 cgroup out of a set comes from the O(n)
search for an event in a cgroup becoming O(log(n)) in most context
switches. When all cgroups are monitored the number of events in the
kernel is the product of the number of events and cgroups, giving a
larger value for 'n' and a more dramatic speed up - 172.5us becomes
14.1us.

In summary, before the patches the context switch times are affected
by the number of cgroups monitored; after the patches there is still a
context switch cost in monitoring events, but it is similar whether 1
or all cgroups are being monitored. This fits the intuition behind the
patches, which avoid searching events that belong to cgroups the
current task isn't within. The results are consistent but less
dramatic for smaller numbers of events and cgroups. We've not
identified a slow down from the patches, but there is a degree of
noise in the timing data. Broadly, with turbo disabled on the test
machines the patches make context switch performance the same or
faster. For a more representative number of events and cgroups, say 6
and 32, we see context switch time improve from 29.4us to 13.2us when
all cgroups are monitored.

Thanks,
Ian


On Thu, Feb 13, 2020 at 11:51 PM Ian Rogers <irogers@google.com> wrote:
>
> Avoid iterating over all per-CPU events during cgroup changing context
> switches by organizing events by cgroup.
>
> To make an efficient set of iterators, introduce a min max heap
> utility with test.
>
> The v6 patch reduces the patch set by 4 patches, it updates the cgroup
> id and fixes part of the min_heap rename from v5.
>
> The v5 patch set renames min_max_heap to min_heap as suggested by
> Peter Zijlstra, it also addresses comments around preferring
> __always_inline over inline.
>
> The v4 patch set addresses review comments on the v3 patch set by
> Peter Zijlstra.
>
> These patches include a caching algorithm to improve the search for
> the first event in a group by Kan Liang <kan.liang@linux.intel.com> as
> well as rebasing hit "optimize event_filter_match during sched_in"
> from https://lkml.org/lkml/2019/8/7/771.
>
> The v2 patch set was modified by Peter Zijlstra in his perf/cgroup
> branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git
>
> These patches follow Peter's reorganization and his fixes to the
> perf_cpu_context min_heap storage code.
>
> Ian Rogers (5):
>   lib: introduce generic min-heap
>   perf: Use min_heap in visit_groups_merge
>   perf: Add per perf_cpu_context min_heap storage
>   perf/cgroup: Grow per perf_cpu_context heap storage
>   perf/cgroup: Order events in RB tree by cgroup id
>
> Peter Zijlstra (1):
>   perf/cgroup: Reorder perf_cgroup_connect()
>
>  include/linux/min_heap.h   | 135 ++++++++++++++++++++
>  include/linux/perf_event.h |   7 ++
>  kernel/events/core.c       | 251 +++++++++++++++++++++++++++++++------
>  lib/Kconfig.debug          |  10 ++
>  lib/Makefile               |   1 +
>  lib/test_min_heap.c        | 194 ++++++++++++++++++++++++++++
>  6 files changed, 563 insertions(+), 35 deletions(-)
>  create mode 100644 include/linux/min_heap.h
>  create mode 100644 lib/test_min_heap.c
>
> --
> 2.25.0.265.gbab2e86ba0-goog
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v6 2/6] lib: introduce generic min-heap
  2020-02-14  7:51       ` [PATCH v6 2/6] lib: introduce generic min-heap Ian Rogers
@ 2020-02-14 22:06         ` Randy Dunlap
  2020-02-17 16:29         ` Peter Zijlstra
  2020-03-06 14:42         ` [tip: perf/core] lib: Introduce " tip-bot2 for Ian Rogers
  2 siblings, 0 replies; 80+ messages in thread
From: Randy Dunlap @ 2020-02-14 22:06 UTC (permalink / raw)
  To: Ian Rogers, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andrew Morton, Masahiro Yamada,
	Shuah Khan, Krzysztof Kozlowski, Kees Cook, Paul E. McKenney,
	Masami Hiramatsu, Marco Elver, Kent Overstreet, Andy Shevchenko,
	Ard Biesheuvel, Gary Hook, Kan Liang, linux-kernel
  Cc: Stephane Eranian, Andi Kleen

Hi,

On 2/13/20 11:51 PM, Ian Rogers wrote:
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 1458505192cd..e61e7fee9364 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1771,6 +1771,16 @@ config TEST_LIST_SORT
>  
>  	  If unsure, say N.
>  
> +config TEST_MIN_HEAP
> +	tristate "Min heap test"
> +	depends on DEBUG_KERNEL || m

I realize that this is (likely) copied from other config entries,
but the "depends on DEBUG_KERNEL || m" doesn't make any sense to me.
Seems like it should be "depends on DEBUG_KERNEL && m"...

Why should it be "||"??


> +	help
> +	  Enable this to turn on min heap function tests. This test is
> +	  executed only once during system boot (so affects only boot time),
> +	  or at module load time.
> +
> +	  If unsure, say N.
> +
>  config TEST_SORT
>  	tristate "Array-based sort test"
>  	depends on DEBUG_KERNEL || m


thanks.
-- 
~Randy


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v6 0/6] Optimize cgroup context switch
  2020-02-14  7:51     ` [PATCH v6 0/6] Optimize cgroup context switch Ian Rogers
                         ` (6 preceding siblings ...)
  2020-02-14 19:32       ` [PATCH v6 0/6] Optimize cgroup context switch Ian Rogers
@ 2020-02-17 16:18       ` Peter Zijlstra
  7 siblings, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2020-02-17 16:18 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Randy Dunlap, Masahiro Yamada, Shuah Khan, Krzysztof Kozlowski,
	Kees Cook, Paul E. McKenney, Masami Hiramatsu, Marco Elver,
	Kent Overstreet, Andy Shevchenko, Ard Biesheuvel, Gary Hook,
	Kan Liang, linux-kernel, Stephane Eranian, Andi Kleen



Please don't thread to the last series; I only found this by accident
because I was looking at an email Kan referenced.

Now I have to go find how to break threading in Mutt again :/

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v6 2/6] lib: introduce generic min-heap
  2020-02-14  7:51       ` [PATCH v6 2/6] lib: introduce generic min-heap Ian Rogers
  2020-02-14 22:06         ` Randy Dunlap
@ 2020-02-17 16:29         ` Peter Zijlstra
  2020-03-06 14:42         ` [tip: perf/core] lib: Introduce " tip-bot2 for Ian Rogers
  2 siblings, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2020-02-17 16:29 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Randy Dunlap, Masahiro Yamada, Shuah Khan, Krzysztof Kozlowski,
	Kees Cook, Paul E. McKenney, Masami Hiramatsu, Marco Elver,
	Kent Overstreet, Andy Shevchenko, Ard Biesheuvel, Gary Hook,
	Kan Liang, linux-kernel, Stephane Eranian, Andi Kleen

On Thu, Feb 13, 2020 at 11:51:29PM -0800, Ian Rogers wrote:
> Supports push, pop and converting an array into a heap. If the sense of
> the compare function is inverted then it can provide a max-heap.

+whitespace

> Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
-whitespace

> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  include/linux/min_heap.h | 135 +++++++++++++++++++++++++++
>  lib/Kconfig.debug        |  10 ++
>  lib/Makefile             |   1 +
>  lib/test_min_heap.c      | 194 +++++++++++++++++++++++++++++++++++++++
>  4 files changed, 340 insertions(+)
>  create mode 100644 include/linux/min_heap.h
>  create mode 100644 lib/test_min_heap.c
> 
> diff --git a/include/linux/min_heap.h b/include/linux/min_heap.h
> new file mode 100644
> index 000000000000..0f04f49c0779
> --- /dev/null
> +++ b/include/linux/min_heap.h
> @@ -0,0 +1,135 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MIN_HEAP_H
> +#define _LINUX_MIN_HEAP_H
> +
> +#include <linux/bug.h>
> +#include <linux/string.h>
> +#include <linux/types.h>
> +
> +/**
> + * struct min_heap - Data structure to hold a min-heap.
> + * @data: Start of array holding the heap elements.
> + * @size: Number of elements currently in the heap.
> + * @cap: Maximum number of elements that can be held in current storage.
> + */
> +struct min_heap {
> +	void *data;
> +	int size;
> +	int cap;
> +};
> +
> +/**
> + * struct min_heap_callbacks - Data/functions to customise the min_heap.
> + * @elem_size: The size of each element in bytes.
> + * @cmp: Partial order function for this heap 'less'/'<' for min-heap,
> + *       'greater'/'>' for max-heap.

Since the thing is now called min_heap, 's/cmp/less/g'. cmp in C is a
-1,0,1 like thing.
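
For instance (illustration of the naming convention only):

	int  cmp(const void *a, const void *b);   /* returns <0, 0 or >0   */
	bool less(const void *a, const void *b);  /* returns true iff a < b */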

> + * @swp: Swap elements function.
> + */
> +struct min_heap_callbacks {
> +	int elem_size;
> +	bool (*cmp)(const void *lhs, const void *rhs);
> +	void (*swp)(void *lhs, void *rhs);
> +};
> +
> +/* Sift the element at pos down the heap. */
> +static __always_inline
> +void min_heapify(struct min_heap *heap, int pos,
> +		const struct min_heap_callbacks *func)
> +{
> +	void *left_child, *right_child, *parent, *large_or_smallest;

's/large_or_smallest/smallest/g' ?

> +	u8 *data = (u8 *)heap->data;

void * has byte sized arithmetic

> +
> +	for (;;) {
> +		if (pos * 2 + 1 >= heap->size)
> +			break;
> +
> +		left_child = data + ((pos * 2 + 1) * func->elem_size);
> +		parent = data + (pos * func->elem_size);

> +		large_or_smallest = parent;
> +		if (func->cmp(left_child, large_or_smallest))
> +			large_or_smallest = left_child;

		smallest = parent;
		if (func->less(left_child, smallest);
			smallest = left_child;

Makes sense, no?

> +
> +		if (pos * 2 + 2 < heap->size) {
> +			right_child = data + ((pos * 2 + 2) * func->elem_size);
> +			if (func->cmp(right_child, large_or_smallest))
> +				large_or_smallest = right_child;
> +		}
> +		if (large_or_smallest == parent)
> +			break;
> +		func->swp(large_or_smallest, parent);
> +		if (large_or_smallest == left_child)
> +			pos = (pos * 2) + 1;
> +		else
> +			pos = (pos * 2) + 2;

> +/*
> + * Remove the minimum element and then push the given element. The
> + * implementation performs 1 sift (O(log2(size))) and is therefore more
> + * efficient than a pop followed by a push that does 2.
> + */
> +static __always_inline
> +void min_heap_pop_push(struct min_heap *heap,
> +		const void *element,
> +		const struct min_heap_callbacks *func)
> +{
> +	memcpy(heap->data, element, func->elem_size);
> +	min_heapify(heap, 0, func);
> +}

I still think this is a mighty weird primitive. I think I simply did:

	*evt = perf_event_group(next);
	if (*evt)
		min_heapify(..);


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v6 3/6] perf: Use min_heap in visit_groups_merge
  2020-02-14  7:51       ` [PATCH v6 3/6] perf: Use min_heap in visit_groups_merge Ian Rogers
@ 2020-02-17 17:23         ` Peter Zijlstra
  2020-03-06 14:42         ` [tip: perf/core] perf/core: Use min_heap in visit_groups_merge() tip-bot2 for Ian Rogers
  1 sibling, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2020-02-17 17:23 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Randy Dunlap, Masahiro Yamada, Shuah Khan, Krzysztof Kozlowski,
	Kees Cook, Paul E. McKenney, Masami Hiramatsu, Marco Elver,
	Kent Overstreet, Andy Shevchenko, Ard Biesheuvel, Gary Hook,
	Kan Liang, linux-kernel, Stephane Eranian, Andi Kleen

On Thu, Feb 13, 2020 at 11:51:30PM -0800, Ian Rogers wrote:
>  
> -		*evt = perf_event_groups_next(*evt);
> +		next = perf_event_groups_next(itrs[0]);
> +		if (next) {
> +			min_heap_pop_push(&event_heap, &next,
> +					&perf_min_heap);
> +		} else
> +			min_heap_pop(&event_heap, &perf_min_heap);
>  	}

Like this:

@@ -3585,9 +3581,9 @@ static noinline int visit_groups_merge(s
 		if (ret)
 			return ret;
 
-		next = perf_event_groups_next(*evt);
-		if (next)
-			min_heap_pop_push(&event_heap, &next, &perf_min_heap);
+		*evt = perf_event_groups_next(*evt);
+		if (*evt)
+			min_heapify(&event_heap, 0, &perf_min_heap);
 		else
 			min_heap_pop(&event_heap, &perf_min_heap);
 	}

That's an 'obvious' replace and resort and obviates the need for that
weird pop that doesn't return nothing operation.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [tip: perf/core] perf/cgroup: Grow per perf_cpu_context heap storage
  2020-02-14  7:51       ` [PATCH v6 5/6] perf/cgroup: Grow per perf_cpu_context heap storage Ian Rogers
@ 2020-03-06 14:42         ` tip-bot2 for Ian Rogers
  0 siblings, 0 replies; 80+ messages in thread
From: tip-bot2 for Ian Rogers @ 2020-03-06 14:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ian Rogers, Peter Zijlstra (Intel), Ingo Molnar, x86, LKML

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     c2283c9368d41063f2077cb58def02217360526d
Gitweb:        https://git.kernel.org/tip/c2283c9368d41063f2077cb58def02217360526d
Author:        Ian Rogers <irogers@google.com>
AuthorDate:    Thu, 13 Feb 2020 23:51:32 -08:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Fri, 06 Mar 2020 11:57:00 +01:00

perf/cgroup: Grow per perf_cpu_context heap storage

Allow the per-CPU min heap storage to have sufficient space for per-cgroup
iterators.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-6-irogers@google.com
---
 kernel/events/core.c | 47 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 47 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7529e76..8065949 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -892,6 +892,47 @@ static inline void perf_cgroup_sched_in(struct task_struct *prev,
 	rcu_read_unlock();
 }
 
+static int perf_cgroup_ensure_storage(struct perf_event *event,
+				struct cgroup_subsys_state *css)
+{
+	struct perf_cpu_context *cpuctx;
+	struct perf_event **storage;
+	int cpu, heap_size, ret = 0;
+
+	/*
+	 * Allow storage to have sufficient space for an iterator for each
+	 * possibly nested cgroup plus an iterator for events with no cgroup.
+	 */
+	for (heap_size = 1; css; css = css->parent)
+		heap_size++;
+
+	for_each_possible_cpu(cpu) {
+		cpuctx = per_cpu_ptr(event->pmu->pmu_cpu_context, cpu);
+		if (heap_size <= cpuctx->heap_size)
+			continue;
+
+		storage = kmalloc_node(heap_size * sizeof(struct perf_event *),
+				       GFP_KERNEL, cpu_to_node(cpu));
+		if (!storage) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		raw_spin_lock_irq(&cpuctx->ctx.lock);
+		if (cpuctx->heap_size < heap_size) {
+			swap(cpuctx->heap, storage);
+			if (storage == cpuctx->heap_default)
+				storage = NULL;
+			cpuctx->heap_size = heap_size;
+		}
+		raw_spin_unlock_irq(&cpuctx->ctx.lock);
+
+		kfree(storage);
+	}
+
+	return ret;
+}
+
 static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 				      struct perf_event_attr *attr,
 				      struct perf_event *group_leader)
@@ -911,6 +952,10 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 		goto out;
 	}
 
+	ret = perf_cgroup_ensure_storage(event, css);
+	if (ret)
+		goto out;
+
 	cgrp = container_of(css, struct perf_cgroup, css);
 	event->cgrp = cgrp;
 
@@ -3440,6 +3485,8 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 			.nr = 0,
 			.size = cpuctx->heap_size,
 		};
+
+		lockdep_assert_held(&cpuctx->ctx.lock);
 	} else {
 		event_heap = (struct min_heap){
 			.data = itrs,

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [tip: perf/core] perf/core: Add per perf_cpu_context min_heap storage
  2020-02-14  7:51       ` [PATCH v6 4/6] perf: Add per perf_cpu_context min_heap storage Ian Rogers
@ 2020-03-06 14:42         ` tip-bot2 for Ian Rogers
  0 siblings, 0 replies; 80+ messages in thread
From: tip-bot2 for Ian Rogers @ 2020-03-06 14:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ian Rogers, Peter Zijlstra (Intel), Ingo Molnar, x86, LKML

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     836196beb377e59e54ec9e04f7402076ef7a8bd8
Gitweb:        https://git.kernel.org/tip/836196beb377e59e54ec9e04f7402076ef7a8bd8
Author:        Ian Rogers <irogers@google.com>
AuthorDate:    Thu, 13 Feb 2020 23:51:31 -08:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Fri, 06 Mar 2020 11:57:00 +01:00

perf/core: Add per perf_cpu_context min_heap storage

The storage required for visit_groups_merge's min heap needs to vary in
order to support more iterators, such as when multiple nested cgroups'
events are being visited. This change allows for 2 iterators and doesn't
support growth.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-5-irogers@google.com
---
 include/linux/perf_event.h |  7 ++++++-
 kernel/events/core.c       | 43 +++++++++++++++++++++++++++----------
 2 files changed, 39 insertions(+), 11 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 68e21e8..8768a39 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -862,6 +862,13 @@ struct perf_cpu_context {
 	int				sched_cb_usage;
 
 	int				online;
+	/*
+	 * Per-CPU storage for iterators used in visit_groups_merge. The default
+	 * storage is of size 2 to hold the CPU and any CPU event iterators.
+	 */
+	int				heap_size;
+	struct perf_event		**heap;
+	struct perf_event		*heap_default[2];
 };
 
 struct perf_output_handle {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ddfb06c..7529e76 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3423,22 +3423,34 @@ static void __heap_add(struct min_heap *heap, struct perf_event *event)
 	}
 }
 
-static noinline int visit_groups_merge(struct perf_event_groups *groups,
-				int cpu,
+static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
+				struct perf_event_groups *groups, int cpu,
 				int (*func)(struct perf_event *, void *),
 				void *data)
 {
 	/* Space for per CPU and/or any CPU event iterators. */
 	struct perf_event *itrs[2];
-	struct min_heap event_heap = {
-		.data = itrs,
-		.nr = 0,
-		.size = ARRAY_SIZE(itrs),
-	};
-	struct perf_event **evt = event_heap.data;
+	struct min_heap event_heap;
+	struct perf_event **evt;
 	int ret;
 
-	__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+	if (cpuctx) {
+		event_heap = (struct min_heap){
+			.data = cpuctx->heap,
+			.nr = 0,
+			.size = cpuctx->heap_size,
+		};
+	} else {
+		event_heap = (struct min_heap){
+			.data = itrs,
+			.nr = 0,
+			.size = ARRAY_SIZE(itrs),
+		};
+		/* Events not within a CPU context may be on any CPU. */
+		__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+	}
+	evt = event_heap.data;
+
 	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));
 
 	min_heapify_all(&event_heap, &perf_min_heap);
@@ -3492,7 +3504,10 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
 {
 	int can_add_hw = 1;
 
-	visit_groups_merge(&ctx->pinned_groups,
+	if (ctx != &cpuctx->ctx)
+		cpuctx = NULL;
+
+	visit_groups_merge(cpuctx, &ctx->pinned_groups,
 			   smp_processor_id(),
 			   merge_sched_in, &can_add_hw);
 }
@@ -3503,7 +3518,10 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
 {
 	int can_add_hw = 1;
 
-	visit_groups_merge(&ctx->flexible_groups,
+	if (ctx != &cpuctx->ctx)
+		cpuctx = NULL;
+
+	visit_groups_merge(cpuctx, &ctx->flexible_groups,
 			   smp_processor_id(),
 			   merge_sched_in, &can_add_hw);
 }
@@ -10364,6 +10382,9 @@ skip_type:
 		cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
 
 		__perf_mux_hrtimer_init(cpuctx, cpu);
+
+		cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default);
+		cpuctx->heap = cpuctx->heap_default;
 	}
 
 got_cpu_context:

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [tip: perf/core] lib: Introduce generic min-heap
  2020-02-14  7:51       ` [PATCH v6 2/6] lib: introduce generic min-heap Ian Rogers
  2020-02-14 22:06         ` Randy Dunlap
  2020-02-17 16:29         ` Peter Zijlstra
@ 2020-03-06 14:42         ` tip-bot2 for Ian Rogers
  2 siblings, 0 replies; 80+ messages in thread
From: tip-bot2 for Ian Rogers @ 2020-03-06 14:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ian Rogers, Peter Zijlstra (Intel), Ingo Molnar, x86, LKML

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     6e24628d78e4785385876125cba62315ca3b04b9
Gitweb:        https://git.kernel.org/tip/6e24628d78e4785385876125cba62315ca3b04b9
Author:        Ian Rogers <irogers@google.com>
AuthorDate:    Thu, 13 Feb 2020 23:51:29 -08:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Fri, 06 Mar 2020 11:56:59 +01:00

lib: Introduce generic min-heap

Supports push, pop and converting an array into a heap. If the sense of
the compare function is inverted then it can provide a max-heap.
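
For illustration (not part of this commit): a sketch of using the API
below with the comparison inverted, keeping the three smallest values
of a stream by treating the structure as a max-heap. The helper names
are made up for the example.

#include <linux/min_heap.h>

static bool greater_than(const void *lhs, const void *rhs)
{
	return *(const int *)lhs > *(const int *)rhs;
}

static void swap_ints(void *lhs, void *rhs)
{
	int tmp = *(int *)lhs;

	*(int *)lhs = *(int *)rhs;
	*(int *)rhs = tmp;
}

static const struct min_heap_callbacks max_funcs = {
	.elem_size = sizeof(int),
	.less = greater_than,		/* inverted sense: acts as a max-heap */
	.swp = swap_ints,
};

/* After the call, out[] holds the three smallest values of in[] (n >= 3). */
static void keep_three_smallest(const int *in, int n, int out[3])
{
	struct min_heap heap = { .data = out, .nr = 0, .size = 3 };
	int i;

	for (i = 0; i < n; i++) {
		if (heap.nr < heap.size)
			min_heap_push(&heap, &in[i], &max_funcs);
		else if (in[i] < out[0])	/* smaller than the current max */
			min_heap_pop_push(&heap, &in[i], &max_funcs);
	}
}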

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-3-irogers@google.com
---
 include/linux/min_heap.h | 134 ++++++++++++++++++++++++++-
 lib/Kconfig.debug        |  10 ++-
 lib/Makefile             |   1 +-
 lib/test_min_heap.c      | 194 ++++++++++++++++++++++++++++++++++++++-
 4 files changed, 339 insertions(+)
 create mode 100644 include/linux/min_heap.h
 create mode 100644 lib/test_min_heap.c

diff --git a/include/linux/min_heap.h b/include/linux/min_heap.h
new file mode 100644
index 0000000..4407783
--- /dev/null
+++ b/include/linux/min_heap.h
@@ -0,0 +1,134 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MIN_HEAP_H
+#define _LINUX_MIN_HEAP_H
+
+#include <linux/bug.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+/**
+ * struct min_heap - Data structure to hold a min-heap.
+ * @data: Start of array holding the heap elements.
+ * @nr: Number of elements currently in the heap.
+ * @size: Maximum number of elements that can be held in current storage.
+ */
+struct min_heap {
+	void *data;
+	int nr;
+	int size;
+};
+
+/**
+ * struct min_heap_callbacks - Data/functions to customise the min_heap.
+ * @elem_size: The size of each element in bytes.
+ * @less: Partial order function for this heap.
+ * @swp: Swap elements function.
+ */
+struct min_heap_callbacks {
+	int elem_size;
+	bool (*less)(const void *lhs, const void *rhs);
+	void (*swp)(void *lhs, void *rhs);
+};
+
+/* Sift the element at pos down the heap. */
+static __always_inline
+void min_heapify(struct min_heap *heap, int pos,
+		const struct min_heap_callbacks *func)
+{
+	void *left, *right, *parent, *smallest;
+	void *data = heap->data;
+
+	for (;;) {
+		if (pos * 2 + 1 >= heap->nr)
+			break;
+
+		left = data + ((pos * 2 + 1) * func->elem_size);
+		parent = data + (pos * func->elem_size);
+		smallest = parent;
+		if (func->less(left, smallest))
+			smallest = left;
+
+		if (pos * 2 + 2 < heap->nr) {
+			right = data + ((pos * 2 + 2) * func->elem_size);
+			if (func->less(right, smallest))
+				smallest = right;
+		}
+		if (smallest == parent)
+			break;
+		func->swp(smallest, parent);
+		if (smallest == left)
+			pos = (pos * 2) + 1;
+		else
+			pos = (pos * 2) + 2;
+	}
+}
+
+/* Floyd's approach to heapification that is O(nr). */
+static __always_inline
+void min_heapify_all(struct min_heap *heap,
+		const struct min_heap_callbacks *func)
+{
+	int i;
+
+	for (i = heap->nr / 2; i >= 0; i--)
+		min_heapify(heap, i, func);
+}
+
+/* Remove minimum element from the heap, O(log2(nr)). */
+static __always_inline
+void min_heap_pop(struct min_heap *heap,
+		const struct min_heap_callbacks *func)
+{
+	void *data = heap->data;
+
+	if (WARN_ONCE(heap->nr <= 0, "Popping an empty heap"))
+		return;
+
+	/* Place last element at the root (position 0) and then sift down. */
+	heap->nr--;
+	memcpy(data, data + (heap->nr * func->elem_size), func->elem_size);
+	min_heapify(heap, 0, func);
+}
+
+/*
+ * Remove the minimum element and then push the given element. The
+ * implementation performs 1 sift (O(log2(nr))) and is therefore more
+ * efficient than a pop followed by a push that does 2.
+ */
+static __always_inline
+void min_heap_pop_push(struct min_heap *heap,
+		const void *element,
+		const struct min_heap_callbacks *func)
+{
+	memcpy(heap->data, element, func->elem_size);
+	min_heapify(heap, 0, func);
+}
+
+/* Push an element on to the heap, O(log2(nr)). */
+static __always_inline
+void min_heap_push(struct min_heap *heap, const void *element,
+		const struct min_heap_callbacks *func)
+{
+	void *data = heap->data;
+	void *child, *parent;
+	int pos;
+
+	if (WARN_ONCE(heap->nr >= heap->size, "Pushing on a full heap"))
+		return;
+
+	/* Place at the end of data. */
+	pos = heap->nr;
+	memcpy(data + (pos * func->elem_size), element, func->elem_size);
+	heap->nr++;
+
+	/* Sift child at pos up. */
+	for (; pos > 0; pos = (pos - 1) / 2) {
+		child = data + (pos * func->elem_size);
+		parent = data + ((pos - 1) / 2) * func->elem_size;
+		if (func->less(parent, child))
+			break;
+		func->swp(parent, child);
+	}
+}
+
+#endif /* _LINUX_MIN_HEAP_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 69def4a..f04b61c 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1769,6 +1769,16 @@ config TEST_LIST_SORT
 
 	  If unsure, say N.
 
+config TEST_MIN_HEAP
+	tristate "Min heap test"
+	depends on DEBUG_KERNEL || m
+	help
+	  Enable this to turn on min heap function tests. This test is
+	  executed only once during system boot (so affects only boot time),
+	  or at module load time.
+
+	  If unsure, say N.
+
 config TEST_SORT
 	tristate "Array-based sort test"
 	depends on DEBUG_KERNEL || m
diff --git a/lib/Makefile b/lib/Makefile
index 611872c..09a8acb 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -67,6 +67,7 @@ CFLAGS_test_ubsan.o += $(call cc-disable-warning, vla)
 UBSAN_SANITIZE_test_ubsan.o := y
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
 obj-$(CONFIG_TEST_LIST_SORT) += test_list_sort.o
+obj-$(CONFIG_TEST_MIN_HEAP) += test_min_heap.o
 obj-$(CONFIG_TEST_LKM) += test_module.o
 obj-$(CONFIG_TEST_VMALLOC) += test_vmalloc.o
 obj-$(CONFIG_TEST_OVERFLOW) += test_overflow.o
diff --git a/lib/test_min_heap.c b/lib/test_min_heap.c
new file mode 100644
index 0000000..d19c808
--- /dev/null
+++ b/lib/test_min_heap.c
@@ -0,0 +1,194 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define pr_fmt(fmt) "min_heap_test: " fmt
+
+/*
+ * Test cases for the min max heap.
+ */
+
+#include <linux/log2.h>
+#include <linux/min_heap.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/random.h>
+
+static __init bool less_than(const void *lhs, const void *rhs)
+{
+	return *(int *)lhs < *(int *)rhs;
+}
+
+static __init bool greater_than(const void *lhs, const void *rhs)
+{
+	return *(int *)lhs > *(int *)rhs;
+}
+
+static __init void swap_ints(void *lhs, void *rhs)
+{
+	int temp = *(int *)lhs;
+
+	*(int *)lhs = *(int *)rhs;
+	*(int *)rhs = temp;
+}
+
+static __init int pop_verify_heap(bool min_heap,
+				struct min_heap *heap,
+				const struct min_heap_callbacks *funcs)
+{
+	int *values = heap->data;
+	int err = 0;
+	int last;
+
+	last = values[0];
+	min_heap_pop(heap, funcs);
+	while (heap->nr > 0) {
+		if (min_heap) {
+			if (last > values[0]) {
+				pr_err("error: expected %d <= %d\n", last,
+					values[0]);
+				err++;
+			}
+		} else {
+			if (last < values[0]) {
+				pr_err("error: expected %d >= %d\n", last,
+					values[0]);
+				err++;
+			}
+		}
+		last = values[0];
+		min_heap_pop(heap, funcs);
+	}
+	return err;
+}
+
+static __init int test_heapify_all(bool min_heap)
+{
+	int values[] = { 3, 1, 2, 4, 0x8000000, 0x7FFFFFF, 0,
+			 -3, -1, -2, -4, 0x8000000, 0x7FFFFFF };
+	struct min_heap heap = {
+		.data = values,
+		.nr = ARRAY_SIZE(values),
+		.size =  ARRAY_SIZE(values),
+	};
+	struct min_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.less = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, err;
+
+	/* Test with known set of values. */
+	min_heapify_all(&heap, &funcs);
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+
+	/* Test with randomly generated values. */
+	heap.nr = ARRAY_SIZE(values);
+	for (i = 0; i < heap.nr; i++)
+		values[i] = get_random_int();
+
+	min_heapify_all(&heap, &funcs);
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static __init int test_heap_push(bool min_heap)
+{
+	const int data[] = { 3, 1, 2, 4, 0x80000000, 0x7FFFFFFF, 0,
+			     -3, -1, -2, -4, 0x80000000, 0x7FFFFFFF };
+	int values[ARRAY_SIZE(data)];
+	struct min_heap heap = {
+		.data = values,
+		.nr = 0,
+		.size =  ARRAY_SIZE(values),
+	};
+	struct min_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.less = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, temp, err;
+
+	/* Test with known set of values copied from data. */
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_heap_push(&heap, &data[i], &funcs);
+
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+	/* Test with randomly generated values. */
+	while (heap.nr < heap.size) {
+		temp = get_random_int();
+		min_heap_push(&heap, &temp, &funcs);
+	}
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static __init int test_heap_pop_push(bool min_heap)
+{
+	const int data[] = { 3, 1, 2, 4, 0x80000000, 0x7FFFFFFF, 0,
+			     -3, -1, -2, -4, 0x80000000, 0x7FFFFFFF };
+	int values[ARRAY_SIZE(data)];
+	struct min_heap heap = {
+		.data = values,
+		.nr = 0,
+		.size = ARRAY_SIZE(values),
+	};
+	struct min_heap_callbacks funcs = {
+		.elem_size = sizeof(int),
+		.less = min_heap ? less_than : greater_than,
+		.swp = swap_ints,
+	};
+	int i, temp, err;
+
+	/* Fill values with data to pop and replace. */
+	temp = min_heap ? 0x80000000 : 0x7FFFFFFF;
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_heap_push(&heap, &temp, &funcs);
+
+	/* Test with known set of values copied from data. */
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_heap_pop_push(&heap, &data[i], &funcs);
+
+	err = pop_verify_heap(min_heap, &heap, &funcs);
+
+	heap.nr = 0;
+	for (i = 0; i < ARRAY_SIZE(data); i++)
+		min_heap_push(&heap, &temp, &funcs);
+
+	/* Test with randomly generated values. */
+	for (i = 0; i < ARRAY_SIZE(data); i++) {
+		temp = get_random_int();
+		min_heap_pop_push(&heap, &temp, &funcs);
+	}
+	err += pop_verify_heap(min_heap, &heap, &funcs);
+
+	return err;
+}
+
+static int __init test_min_heap_init(void)
+{
+	int err = 0;
+
+	err += test_heapify_all(true);
+	err += test_heapify_all(false);
+	err += test_heap_push(true);
+	err += test_heap_push(false);
+	err += test_heap_pop_push(true);
+	err += test_heap_pop_push(false);
+	if (err) {
+		pr_err("test failed with %d errors\n", err);
+		return -EINVAL;
+	}
+	pr_info("test passed\n");
+	return 0;
+}
+module_init(test_min_heap_init);
+
+static void __exit test_min_heap_exit(void)
+{
+	/* do nothing */
+}
+module_exit(test_min_heap_exit);
+
+MODULE_LICENSE("GPL");

^ permalink raw reply related	[flat|nested] 80+ messages in thread
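
A quick way to exercise this API outside the boot-time self-test above is a
tiny demo module (the test itself reports results via pr_info()/pr_err() in
the kernel log). The sketch below is hypothetical: the module name and sample
values are made up, but it only uses the interfaces introduced by this patch
(struct min_heap, struct min_heap_callbacks, min_heapify_all(),
min_heap_pop()) and prints the values in ascending order.

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/min_heap.h>
#include <linux/module.h>
#include <linux/printk.h>

static bool demo_less(const void *lhs, const void *rhs)
{
	return *(const int *)lhs < *(const int *)rhs;
}

static void demo_swap(void *lhs, void *rhs)
{
	int tmp = *(int *)lhs;

	*(int *)lhs = *(int *)rhs;
	*(int *)rhs = tmp;
}

static int __init min_heap_demo_init(void)
{
	int values[] = { 4, 1, 3, 2 };
	struct min_heap heap = {
		.data = values,
		.nr = ARRAY_SIZE(values),
		.size = ARRAY_SIZE(values),
	};
	const struct min_heap_callbacks funcs = {
		.elem_size = sizeof(int),
		.less = demo_less,
		.swp = demo_swap,
	};

	min_heapify_all(&heap, &funcs);		/* bottom-up build of the heap */
	while (heap.nr) {
		pr_info("min: %d\n", values[0]);	/* root is the current minimum */
		min_heap_pop(&heap, &funcs);		/* remove root, sift down */
	}
	return 0;
}
module_init(min_heap_demo_init);

MODULE_LICENSE("GPL");

The real test module in this patch (built when CONFIG_TEST_MIN_HEAP is set)
follows the same pattern but also covers max-heap ordering, min_heap_push()
and min_heap_pop_push().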

* [tip: perf/core] perf/core: Use min_heap in visit_groups_merge()
  2020-02-14  7:51       ` [PATCH v6 3/6] perf: Use min_heap in visit_groups_merge Ian Rogers
  2020-02-17 17:23         ` Peter Zijlstra
@ 2020-03-06 14:42         ` tip-bot2 for Ian Rogers
  1 sibling, 0 replies; 80+ messages in thread
From: tip-bot2 for Ian Rogers @ 2020-03-06 14:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ian Rogers, Peter Zijlstra (Intel), Ingo Molnar, x86, LKML

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     6eef8a7116deae0706ba6d897c0d7dd887cd2be2
Gitweb:        https://git.kernel.org/tip/6eef8a7116deae0706ba6d897c0d7dd887cd2be2
Author:        Ian Rogers <irogers@google.com>
AuthorDate:    Thu, 13 Feb 2020 23:51:30 -08:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Fri, 06 Mar 2020 11:56:59 +01:00

perf/core: Use min_heap in visit_groups_merge()

visit_groups_merge() picks the next event based on when it was
inserted into the context (perf_event group_index). Events may be per-CPU
or any-CPU; in the future we'd also like per-cgroup events, so that a
cgroup switch does not have to search every event for the ones to
schedule. Introduce a min heap of events that maintains the property that
the earliest-inserted event is always at element 0. Initialize the heap
with the per-CPU and any-CPU events for the context.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-4-irogers@google.com
---
 kernel/events/core.c | 67 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 51 insertions(+), 16 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index dceeeb1..ddfb06c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,6 +49,7 @@
 #include <linux/sched/mm.h>
 #include <linux/proc_ns.h>
 #include <linux/mount.h>
+#include <linux/min_heap.h>
 
 #include "internal.h"
 
@@ -3392,32 +3393,66 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
 	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
 }
 
-static int visit_groups_merge(struct perf_event_groups *groups, int cpu,
-			      int (*func)(struct perf_event *, void *), void *data)
+static bool perf_less_group_idx(const void *l, const void *r)
 {
-	struct perf_event **evt, *evt1, *evt2;
+	const struct perf_event *le = l, *re = r;
+
+	return le->group_index < re->group_index;
+}
+
+static void swap_ptr(void *l, void *r)
+{
+	void **lp = l, **rp = r;
+
+	swap(*lp, *rp);
+}
+
+static const struct min_heap_callbacks perf_min_heap = {
+	.elem_size = sizeof(struct perf_event *),
+	.less = perf_less_group_idx,
+	.swp = swap_ptr,
+};
+
+static void __heap_add(struct min_heap *heap, struct perf_event *event)
+{
+	struct perf_event **itrs = heap->data;
+
+	if (event) {
+		itrs[heap->nr] = event;
+		heap->nr++;
+	}
+}
+
+static noinline int visit_groups_merge(struct perf_event_groups *groups,
+				int cpu,
+				int (*func)(struct perf_event *, void *),
+				void *data)
+{
+	/* Space for per CPU and/or any CPU event iterators. */
+	struct perf_event *itrs[2];
+	struct min_heap event_heap = {
+		.data = itrs,
+		.nr = 0,
+		.size = ARRAY_SIZE(itrs),
+	};
+	struct perf_event **evt = event_heap.data;
 	int ret;
 
-	evt1 = perf_event_groups_first(groups, -1);
-	evt2 = perf_event_groups_first(groups, cpu);
+	__heap_add(&event_heap, perf_event_groups_first(groups, -1));
+	__heap_add(&event_heap, perf_event_groups_first(groups, cpu));
 
-	while (evt1 || evt2) {
-		if (evt1 && evt2) {
-			if (evt1->group_index < evt2->group_index)
-				evt = &evt1;
-			else
-				evt = &evt2;
-		} else if (evt1) {
-			evt = &evt1;
-		} else {
-			evt = &evt2;
-		}
+	min_heapify_all(&event_heap, &perf_min_heap);
 
+	while (event_heap.nr) {
 		ret = func(*evt, data);
 		if (ret)
 			return ret;
 
 		*evt = perf_event_groups_next(*evt);
+		if (*evt)
+			min_heapify(&event_heap, 0, &perf_min_heap);
+		else
+			min_heap_pop(&event_heap, &perf_min_heap);
 	}
 
 	return 0;

^ permalink raw reply related	[flat|nested] 80+ messages in thread
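
For readers less familiar with the pattern, the loop above is a k-way merge
driven by a min heap of iterators: the root is always the iterator whose
current event has the smallest group_index; after visiting it, the iterator
is advanced and sifted down, or popped once exhausted. The toy user-space
sketch below (a made-up struct iter, not kernel code) shows the same
discipline; with exactly two iterators it degenerates into the old evt1/evt2
comparison that this commit removes.

#include <stdio.h>

struct iter {
	const int *idx;		/* group_index values, already sorted */
	int pos, len;
};

static void sift_down(struct iter **h, int nr, int i)
{
	for (;;) {
		int l = 2 * i + 1, r = l + 1, m = i;
		struct iter *tmp;

		if (l < nr && h[l]->idx[h[l]->pos] < h[m]->idx[h[m]->pos])
			m = l;
		if (r < nr && h[r]->idx[h[r]->pos] < h[m]->idx[h[m]->pos])
			m = r;
		if (m == i)
			return;
		tmp = h[i]; h[i] = h[m]; h[m] = tmp;
		i = m;
	}
}

static void merge(struct iter **h, int nr)
{
	int i;

	for (i = nr / 2 - 1; i >= 0; i--)	/* min_heapify_all() */
		sift_down(h, nr, i);

	while (nr) {
		struct iter *it = h[0];		/* smallest group_index */

		printf("%d\n", it->idx[it->pos]);	/* func(*evt, data) */

		if (++it->pos < it->len) {	/* perf_event_groups_next() */
			sift_down(h, nr, 0);	/* min_heapify() */
		} else {
			h[0] = h[--nr];		/* min_heap_pop() */
			if (nr)
				sift_down(h, nr, 0);
		}
	}
}

int main(void)
{
	const int a[] = { 1, 4, 7 }, b[] = { 2, 3, 9 };
	struct iter ia = { a, 0, 3 }, ib = { b, 0, 3 };
	struct iter *heap[] = { &ia, &ib };

	merge(heap, 2);		/* prints 1 2 3 4 7 9 */
	return 0;
}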

* [tip: perf/core] perf/cgroup: Reorder perf_cgroup_connect()
  2020-02-14  7:51       ` [PATCH v6 1/6] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
  2020-02-14 16:11         ` Shuah Khan
@ 2020-03-06 14:42         ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 80+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-03-06 14:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Ian Rogers, Ingo Molnar, x86, LKML

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     98add2af89bbfe8241e189b490fd91e5751c7900
Gitweb:        https://git.kernel.org/tip/98add2af89bbfe8241e189b490fd91e5751c7900
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Thu, 13 Feb 2020 23:51:28 -08:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Fri, 06 Mar 2020 11:56:58 +01:00

perf/cgroup: Reorder perf_cgroup_connect()

Move the perf_cgroup_connect() call later in perf_event_alloc(), after
perf_init_event(), such that we can find/use the PMU's cpu context.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-2-irogers@google.com
---
 kernel/events/core.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index b7eaaba..dceeeb1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10774,12 +10774,6 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (!has_branch_stack(event))
 		event->attr.branch_sample_type = 0;
 
-	if (cgroup_fd != -1) {
-		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
-		if (err)
-			goto err_ns;
-	}
-
 	pmu = perf_init_event(event);
 	if (IS_ERR(pmu)) {
 		err = PTR_ERR(pmu);
@@ -10801,6 +10795,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		goto err_pmu;
 	}
 
+	if (cgroup_fd != -1) {
+		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
+		if (err)
+			goto err_pmu;
+	}
+
 	err = exclusive_event_init(event);
 	if (err)
 		goto err_pmu;
@@ -10861,12 +10861,12 @@ err_per_task:
 	exclusive_event_destroy(event);
 
 err_pmu:
+	if (is_cgroup_event(event))
+		perf_detach_cgroup(event);
 	if (event->destroy)
 		event->destroy(event);
 	module_put(pmu->module);
 err_ns:
-	if (is_cgroup_event(event))
-		perf_detach_cgroup(event);
 	if (event->ns)
 		put_pid_ns(event->ns);
 	if (event->hw.target)

^ permalink raw reply related	[flat|nested] 80+ messages in thread
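
Note how the error-path hunk mirrors the reorder: teardown labels undo steps
in the reverse order of setup, so once the cgroup is connected after
perf_init_event(), perf_detach_cgroup() has to move from err_ns up to
err_pmu. A generic, self-contained sketch of that goto-unwind pattern is
below; the setup_*/undo_* names are stand-ins, not perf functions.

#include <stdio.h>

/* Stand-ins for the real init/teardown steps. */
static int setup_a(void) { puts("setup a"); return 0; }	/* think: perf_init_event() */
static int setup_b(void) { puts("setup b"); return 0; }	/* think: perf_cgroup_connect() */
static int setup_c(void) { puts("setup c"); return -1; }	/* later step; forced to fail */
static void undo_b(void) { puts("undo b"); }		/* think: perf_detach_cgroup() */
static void undo_a(void) { puts("undo a"); }

static int create_object(void)
{
	int err;

	err = setup_a();
	if (err)
		goto err_none;

	err = setup_b();
	if (err)
		goto err_a;	/* only unwind what already succeeded */

	err = setup_c();
	if (err)
		goto err_b;

	return 0;

err_b:
	undo_b();		/* reverse order of setup */
err_a:
	undo_a();
err_none:
	return err;
}

int main(void)
{
	/* Prints: setup a, setup b, setup c, undo b, undo a. */
	return create_object() ? 1 : 0;
}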

end of thread, other threads:[~2020-03-06 14:44 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-14  0:30 [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
2019-11-14  0:30 ` [PATCH v3 01/10] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
2019-11-14  8:50   ` Peter Zijlstra
2019-11-14  0:30 ` [PATCH v3 02/10] lib: introduce generic min max heap Ian Rogers
2019-11-14  9:32   ` Peter Zijlstra
2019-11-14  9:35   ` Peter Zijlstra
2019-11-17 18:28   ` Joe Perches
2019-11-18  8:40     ` Peter Zijlstra
2019-11-18 11:50       ` Joe Perches
2019-11-18 12:21         ` Peter Zijlstra
2019-11-14  0:30 ` [PATCH v3 03/10] perf: Use min_max_heap in visit_groups_merge Ian Rogers
2019-11-14  9:39   ` Peter Zijlstra
2019-11-14  0:30 ` [PATCH v3 04/10] perf: Add per perf_cpu_context min_heap storage Ian Rogers
2019-11-14  9:51   ` Peter Zijlstra
2019-11-16  1:19     ` Ian Rogers
2019-11-14  0:30 ` [PATCH v3 05/10] perf/cgroup: Grow per perf_cpu_context heap storage Ian Rogers
2019-11-14  9:54   ` Peter Zijlstra
2019-11-14  0:30 ` [PATCH v3 06/10] perf/cgroup: Order events in RB tree by cgroup id Ian Rogers
2019-11-14  0:30 ` [PATCH v3 07/10] perf: simplify and rename visit_groups_merge Ian Rogers
2019-11-14 10:03   ` Peter Zijlstra
2019-11-16  1:20     ` Ian Rogers
2019-11-14  0:30 ` [PATCH v3 08/10] perf: cache perf_event_groups_first for cgroups Ian Rogers
2019-11-14 10:25   ` Peter Zijlstra
2019-11-16  1:20     ` Ian Rogers
2019-11-18  8:37       ` Peter Zijlstra
2019-11-14  0:30 ` [PATCH v3 09/10] perf: optimize event_filter_match during sched_in Ian Rogers
2019-11-14  0:30 ` [PATCH v3 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch Ian Rogers
2019-11-14 10:43   ` Peter Zijlstra
2019-11-14 13:46     ` Liang, Kan
2019-11-14 13:57       ` Peter Zijlstra
2019-11-14 15:16         ` Liang, Kan
2019-11-14 15:24           ` Liang, Kan
2019-11-14 20:49             ` Liang, Kan
2019-11-14  0:42 ` [PATCH v3 00/10] Optimize cgroup context switch Ian Rogers
2019-11-14 10:45 ` Peter Zijlstra
2019-11-14 18:17   ` Ian Rogers
2019-12-06 23:16     ` Ian Rogers
2019-11-16  1:18 ` [PATCH v4 " Ian Rogers
2019-11-16  1:18   ` [PATCH v4 01/10] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
2019-11-16  1:18   ` [PATCH v4 02/10] lib: introduce generic min max heap Ian Rogers
2019-11-21 11:11     ` Joe Perches
2019-11-16  1:18   ` [PATCH v4 03/10] perf: Use min_max_heap in visit_groups_merge Ian Rogers
2019-11-16  1:18   ` [PATCH v4 04/10] perf: Add per perf_cpu_context min_heap storage Ian Rogers
2019-11-16  1:18   ` [PATCH v4 05/10] perf/cgroup: Grow per perf_cpu_context heap storage Ian Rogers
2019-11-16  1:18   ` [PATCH v4 06/10] perf/cgroup: Order events in RB tree by cgroup id Ian Rogers
2019-11-16  1:18   ` [PATCH v4 07/10] perf: simplify and rename visit_groups_merge Ian Rogers
2019-11-16  1:18   ` [PATCH v4 08/10] perf: cache perf_event_groups_first for cgroups Ian Rogers
2019-11-16  1:18   ` [PATCH v4 09/10] perf: optimize event_filter_match during sched_in Ian Rogers
2019-11-16  1:18   ` [PATCH v4 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch Ian Rogers
2019-12-06 23:15   ` [PATCH v5 00/10] Optimize cgroup context switch Ian Rogers
2019-12-06 23:15     ` [PATCH v5 01/10] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
2019-12-06 23:15     ` [PATCH v5 02/10] lib: introduce generic min-heap Ian Rogers
2019-12-06 23:15     ` [PATCH v5 03/10] perf: Use min_max_heap in visit_groups_merge Ian Rogers
2019-12-08  7:10       ` kbuild test robot
2019-12-06 23:15     ` [PATCH v5 04/10] perf: Add per perf_cpu_context min_heap storage Ian Rogers
2019-12-06 23:15     ` [PATCH v5 05/10] perf/cgroup: Grow per perf_cpu_context heap storage Ian Rogers
2019-12-06 23:15     ` [PATCH v5 06/10] perf/cgroup: Order events in RB tree by cgroup id Ian Rogers
2019-12-06 23:15     ` [PATCH v5 07/10] perf: simplify and rename visit_groups_merge Ian Rogers
2019-12-06 23:15     ` [PATCH v5 08/10] perf: cache perf_event_groups_first for cgroups Ian Rogers
2019-12-06 23:15     ` [PATCH v5 09/10] perf: optimize event_filter_match during sched_in Ian Rogers
2019-12-06 23:15     ` [PATCH v5 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch Ian Rogers
2020-02-14  7:51     ` [PATCH v6 0/6] Optimize cgroup context switch Ian Rogers
2020-02-14  7:51       ` [PATCH v6 1/6] perf/cgroup: Reorder perf_cgroup_connect() Ian Rogers
2020-02-14 16:11         ` Shuah Khan
2020-02-14 17:37           ` Peter Zijlstra
2020-03-06 14:42         ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
2020-02-14  7:51       ` [PATCH v6 2/6] lib: introduce generic min-heap Ian Rogers
2020-02-14 22:06         ` Randy Dunlap
2020-02-17 16:29         ` Peter Zijlstra
2020-03-06 14:42         ` [tip: perf/core] lib: Introduce " tip-bot2 for Ian Rogers
2020-02-14  7:51       ` [PATCH v6 3/6] perf: Use min_heap in visit_groups_merge Ian Rogers
2020-02-17 17:23         ` Peter Zijlstra
2020-03-06 14:42         ` [tip: perf/core] perf/core: Use min_heap in visit_groups_merge() tip-bot2 for Ian Rogers
2020-02-14  7:51       ` [PATCH v6 4/6] perf: Add per perf_cpu_context min_heap storage Ian Rogers
2020-03-06 14:42         ` [tip: perf/core] perf/core: " tip-bot2 for Ian Rogers
2020-02-14  7:51       ` [PATCH v6 5/6] perf/cgroup: Grow per perf_cpu_context heap storage Ian Rogers
2020-03-06 14:42         ` [tip: perf/core] " tip-bot2 for Ian Rogers
2020-02-14  7:51       ` [PATCH v6 6/6] perf/cgroup: Order events in RB tree by cgroup id Ian Rogers
2020-02-14 19:32       ` [PATCH v6 0/6] Optimize cgroup context switch Ian Rogers
2020-02-17 16:18       ` Peter Zijlstra
