linux-kernel.vger.kernel.org archive mirror
* [RFC 0/2] perf core: Sharing events with multiple cgroups
@ 2021-03-23 16:21 Namhyung Kim
  2021-03-23 16:21 ` [PATCH 1/2] perf/core: Share an event " Namhyung Kim
  2021-03-23 16:21 ` [PATCH 2/2] perf/core: Support reading group events with shared cgroups Namhyung Kim
  0 siblings, 2 replies; 16+ messages in thread
From: Namhyung Kim @ 2021-03-23 16:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

Hello,

This work is to make perf stat more scalable with a lot of cgroups.

Currently we need to open a separate perf_event to count an event in a
cgroup.  For a big machine, this requires lots of events like

  256 cpu x 8 events x 200 cgroups = 409600 events

This is very wasteful and not scalable.  In this case, perf stat
actually counts exactly the same events for each cgroup.  I think we
can just use a single event to measure all cgroups running on that
cpu; the same measurement would then need only 256 cpu x 8 events =
2048 events, regardless of the number of cgroups.

So I added new ioctl commands to add per-cgroup counters to an
existing perf_event and to read the per-cgroup counters from the
event.  The per-cgroup counters are updated during the context switch
if the cgroups of the outgoing and incoming tasks differ (with no need
to reprogram the HW PMU).  The event keeps the counters in a hash
table keyed by cgroup id.

With this change, the average processing time of my internal test
workload, which runs tasks in different cgroups and communicates via
pipes, dropped from 11.3 usec to 5.8 usec.

Thanks,
Namhyung


Namhyung Kim (2):
  perf/core: Share an event with multiple cgroups
  perf/core: Support reading group events with shared cgroups

 include/linux/perf_event.h      |  22 ++
 include/uapi/linux/perf_event.h |   2 +
 kernel/events/core.c            | 588 ++++++++++++++++++++++++++++++--
 3 files changed, 585 insertions(+), 27 deletions(-)

-- 
2.31.0.rc2.261.g7f71774620-goog



* [PATCH 1/2] perf/core: Share an event with multiple cgroups
  2021-03-23 16:21 [RFC 0/2] perf core: Sharing events with multiple cgroups Namhyung Kim
@ 2021-03-23 16:21 ` Namhyung Kim
  2021-03-24  0:30   ` Song Liu
                     ` (2 more replies)
  2021-03-23 16:21 ` [PATCH 2/2] perf/core: Support reading group events with shared cgroups Namhyung Kim
  1 sibling, 3 replies; 16+ messages in thread
From: Namhyung Kim @ 2021-03-23 16:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu, Tejun Heo

As we can run many jobs (in containers) on a big machine, we want to
measure each job's performance during the run.  To do that, a
perf_event can be associated with a cgroup so that it measures that
cgroup only.

However, such cgroup events need to be opened separately, which causes
significant overhead in event multiplexing during context switches as
well as resource consumption in file descriptors and memory footprint.

As a cgroup event is basically a cpu event, we can share a single cpu
event for multiple cgroups.  All we need is a separate counter (and
two timing variables) for each cgroup.  I added a hash table to map
from a cgroup id to the attached cgroup node.

With this change, the cpu event calculates a delta of the event
counter values when the cgroups of the current and the next task are
different, and attributes the delta to the current task's cgroup.

This patch adds two new ioctl commands to perf_event for light-weight
cgroup event counting (i.e. perf stat).

 * PERF_EVENT_IOC_ATTACH_CGROUP - takes a buffer consisting of a
     64-bit array to attach the given cgroups.  The first element is
     the number of cgroups in the buffer, and the rest is a list of
     cgroup ids whose counters should be added to the given event.

 * PERF_EVENT_IOC_READ_CGROUP - takes a buffer consisting of a 64-bit
     array to read the event counter values.  The first element is the
     size of the array in bytes, and the second element is the cgroup
     id to read.  The rest of the buffer is filled with the counter
     value and timings.

This attaches all cgroups in a single syscall; I deliberately didn't
add a DETACH command, to keep the implementation simple.  The attached
cgroup nodes are deleted when the file descriptor of the perf_event is
closed.
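
Below is a rough sketch of the expected usage from user space.  This
is not part of the patch; it only follows the buffer layouts described
above, and assumes the 64-bit cgroup ids are obtained elsewhere (e.g.
via name_to_handle_at() on the cgroup v2 directories).

  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/perf_event.h>

  #ifndef PERF_EVENT_IOC_ATTACH_CGROUP
  #define PERF_EVENT_IOC_ATTACH_CGROUP	_IOW('$', 12, __u64 *)
  #define PERF_EVENT_IOC_READ_CGROUP	_IOWR('$', 13, __u64 *)
  #endif

  /* attach 'nr' cgroups to one (per-cpu) event fd in a single call */
  static int attach_cgroups(int event_fd, const uint64_t *ids, uint64_t nr)
  {
  	uint64_t *buf;
  	int ret;

  	buf = malloc((nr + 1) * sizeof(*buf));
  	if (buf == NULL)
  		return -1;

  	buf[0] = nr;				 /* number of cgroup ids */
  	memcpy(&buf[1], ids, nr * sizeof(*ids)); /* the cgroup ids      */

  	ret = ioctl(event_fd, PERF_EVENT_IOC_ATTACH_CGROUP, buf);
  	free(buf);
  	return ret;
  }

  /* read one cgroup's counter value (no PERF_FORMAT_GROUP here) */
  static int read_cgroup_count(int event_fd, uint64_t cgrp_id, uint64_t *count)
  {
  	uint64_t buf[6] = { sizeof(buf), cgrp_id };

  	/* the kernel fills buf[2..] with the count (and timings/id,
  	 * depending on read_format) and returns the bytes written */
  	if (ioctl(event_fd, PERF_EVENT_IOC_READ_CGROUP, buf) < 0)
  		return -1;

  	*count = buf[2];
  	return 0;
  }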

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 include/linux/perf_event.h      |  22 ++
 include/uapi/linux/perf_event.h |   2 +
 kernel/events/core.c            | 474 ++++++++++++++++++++++++++++++--
 3 files changed, 471 insertions(+), 27 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3f7f89ea5e51..2760f3b07534 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -771,6 +771,18 @@ struct perf_event {
 
 #ifdef CONFIG_CGROUP_PERF
 	struct perf_cgroup		*cgrp; /* cgroup event is attach to */
+
+	/* to share an event for multiple cgroups */
+	struct hlist_head		*cgrp_node_hash;
+	struct perf_cgroup_node		*cgrp_node_entries;
+	int				nr_cgrp_nodes;
+	int				cgrp_node_hash_bits;
+
+	struct list_head		cgrp_node_entry;
+
+	u64				cgrp_node_count;
+	u64				cgrp_node_time_enabled;
+	u64				cgrp_node_time_running;
 #endif
 
 #ifdef CONFIG_SECURITY
@@ -780,6 +792,14 @@ struct perf_event {
 #endif /* CONFIG_PERF_EVENTS */
 };
 
+struct perf_cgroup_node {
+	struct hlist_node		node;
+	u64				id;
+	u64				count;
+	u64				time_enabled;
+	u64				time_running;
+	u64				padding[2];
+};
 
 struct perf_event_groups {
 	struct rb_root	tree;
@@ -843,6 +863,8 @@ struct perf_event_context {
 	int				pin_count;
 #ifdef CONFIG_CGROUP_PERF
 	int				nr_cgroups;	 /* cgroup evts */
+	struct list_head		cgrp_node_list;
+	struct list_head		cgrp_ctx_entry;
 #endif
 	void				*task_ctx_data; /* pmu specific data */
 	struct rcu_head			rcu_head;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ad15e40d7f5d..06bc7ab13616 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -479,6 +479,8 @@ struct perf_event_query_bpf {
 #define PERF_EVENT_IOC_PAUSE_OUTPUT		_IOW('$', 9, __u32)
 #define PERF_EVENT_IOC_QUERY_BPF		_IOWR('$', 10, struct perf_event_query_bpf *)
 #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES	_IOW('$', 11, struct perf_event_attr *)
+#define PERF_EVENT_IOC_ATTACH_CGROUP		_IOW('$', 12, __u64 *)
+#define PERF_EVENT_IOC_READ_CGROUP		_IOWR('$', 13, __u64 *)
 
 enum perf_event_ioc_flags {
 	PERF_IOC_FLAG_GROUP		= 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f07943183041..38c26a23418a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -379,6 +379,7 @@ enum event_type_t {
  * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
  */
 
+static void perf_sched_enable(void);
 static void perf_sched_delayed(struct work_struct *work);
 DEFINE_STATIC_KEY_FALSE(perf_sched_events);
 static DECLARE_DELAYED_WORK(perf_sched_work, perf_sched_delayed);
@@ -2124,6 +2125,322 @@ static int perf_get_aux_event(struct perf_event *event,
 	return 1;
 }
 
+#ifdef CONFIG_CGROUP_PERF
+static DEFINE_PER_CPU(struct list_head, cgroup_ctx_list);
+
+static bool event_can_attach_cgroup(struct perf_event *event)
+{
+	if (is_sampling_event(event))
+		return false;
+	if (event->attach_state & PERF_ATTACH_TASK)
+		return false;
+	if (is_cgroup_event(event))
+		return false;
+
+	return true;
+}
+
+static bool event_has_cgroup_node(struct perf_event *event)
+{
+	return event->nr_cgrp_nodes > 0;
+}
+
+static struct perf_cgroup_node *
+find_cgroup_node(struct perf_event *event, u64 cgrp_id)
+{
+	struct perf_cgroup_node *cgrp_node;
+	int key = hash_64(cgrp_id, event->cgrp_node_hash_bits);
+
+	hlist_for_each_entry(cgrp_node, &event->cgrp_node_hash[key], node) {
+		if (cgrp_node->id == cgrp_id)
+			return cgrp_node;
+	}
+
+	return NULL;
+}
+
+static void perf_update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
+{
+	u64 delta_count, delta_time_enabled, delta_time_running;
+	int i;
+
+	if (event->cgrp_node_count == 0)
+		goto out;
+
+	delta_count = local64_read(&event->count) - event->cgrp_node_count;
+	delta_time_enabled = event->total_time_enabled - event->cgrp_node_time_enabled;
+	delta_time_running = event->total_time_running - event->cgrp_node_time_running;
+
+	/* account delta to all ancestor cgroups */
+	for (i = 0; i <= cgrp->level; i++) {
+		struct perf_cgroup_node *node;
+
+		node = find_cgroup_node(event, cgrp->ancestor_ids[i]);
+		if (node) {
+			node->count += delta_count;
+			node->time_enabled += delta_time_enabled;
+			node->time_running += delta_time_running;
+		}
+	}
+
+out:
+	event->cgrp_node_count = local64_read(&event->count);
+	event->cgrp_node_time_enabled = event->total_time_enabled;
+	event->cgrp_node_time_running = event->total_time_running;
+}
+
+static void update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
+{
+	if (event->state == PERF_EVENT_STATE_ACTIVE)
+		event->pmu->read(event);
+
+	perf_event_update_time(event);
+	perf_update_cgroup_node(event, cgrp);
+}
+
+/* this is called from context switch */
+static void update_cgroup_node_events(struct perf_event_context *ctx,
+				      struct cgroup *cgrp)
+{
+	struct perf_event *event;
+
+	lockdep_assert_held(&ctx->lock);
+
+	if (ctx->is_active & EVENT_TIME)
+		update_context_time(ctx);
+
+	list_for_each_entry(event, &ctx->cgrp_node_list, cgrp_node_entry)
+		update_cgroup_node(event, cgrp);
+}
+
+static void cgroup_node_sched_out(struct task_struct *task)
+{
+	struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
+	struct perf_cgroup *cgrp = perf_cgroup_from_task(task, NULL);
+	struct perf_event_context *ctx;
+
+	list_for_each_entry(ctx, cgrp_ctx_list, cgrp_ctx_entry) {
+		raw_spin_lock(&ctx->lock);
+		update_cgroup_node_events(ctx, cgrp->css.cgroup);
+		raw_spin_unlock(&ctx->lock);
+	}
+}
+
+/* these are called when the event is enabled/disabled */
+static void perf_add_cgrp_node_list(struct perf_event *event,
+				    struct perf_event_context *ctx)
+{
+	struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
+	struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+	bool is_first;
+
+	lockdep_assert_irqs_disabled();
+	lockdep_assert_held(&ctx->lock);
+
+	is_first = list_empty(&ctx->cgrp_node_list);
+	list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);
+
+	if (is_first)
+		list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);
+
+	update_cgroup_node(event, cgrp->css.cgroup);
+}
+
+static void perf_del_cgrp_node_list(struct perf_event *event,
+				    struct perf_event_context *ctx)
+{
+	struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+
+	lockdep_assert_irqs_disabled();
+	lockdep_assert_held(&ctx->lock);
+
+	update_cgroup_node(event, cgrp->css.cgroup);
+	/* to refresh delta when it's enabled */
+	event->cgrp_node_count = 0;
+
+	list_del(&event->cgrp_node_entry);
+
+	if (list_empty(&ctx->cgrp_node_list))
+		list_del(&ctx->cgrp_ctx_entry);
+}
+
+static void perf_attach_cgroup_node(struct perf_event *event,
+				    struct perf_cpu_context *cpuctx,
+				    struct perf_event_context *ctx,
+				    void *data)
+{
+	if (ctx->is_active & EVENT_TIME)
+		update_context_time(ctx);
+
+	perf_add_cgrp_node_list(event, ctx);
+}
+
+#define MIN_CGRP_NODE_HASH  4
+#define MAX_CGRP_NODE_HASH  (4 * 1024)
+
+/* this is called from ioctl */
+static int perf_event_attach_cgroup_node(struct perf_event *event, u64 nr_cgrps,
+					 u64 *cgroup_ids)
+{
+	struct perf_cgroup_node *cgrp_node;
+	struct perf_event_context *ctx = event->ctx;
+	struct hlist_head *cgrp_node_hash;
+	int node = (event->cpu >= 0) ? cpu_to_node(event->cpu) : -1;
+	unsigned long flags;
+	bool is_first = true;
+	bool enabled;
+	int i, nr_hash;
+	int hash_bits;
+
+	if (nr_cgrps < MIN_CGRP_NODE_HASH)
+		nr_hash = MIN_CGRP_NODE_HASH;
+	else
+		nr_hash = roundup_pow_of_two(nr_cgrps);
+	hash_bits = ilog2(nr_hash);
+
+	cgrp_node_hash = kcalloc_node(nr_hash, sizeof(*cgrp_node_hash),
+				      GFP_KERNEL, node);
+	if (cgrp_node_hash == NULL)
+		return -ENOMEM;
+
+	cgrp_node = kcalloc_node(nr_cgrps, sizeof(*cgrp_node), GFP_KERNEL, node);
+	if (cgrp_node == NULL) {
+		kfree(cgrp_node_hash);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < (int)nr_cgrps; i++) {
+		int key = hash_64(cgroup_ids[i], hash_bits);
+
+		cgrp_node[i].id = cgroup_ids[i];
+		hlist_add_head(&cgrp_node[i].node, &cgrp_node_hash[key]);
+	}
+
+	raw_spin_lock_irqsave(&ctx->lock, flags);
+
+	enabled = event->state >= PERF_EVENT_STATE_INACTIVE;
+
+	if (event->nr_cgrp_nodes != 0) {
+		kfree(event->cgrp_node_hash);
+		kfree(event->cgrp_node_entries);
+		is_first = false;
+	}
+
+	event->cgrp_node_hash = cgrp_node_hash;
+	event->cgrp_node_entries = cgrp_node;
+	event->cgrp_node_hash_bits = hash_bits;
+	event->nr_cgrp_nodes = nr_cgrps;
+
+	raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+	if (is_first && enabled)
+		event_function_call(event, perf_attach_cgroup_node, NULL);
+
+	return 0;
+}
+
+static void perf_event_destroy_cgroup_nodes(struct perf_event *event)
+{
+	struct perf_event_context *ctx = event->ctx;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&ctx->lock, flags);
+
+	if (event_has_cgroup_node(event)) {
+		if (!atomic_add_unless(&perf_sched_count, -1, 1))
+			schedule_delayed_work(&perf_sched_work, HZ);
+	}
+
+	kfree(event->cgrp_node_hash);
+	kfree(event->cgrp_node_entries);
+	event->nr_cgrp_nodes = 0;
+
+	raw_spin_unlock_irqrestore(&ctx->lock, flags);
+}
+
+static int perf_event_read(struct perf_event *event, bool group);
+
+static void __perf_read_cgroup_node(struct perf_event *event)
+{
+	struct perf_cgroup *cgrp;
+
+	if (event_has_cgroup_node(event)) {
+		cgrp = perf_cgroup_from_task(current, NULL);
+		perf_update_cgroup_node(event, cgrp->css.cgroup);
+	}
+}
+
+static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
+				       u64 cgrp_id, char __user *buf)
+{
+	struct perf_cgroup_node *cgrp;
+	struct perf_event_context *ctx = event->ctx;
+	unsigned long flags;
+	u64 read_format = event->attr.read_format;
+	u64 values[4];
+	int n = 0;
+
+	/* update event count and times (possibly run on other cpu) */
+	(void)perf_event_read(event, false);
+
+	raw_spin_lock_irqsave(&ctx->lock, flags);
+
+	cgrp = find_cgroup_node(event, cgrp_id);
+	if (cgrp == NULL) {
+		raw_spin_unlock_irqrestore(&ctx->lock, flags);
+		return -ENOENT;
+	}
+
+	values[n++] = cgrp->count;
+	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
+		values[n++] = cgrp->time_enabled;
+	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
+		values[n++] = cgrp->time_running;
+	if (read_format & PERF_FORMAT_ID)
+		values[n++] = primary_event_id(event);
+
+	raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+	if (copy_to_user(buf, values, n * sizeof(u64)))
+		return -EFAULT;
+
+	return n * sizeof(u64);
+}
+#else  /* !CONFIG_CGROUP_PERF */
+static inline bool event_can_attach_cgroup(struct perf_event *event)
+{
+	return false;
+}
+
+static inline bool event_has_cgroup_node(struct perf_event *event)
+{
+	return false;
+}
+
+static void cgroup_node_sched_out(struct task_struct *task) {}
+
+static inline void perf_add_cgrp_node_list(struct perf_event *event,
+					   struct perf_event_context *ctx) {}
+static inline void perf_del_cgrp_node_list(struct perf_event *event,
+					   struct perf_event_context *ctx) {}
+
+#define MAX_CGRP_NODE_HASH  1
+static inline int perf_event_attach_cgroup_node(struct perf_event *event,
+						u64 nr_cgrps, u64 *cgrp_ids)
+{
+	return -ENODEV;
+}
+
+static inline void perf_event_destroy_cgroup_nodes(struct perf_event *event) {}
+static inline void __perf_read_cgroup_node(struct perf_event *event) {}
+
+static inline int perf_event_read_cgroup_node(struct perf_event *event,
+					      u64 read_size, u64 cgrp_id,
+					      char __user *buf)
+{
+	return -EINVAL;
+}
+#endif  /* CONFIG_CGROUP_PERF */
+
 static inline struct list_head *get_event_list(struct perf_event *event)
 {
 	struct perf_event_context *ctx = event->ctx;
@@ -2407,6 +2724,7 @@ static void __perf_event_disable(struct perf_event *event,
 
 	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
 	perf_cgroup_event_disable(event, ctx);
+	perf_del_cgrp_node_list(event, ctx);
 }
 
 /*
@@ -2946,6 +3264,7 @@ static void __perf_event_enable(struct perf_event *event,
 
 	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
 	perf_cgroup_event_enable(event, ctx);
+	perf_add_cgrp_node_list(event, ctx);
 
 	if (!ctx->is_active)
 		return;
@@ -3568,6 +3887,11 @@ void __perf_event_task_sched_out(struct task_struct *task,
 	 */
 	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
 		perf_cgroup_sched_out(task, next);
+
+	if (!list_empty(this_cpu_ptr(&cgroup_ctx_list)) &&
+	    perf_cgroup_from_task(task, NULL) !=
+	    perf_cgroup_from_task(next, NULL))
+		cgroup_node_sched_out(task);
 }
 
 /*
@@ -4268,6 +4592,7 @@ static void __perf_event_read(void *info)
 
 	if (!data->group) {
 		pmu->read(event);
+		__perf_read_cgroup_node(event);
 		data->ret = 0;
 		goto unlock;
 	}
@@ -4283,6 +4608,7 @@ static void __perf_event_read(void *info)
 			 * sibling could be on different (eg: software) PMU.
 			 */
 			sub->pmu->read(sub);
+			__perf_read_cgroup_node(sub);
 		}
 	}
 
@@ -4461,6 +4787,8 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
 	INIT_LIST_HEAD(&ctx->event_list);
 	INIT_LIST_HEAD(&ctx->pinned_active);
 	INIT_LIST_HEAD(&ctx->flexible_active);
+	INIT_LIST_HEAD(&ctx->cgrp_ctx_entry);
+	INIT_LIST_HEAD(&ctx->cgrp_node_list);
 	refcount_set(&ctx->refcount, 1);
 }
 
@@ -4851,6 +5179,8 @@ static void _free_event(struct perf_event *event)
 	if (is_cgroup_event(event))
 		perf_detach_cgroup(event);
 
+	perf_event_destroy_cgroup_nodes(event);
+
 	if (!event->parent) {
 		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
 			put_callchain_buffers();
@@ -5571,6 +5901,58 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
 
 		return perf_event_modify_attr(event,  &new_attr);
 	}
+
+	case PERF_EVENT_IOC_ATTACH_CGROUP: {
+		u64 nr_cgrps;
+		u64 *cgrp_buf;
+		size_t cgrp_bufsz;
+		int ret;
+
+		if (!event_can_attach_cgroup(event))
+			return -EINVAL;
+
+		if (copy_from_user(&nr_cgrps, (u64 __user *)arg,
+				   sizeof(nr_cgrps)))
+			return -EFAULT;
+
+		if (nr_cgrps == 0 || nr_cgrps > MAX_CGRP_NODE_HASH)
+			return -EINVAL;
+
+		cgrp_bufsz = nr_cgrps * sizeof(*cgrp_buf);
+
+		cgrp_buf = kmalloc(cgrp_bufsz, GFP_KERNEL);
+		if (cgrp_buf == NULL)
+			return -ENOMEM;
+
+		if (copy_from_user(cgrp_buf, (u64 __user *)(arg + 8),
+				   cgrp_bufsz)) {
+			kfree(cgrp_buf);
+			return -EFAULT;
+		}
+
+		ret = perf_event_attach_cgroup_node(event, nr_cgrps, cgrp_buf);
+
+		kfree(cgrp_buf);
+		return ret;
+	}
+
+	case PERF_EVENT_IOC_READ_CGROUP: {
+		u64 read_size, cgrp_id;
+
+		if (!event_can_attach_cgroup(event))
+			return -EINVAL;
+
+		if (copy_from_user(&read_size, (u64 __user *)arg,
+				   sizeof(read_size)))
+			return -EFAULT;
+		if (copy_from_user(&cgrp_id, (u64 __user *)(arg + 8),
+				   sizeof(cgrp_id)))
+			return -EFAULT;
+
+		return perf_event_read_cgroup_node(event, read_size, cgrp_id,
+						   (char __user *)(arg + 16));
+	}
+
 	default:
 		return -ENOTTY;
 	}
@@ -5583,10 +5965,39 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
 	return 0;
 }
 
+static void perf_sched_enable(void)
+{
+	/*
+	 * We need the mutex here because static_branch_enable()
+	 * must complete *before* the perf_sched_count increment
+	 * becomes visible.
+	 */
+	if (atomic_inc_not_zero(&perf_sched_count))
+		return;
+
+	mutex_lock(&perf_sched_mutex);
+	if (!atomic_read(&perf_sched_count)) {
+		static_branch_enable(&perf_sched_events);
+		/*
+		 * Guarantee that all CPUs observe they key change and
+		 * call the perf scheduling hooks before proceeding to
+		 * install events that need them.
+		 */
+		synchronize_rcu();
+	}
+	/*
+	 * Now that we have waited for the sync_sched(), allow further
+	 * increments to by-pass the mutex.
+	 */
+	atomic_inc(&perf_sched_count);
+	mutex_unlock(&perf_sched_mutex);
+}
+
 static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct perf_event *event = file->private_data;
 	struct perf_event_context *ctx;
+	bool do_sched_enable = false;
 	long ret;
 
 	/* Treat ioctl like writes as it is likely a mutating operation. */
@@ -5595,9 +6006,19 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		return ret;
 
 	ctx = perf_event_ctx_lock(event);
+	/* ATTACH_CGROUP requires context switch callback */
+	if (cmd == PERF_EVENT_IOC_ATTACH_CGROUP && !event_has_cgroup_node(event))
+		do_sched_enable = true;
 	ret = _perf_ioctl(event, cmd, arg);
 	perf_event_ctx_unlock(event, ctx);
 
+	/*
+	 * Due to the circular lock dependency, it cannot call
+	 * static_branch_enable() under the ctx->mutex.
+	 */
+	if (do_sched_enable && ret >= 0)
+		perf_sched_enable();
+
 	return ret;
 }
 
@@ -11240,33 +11661,8 @@ static void account_event(struct perf_event *event)
 	if (event->attr.text_poke)
 		atomic_inc(&nr_text_poke_events);
 
-	if (inc) {
-		/*
-		 * We need the mutex here because static_branch_enable()
-		 * must complete *before* the perf_sched_count increment
-		 * becomes visible.
-		 */
-		if (atomic_inc_not_zero(&perf_sched_count))
-			goto enabled;
-
-		mutex_lock(&perf_sched_mutex);
-		if (!atomic_read(&perf_sched_count)) {
-			static_branch_enable(&perf_sched_events);
-			/*
-			 * Guarantee that all CPUs observe they key change and
-			 * call the perf scheduling hooks before proceeding to
-			 * install events that need them.
-			 */
-			synchronize_rcu();
-		}
-		/*
-		 * Now that we have waited for the sync_sched(), allow further
-		 * increments to by-pass the mutex.
-		 */
-		atomic_inc(&perf_sched_count);
-		mutex_unlock(&perf_sched_mutex);
-	}
-enabled:
+	if (inc)
+		perf_sched_enable();
 
 	account_event_cpu(event, event->cpu);
 
@@ -13008,6 +13404,7 @@ static void __init perf_event_init_all_cpus(void)
 
 #ifdef CONFIG_CGROUP_PERF
 		INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
+		INIT_LIST_HEAD(&per_cpu(cgroup_ctx_list, cpu));
 #endif
 		INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
 	}
@@ -13218,6 +13615,28 @@ static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 
+static int __perf_cgroup_update_node(void *info)
+{
+	struct task_struct *task = info;
+
+	rcu_read_lock();
+	cgroup_node_sched_out(task);
+	rcu_read_unlock();
+
+	return 0;
+}
+
+static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct cgroup_subsys_state *css;
+
+	cgroup_taskset_for_each(task, css, tset)
+		task_function_call(task, __perf_cgroup_update_node, task);
+
+	return 0;
+}
+
 static int __perf_cgroup_move(void *info)
 {
 	struct task_struct *task = info;
@@ -13240,6 +13659,7 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
 	.css_alloc	= perf_cgroup_css_alloc,
 	.css_free	= perf_cgroup_css_free,
 	.css_online	= perf_cgroup_css_online,
+	.can_attach	= perf_cgroup_can_attach,
 	.attach		= perf_cgroup_attach,
 	/*
 	 * Implicitly enable on dfl hierarchy so that perf events can
-- 
2.31.0.rc2.261.g7f71774620-goog



* [PATCH 2/2] perf/core: Support reading group events with shared cgroups
  2021-03-23 16:21 [RFC 0/2] perf core: Sharing events with multiple cgroups Namhyung Kim
  2021-03-23 16:21 ` [PATCH 1/2] perf/core: Share an event " Namhyung Kim
@ 2021-03-23 16:21 ` Namhyung Kim
  2021-03-28 17:31   ` Song Liu
  1 sibling, 1 reply; 16+ messages in thread
From: Namhyung Kim @ 2021-03-23 16:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

This enables reading an event group's counter values together with the
PERF_EVENT_IOC_READ_CGROUP command, like we do in the regular read().
Users should pass a buffer large enough for the whole group read.
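
For reference, below is a rough sketch (not part of the patch) of the
user-side buffer for a group read, assuming read_format contains
PERF_FORMAT_GROUP | PERF_FORMAT_ID | PERF_FORMAT_TOTAL_TIME_ENABLED |
PERF_FORMAT_TOTAL_TIME_RUNNING.  The first two elements are inputs as
before; the rest mirrors the regular PERF_FORMAT_GROUP read layout
filled by perf_event_read_cgrp_node_group() below.

  #include <linux/types.h>

  struct cgrp_group_read {
  	__u64	size;		/* [in]  buffer size in bytes          */
  	__u64	cgrp_id;	/* [in]  cgroup id to read             */
  	__u64	nr;		/* [out] number of events in the group */
  	__u64	time_enabled;	/* [out] cgroup's enabled time         */
  	__u64	time_running;	/* [out] cgroup's running time         */
  	struct {
  		__u64	value;	/* [out] per-cgroup counter value      */
  		__u64	id;	/* [out] primary event id              */
  	} cnt[];		/* group leader first, then siblings   */
  };

The size field must cover the event's regular read_size plus the two
leading u64 words, otherwise the ioctl returns -EINVAL.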

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 kernel/events/core.c | 119 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 116 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 38c26a23418a..3225177e54d5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2232,13 +2232,24 @@ static void perf_add_cgrp_node_list(struct perf_event *event,
 {
 	struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
 	struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+	struct perf_event *sibling;
 	bool is_first;
 
 	lockdep_assert_irqs_disabled();
 	lockdep_assert_held(&ctx->lock);
 
+	/* only group leader can be added directly */
+	if (event->group_leader != event)
+		return;
+
+	if (!event_has_cgroup_node(event))
+		return;
+
 	is_first = list_empty(&ctx->cgrp_node_list);
+
 	list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);
+	for_each_sibling_event(sibling, event)
+		list_add_tail(&sibling->cgrp_node_entry, &ctx->cgrp_node_list);
 
 	if (is_first)
 		list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);
@@ -2250,15 +2261,25 @@ static void perf_del_cgrp_node_list(struct perf_event *event,
 				    struct perf_event_context *ctx)
 {
 	struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+	struct perf_event *sibling;
 
 	lockdep_assert_irqs_disabled();
 	lockdep_assert_held(&ctx->lock);
 
+	/* only group leader can be deleted directly */
+	if (event->group_leader != event)
+		return;
+
+	if (!event_has_cgroup_node(event))
+		return;
+
 	update_cgroup_node(event, cgrp->css.cgroup);
 	/* to refresh delta when it's enabled */
 	event->cgrp_node_count = 0;
 
 	list_del(&event->cgrp_node_entry);
+	for_each_sibling_event(sibling, event)
+		list_del(&sibling->cgrp_node_entry);
 
 	if (list_empty(&ctx->cgrp_node_list))
 		list_del(&ctx->cgrp_ctx_entry);
@@ -2333,7 +2354,7 @@ static int perf_event_attach_cgroup_node(struct perf_event *event, u64 nr_cgrps,
 
 	raw_spin_unlock_irqrestore(&ctx->lock, flags);
 
-	if (is_first && enabled)
+	if (is_first && enabled && event->group_leader == event)
 		event_function_call(event, perf_attach_cgroup_node, NULL);
 
 	return 0;
@@ -2370,8 +2391,8 @@ static void __perf_read_cgroup_node(struct perf_event *event)
 	}
 }
 
-static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
-				       u64 cgrp_id, char __user *buf)
+static int perf_event_read_cgrp_node_one(struct perf_event *event, u64 cgrp_id,
+					 char __user *buf)
 {
 	struct perf_cgroup_node *cgrp;
 	struct perf_event_context *ctx = event->ctx;
@@ -2406,6 +2427,91 @@ static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
 
 	return n * sizeof(u64);
 }
+
+static int perf_event_read_cgrp_node_sibling(struct perf_event *event,
+					     u64 read_format, u64 cgrp_id,
+					     u64 *values)
+{
+	struct perf_cgroup_node *cgrp;
+	int n = 0;
+
+	cgrp = find_cgroup_node(event, cgrp_id);
+	if (cgrp == NULL)
+		return (read_format & PERF_FORMAT_ID) ? 2 : 1;
+
+	values[n++] = cgrp->count;
+	if (read_format & PERF_FORMAT_ID)
+		values[n++] = primary_event_id(event);
+	return n;
+}
+
+static int perf_event_read_cgrp_node_group(struct perf_event *event, u64 cgrp_id,
+					   char __user *buf)
+{
+	struct perf_cgroup_node *cgrp;
+	struct perf_event_context *ctx = event->ctx;
+	struct perf_event *sibling;
+	u64 read_format = event->attr.read_format;
+	unsigned long flags;
+	u64 *values;
+	int n = 1;
+	int ret;
+
+	values = kzalloc(event->read_size, GFP_KERNEL);
+	if (!values)
+		return -ENOMEM;
+
+	values[0] = 1 + event->nr_siblings;
+
+	/* update event count and times (possibly run on other cpu) */
+	(void)perf_event_read(event, true);
+
+	raw_spin_lock_irqsave(&ctx->lock, flags);
+
+	cgrp = find_cgroup_node(event, cgrp_id);
+	if (cgrp == NULL) {
+		raw_spin_unlock_irqrestore(&ctx->lock, flags);
+		kfree(values);
+		return -ENOENT;
+	}
+
+	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
+		values[n++] = cgrp->time_enabled;
+	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
+		values[n++] = cgrp->time_running;
+
+	values[n++] = cgrp->count;
+	if (read_format & PERF_FORMAT_ID)
+		values[n++] = primary_event_id(event);
+
+	for_each_sibling_event(sibling, event) {
+		n += perf_event_read_cgrp_node_sibling(sibling, read_format,
+						       cgrp_id, &values[n]);
+	}
+
+	raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+	ret = copy_to_user(buf, values, n * sizeof(u64));
+	kfree(values);
+	if (ret)
+		return -EFAULT;
+
+	return n * sizeof(u64);
+}
+
+static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
+				       u64 cgrp_id, char __user *buf)
+{
+	u64 read_format = event->attr.read_format;
+
+	if (read_size < event->read_size + 2 * sizeof(u64))
+		return -EINVAL;
+
+	if (read_format & PERF_FORMAT_GROUP)
+		return perf_event_read_cgrp_node_group(event, cgrp_id, buf);
+
+	return perf_event_read_cgrp_node_one(event, cgrp_id, buf);
+}
 #else  /* !CONFIG_CGROUP_PERF */
 static inline bool event_can_attach_cgroup(struct perf_event *event)
 {
@@ -2511,6 +2617,7 @@ static void perf_group_detach(struct perf_event *event)
 			if (sibling->state == PERF_EVENT_STATE_ACTIVE)
 				list_add_tail(&sibling->active_list, get_event_list(sibling));
 		}
+		perf_add_cgrp_node_list(sibling, event->ctx);
 
 		WARN_ON_ONCE(sibling->ctx != event->ctx);
 	}
@@ -2654,6 +2761,9 @@ __perf_remove_from_context(struct perf_event *event,
 		perf_group_detach(event);
 	list_del_event(event, ctx);
 
+	if (event->state > PERF_EVENT_STATE_OFF)
+		perf_del_cgrp_node_list(event, ctx);
+
 	if (!ctx->nr_events && ctx->is_active) {
 		ctx->is_active = 0;
 		ctx->rotate_necessary = 0;
@@ -3112,6 +3222,9 @@ static int  __perf_install_in_context(void *info)
 		reprogram = cgroup_is_descendant(cgrp->css.cgroup,
 					event->cgrp->css.cgroup);
 	}
+
+	if (event->state > PERF_EVENT_STATE_OFF)
+		perf_add_cgrp_node_list(event, ctx);
 #endif
 
 	if (reprogram) {
-- 
2.31.0.rc2.261.g7f71774620-goog



* Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups
  2021-03-23 16:21 ` [PATCH 1/2] perf/core: Share an event " Namhyung Kim
@ 2021-03-24  0:30   ` Song Liu
  2021-03-24  1:06     ` Namhyung Kim
  2021-03-25  0:55   ` Song Liu
  2021-03-28 17:17   ` Song Liu
  2 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2021-03-24  0:30 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers, Tejun Heo



> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> 
> As we can run many jobs (in container) on a big machine, we want to
> measure each job's performance during the run.  To do that, the
> perf_event can be associated to a cgroup to measure it only.
> 
> However such cgroup events need to be opened separately and it causes
> significant overhead in event multiplexing during the context switch
> as well as resource consumption like in file descriptors and memory
> footprint.
> 
> As a cgroup event is basically a cpu event, we can share a single cpu
> event for multiple cgroups.  All we need is a separate counter (and
> two timing variables) for each cgroup.  I added a hash table to map
> from cgroup id to the attached cgroups.
> 
> With this change, the cpu event needs to calculate a delta of event
> counter values when the cgroups of current and the next task are
> different.  And it attributes the delta to the current task's cgroup.
> 
> This patch adds two new ioctl commands to perf_event for light-weight
> cgroup event counting (i.e. perf stat).
> 
> * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
>     64-bit array to attach given cgroups.  The first element is a
>     number of cgroups in the buffer, and the rest is a list of cgroup
>     ids to add a cgroup info to the given event.
> 
> * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
>     array to get the event counter values.  The first element is size
>     of the array in byte, and the second element is a cgroup id to
>     read.  The rest is to save the counter value and timings.
> 
> This attaches all cgroups in a single syscall and I didn't add the
> DETACH command deliberately to make the implementation simple.  The
> attached cgroup nodes would be deleted when the file descriptor of the
> perf_event is closed.

This is a very interesting idea!

Could you please add some description of the relationship between
perf_events and contexts? The code is a little confusing. For example,
why do we need cgroup_ctx_list?

Thanks,
Song 

[...]



* Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups
  2021-03-24  0:30   ` Song Liu
@ 2021-03-24  1:06     ` Namhyung Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Namhyung Kim @ 2021-03-24  1:06 UTC (permalink / raw)
  To: Song Liu
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers, Tejun Heo

Hi Song,

On Wed, Mar 24, 2021 at 9:30 AM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Mar 23, 2021, at 9:21 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > As we can run many jobs (in container) on a big machine, we want to
> > measure each job's performance during the run.  To do that, the
> > perf_event can be associated to a cgroup to measure it only.
> >
> > However such cgroup events need to be opened separately and it causes
> > significant overhead in event multiplexing during the context switch
> > as well as resource consumption like in file descriptors and memory
> > footprint.
> >
> > As a cgroup event is basically a cpu event, we can share a single cpu
> > event for multiple cgroups.  All we need is a separate counter (and
> > two timing variables) for each cgroup.  I added a hash table to map
> > from cgroup id to the attached cgroups.
> >
> > With this change, the cpu event needs to calculate a delta of event
> > counter values when the cgroups of current and the next task are
> > different.  And it attributes the delta to the current task's cgroup.
> >
> > This patch adds two new ioctl commands to perf_event for light-weight
> > cgroup event counting (i.e. perf stat).
> >
> > * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> >     64-bit array to attach given cgroups.  The first element is a
> >     number of cgroups in the buffer, and the rest is a list of cgroup
> >     ids to add a cgroup info to the given event.
> >
> > * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> >     array to get the event counter values.  The first element is size
> >     of the array in byte, and the second element is a cgroup id to
> >     read.  The rest is to save the counter value and timings.
> >
> > This attaches all cgroups in a single syscall and I didn't add the
> > DETACH command deliberately to make the implementation simple.  The
> > attached cgroup nodes would be deleted when the file descriptor of the
> > perf_event is closed.
>
> This is very interesting idea!

Thanks!

>
> Could you please add some description of the relationship among
> perf_event and contexts? The code is a little confusing. For example,
> why do we need cgroup_ctx_list?

Sure, a perf_event belongs to an event context (hw or sw, mostly) which
takes care of multiplexing, timing, locking and so on.  So many of the
fields in the perf_event are protected by the context lock.  A context has
a list of perf_events and there are per-cpu contexts and per-task contexts.

The cgroup_ctx_list is used to traverse only the contexts (on that
cpu) that have perf_events with attached cgroups.
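
Roughly, the links look like this (just restating the data structures
in the patch, nothing new):

  per-cpu cgroup_ctx_list
    -> perf_event_context (linked via ctx->cgrp_ctx_entry)
         ctx->cgrp_node_list
           -> perf_event (linked via event->cgrp_node_entry)
                event->cgrp_node_hash[]
                  -> perf_cgroup_node (per-cgroup count and timings)

At context switch, cgroup_node_sched_out() walks only the per-cpu
cgroup_ctx_list, so contexts without such events are skipped.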

Hope this makes it clear.  Please let me know if you need more. :)

Thanks,
Namhyung


* Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups
  2021-03-23 16:21 ` [PATCH 1/2] perf/core: Share an event " Namhyung Kim
  2021-03-24  0:30   ` Song Liu
@ 2021-03-25  0:55   ` Song Liu
  2021-03-25  2:44     ` Namhyung Kim
  2021-03-25 12:57     ` Arnaldo Carvalho de Melo
  2021-03-28 17:17   ` Song Liu
  2 siblings, 2 replies; 16+ messages in thread
From: Song Liu @ 2021-03-25  0:55 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers, Tejun Heo



> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> 
> As we can run many jobs (in container) on a big machine, we want to
> measure each job's performance during the run.  To do that, the
> perf_event can be associated to a cgroup to measure it only.
> 
> However such cgroup events need to be opened separately and it causes
> significant overhead in event multiplexing during the context switch
> as well as resource consumption like in file descriptors and memory
> footprint.
> 
> As a cgroup event is basically a cpu event, we can share a single cpu
> event for multiple cgroups.  All we need is a separate counter (and
> two timing variables) for each cgroup.  I added a hash table to map
> from cgroup id to the attached cgroups.
> 
> With this change, the cpu event needs to calculate a delta of event
> counter values when the cgroups of current and the next task are
> different.  And it attributes the delta to the current task's cgroup.
> 
> This patch adds two new ioctl commands to perf_event for light-weight
> cgroup event counting (i.e. perf stat).
> 
> * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
>     64-bit array to attach given cgroups.  The first element is a
>     number of cgroups in the buffer, and the rest is a list of cgroup
>     ids to add a cgroup info to the given event.
> 
> * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
>     array to get the event counter values.  The first element is size
>     of the array in byte, and the second element is a cgroup id to
>     read.  The rest is to save the counter value and timings.
> 
> This attaches all cgroups in a single syscall and I didn't add the
> DETACH command deliberately to make the implementation simple.  The
> attached cgroup nodes would be deleted when the file descriptor of the
> perf_event is closed.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> ---
> include/linux/perf_event.h      |  22 ++
> include/uapi/linux/perf_event.h |   2 +
> kernel/events/core.c            | 474 ++++++++++++++++++++++++++++++--
> 3 files changed, 471 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 3f7f89ea5e51..2760f3b07534 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -771,6 +771,18 @@ struct perf_event {
> 
> #ifdef CONFIG_CGROUP_PERF
> 	struct perf_cgroup		*cgrp; /* cgroup event is attach to */
> +
> +	/* to share an event for multiple cgroups */
> +	struct hlist_head		*cgrp_node_hash;
> +	struct perf_cgroup_node		*cgrp_node_entries;
> +	int				nr_cgrp_nodes;
> +	int				cgrp_node_hash_bits;
> +
> +	struct list_head		cgrp_node_entry;
> +
> +	u64				cgrp_node_count;
> +	u64				cgrp_node_time_enabled;
> +	u64				cgrp_node_time_running;

A comment saying the above values are from the previous reading would be helpful.

> #endif
> 
> #ifdef CONFIG_SECURITY
> @@ -780,6 +792,14 @@ struct perf_event {
> #endif /* CONFIG_PERF_EVENTS */
> };
> 
> +struct perf_cgroup_node {
> +	struct hlist_node		node;
> +	u64				id;
> +	u64				count;
> +	u64				time_enabled;
> +	u64				time_running;
> +	u64				padding[2];

Do we really need the padding? For cache line alignment? 

> +};
> 
> struct perf_event_groups {
> 	struct rb_root	tree;
> @@ -843,6 +863,8 @@ struct perf_event_context {
> 	int				pin_count;
> #ifdef CONFIG_CGROUP_PERF
> 	int				nr_cgroups;	 /* cgroup evts */
> +	struct list_head		cgrp_node_list;
> +	struct list_head		cgrp_ctx_entry;
> #endif
> 	void				*task_ctx_data; /* pmu specific data */
> 	struct rcu_head			rcu_head;
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index ad15e40d7f5d..06bc7ab13616 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -479,6 +479,8 @@ struct perf_event_query_bpf {
> #define PERF_EVENT_IOC_PAUSE_OUTPUT		_IOW('$', 9, __u32)
> #define PERF_EVENT_IOC_QUERY_BPF		_IOWR('$', 10, struct perf_event_query_bpf *)
> #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES	_IOW('$', 11, struct perf_event_attr *)
> +#define PERF_EVENT_IOC_ATTACH_CGROUP		_IOW('$', 12, __u64 *)
> +#define PERF_EVENT_IOC_READ_CGROUP		_IOWR('$', 13, __u64 *)
> 
> enum perf_event_ioc_flags {
> 	PERF_IOC_FLAG_GROUP		= 1U << 0,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f07943183041..38c26a23418a 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -379,6 +379,7 @@ enum event_type_t {
>  * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
>  */
> 
> +static void perf_sched_enable(void);
> static void perf_sched_delayed(struct work_struct *work);
> DEFINE_STATIC_KEY_FALSE(perf_sched_events);
> static DECLARE_DELAYED_WORK(perf_sched_work, perf_sched_delayed);
> @@ -2124,6 +2125,322 @@ static int perf_get_aux_event(struct perf_event *event,
> 	return 1;
> }
> 
> +#ifdef CONFIG_CGROUP_PERF
> +static DEFINE_PER_CPU(struct list_head, cgroup_ctx_list);
> +
> +static bool event_can_attach_cgroup(struct perf_event *event)
> +{
> +	if (is_sampling_event(event))
> +		return false;
> +	if (event->attach_state & PERF_ATTACH_TASK)
> +		return false;
> +	if (is_cgroup_event(event))
> +		return false;
> +
> +	return true;
> +}
> +
> +static bool event_has_cgroup_node(struct perf_event *event)
> +{
> +	return event->nr_cgrp_nodes > 0;
> +}
> +
> +static struct perf_cgroup_node *
> +find_cgroup_node(struct perf_event *event, u64 cgrp_id)
> +{
> +	struct perf_cgroup_node *cgrp_node;
> +	int key = hash_64(cgrp_id, event->cgrp_node_hash_bits);
> +
> +	hlist_for_each_entry(cgrp_node, &event->cgrp_node_hash[key], node) {
> +		if (cgrp_node->id == cgrp_id)
> +			return cgrp_node;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void perf_update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> +{
> +	u64 delta_count, delta_time_enabled, delta_time_running;
> +	int i;
> +
> +	if (event->cgrp_node_count == 0)

Do you mean to use nr_cgrp_nodes above? 

> +		goto out;
> +
> +	delta_count = local64_read(&event->count) - event->cgrp_node_count;
> +	delta_time_enabled = event->total_time_enabled - event->cgrp_node_time_enabled;
> +	delta_time_running = event->total_time_running - event->cgrp_node_time_running;
> +
> +	/* account delta to all ancestor cgroups */
> +	for (i = 0; i <= cgrp->level; i++) {
> +		struct perf_cgroup_node *node;
> +
> +		node = find_cgroup_node(event, cgrp->ancestor_ids[i]);
> +		if (node) {
> +			node->count += delta_count;
> +			node->time_enabled += delta_time_enabled;
> +			node->time_running += delta_time_running;
> +		}
> +	}
> +
> +out:
> +	event->cgrp_node_count = local64_read(&event->count);
> +	event->cgrp_node_time_enabled = event->total_time_enabled;
> +	event->cgrp_node_time_running = event->total_time_running;
> +}
> +
> +static void update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> +{
> +	if (event->state == PERF_EVENT_STATE_ACTIVE)
> +		event->pmu->read(event);
> +
> +	perf_event_update_time(event);
> +	perf_update_cgroup_node(event, cgrp);
> +}
> +
> +/* this is called from context switch */
> +static void update_cgroup_node_events(struct perf_event_context *ctx,
> +				      struct cgroup *cgrp)
> +{
> +	struct perf_event *event;
> +
> +	lockdep_assert_held(&ctx->lock);
> +
> +	if (ctx->is_active & EVENT_TIME)
> +		update_context_time(ctx);
> +
> +	list_for_each_entry(event, &ctx->cgrp_node_list, cgrp_node_entry)
> +		update_cgroup_node(event, cgrp);
> +}
> +
> +static void cgroup_node_sched_out(struct task_struct *task)
> +{
> +	struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
> +	struct perf_cgroup *cgrp = perf_cgroup_from_task(task, NULL);
> +	struct perf_event_context *ctx;
> +
> +	list_for_each_entry(ctx, cgrp_ctx_list, cgrp_ctx_entry) {
> +		raw_spin_lock(&ctx->lock);
> +		update_cgroup_node_events(ctx, cgrp->css.cgroup);
> +		raw_spin_unlock(&ctx->lock);
> +	}
> +}
> +
> +/* this is called from the when event is enabled/disabled */

I don't think we call this when the event is disabled. 

> +static void perf_add_cgrp_node_list(struct perf_event *event,
> +				    struct perf_event_context *ctx)
> +{
> +	struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
> +	struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
> +	bool is_first;
> +
> +	lockdep_assert_irqs_disabled();
> +	lockdep_assert_held(&ctx->lock);
> +
> +	is_first = list_empty(&ctx->cgrp_node_list);
> +	list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);
> +
> +	if (is_first)
> +		list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);
> +
> +	update_cgroup_node(event, cgrp->css.cgroup);

Will this add some readings before PERF_EVENT_IOC_ATTACH_CGROUP to the counters?

> +
> }
> +
> +static void perf_del_cgrp_node_list(struct perf_event *event,
> +				    struct perf_event_context *ctx)
> +{
> +	struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
> +
> +	lockdep_assert_irqs_disabled();
> +	lockdep_assert_held(&ctx->lock);
> +
> +	update_cgroup_node(event, cgrp->css.cgroup);
> +	/* to refresh delta when it's enabled */
> +	event->cgrp_node_count = 0;
> +
> +	list_del(&event->cgrp_node_entry);
> +
> +	if (list_empty(&ctx->cgrp_node_list))
> +		list_del(&ctx->cgrp_ctx_entry);
> +}
> +
[...]
> +
> +/* this is called from ioctl */
> +static int perf_event_attach_cgroup_node(struct perf_event *event, u64 nr_cgrps,
> +					 u64 *cgroup_ids)
> +{
> +	struct perf_cgroup_node *cgrp_node;
> +	struct perf_event_context *ctx = event->ctx;
> +	struct hlist_head *cgrp_node_hash;
> +	int node = (event->cpu >= 0) ? cpu_to_node(event->cpu) : -1;
> +	unsigned long flags;
> +	bool is_first = true;
> +	bool enabled;
> +	int i, nr_hash;
> +	int hash_bits;
> +
> +	if (nr_cgrps < MIN_CGRP_NODE_HASH)
> +		nr_hash = MIN_CGRP_NODE_HASH;
> +	else
> +		nr_hash = roundup_pow_of_two(nr_cgrps);
> +	hash_bits = ilog2(nr_hash);
> +
> +	cgrp_node_hash = kcalloc_node(nr_hash, sizeof(*cgrp_node_hash),
> +				      GFP_KERNEL, node);
> +	if (cgrp_node_hash == NULL)
> +		return -ENOMEM;
> +
> +	cgrp_node = kcalloc_node(nr_cgrps, sizeof(*cgrp_node), GFP_KERNEL, node);
> +	if (cgrp_node == NULL) {
> +		kfree(cgrp_node_hash);
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < (int)nr_cgrps; i++) {
> +		int key = hash_64(cgroup_ids[i], hash_bits);
> +
> +		cgrp_node[i].id = cgroup_ids[i];
> +		hlist_add_head(&cgrp_node[i].node, &cgrp_node_hash[key]);
> +	}
> +
> +	raw_spin_lock_irqsave(&ctx->lock, flags);
> +
> +	enabled = event->state >= PERF_EVENT_STATE_INACTIVE;
> +
> +	if (event->nr_cgrp_nodes != 0) {
> +		kfree(event->cgrp_node_hash);
> +		kfree(event->cgrp_node_entries);
> +		is_first = false;
> +	}

To add another cgroup to the list, we use PERF_EVENT_IOC_ATTACH_CGROUP to
replace the whole list.  So we may lose some readings during this, right?

> +
> +	event->cgrp_node_hash = cgrp_node_hash;
> +	event->cgrp_node_entries = cgrp_node;
> +	event->cgrp_node_hash_bits = hash_bits;
> +	event->nr_cgrp_nodes = nr_cgrps;
> +
> +	raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +
> +	if (is_first && enabled)
> +		event_function_call(event, perf_attach_cgroup_node, NULL);
> +
> +	return 0;
> +}

[...]
> 



* Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups
  2021-03-25  0:55   ` Song Liu
@ 2021-03-25  2:44     ` Namhyung Kim
  2021-03-25 12:57     ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 16+ messages in thread
From: Namhyung Kim @ 2021-03-25  2:44 UTC (permalink / raw)
  To: Song Liu
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers, Tejun Heo

Hi Song,

Thanks for your review!

On Thu, Mar 25, 2021 at 9:56 AM Song Liu <songliubraving@fb.com> wrote:
> > On Mar 23, 2021, at 9:21 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > As we can run many jobs (in container) on a big machine, we want to
> > measure each job's performance during the run.  To do that, the
> > perf_event can be associated to a cgroup to measure it only.
> >
> > However such cgroup events need to be opened separately and it causes
> > significant overhead in event multiplexing during the context switch
> > as well as resource consumption like in file descriptors and memory
> > footprint.
> >
> > As a cgroup event is basically a cpu event, we can share a single cpu
> > event for multiple cgroups.  All we need is a separate counter (and
> > two timing variables) for each cgroup.  I added a hash table to map
> > from cgroup id to the attached cgroups.
> >
> > With this change, the cpu event needs to calculate a delta of event
> > counter values when the cgroups of current and the next task are
> > different.  And it attributes the delta to the current task's cgroup.
> >
> > This patch adds two new ioctl commands to perf_event for light-weight
> > cgroup event counting (i.e. perf stat).
> >
> > * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> >     64-bit array to attach given cgroups.  The first element is a
> >     number of cgroups in the buffer, and the rest is a list of cgroup
> >     ids to add a cgroup info to the given event.
> >
> > * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> >     array to get the event counter values.  The first element is size
> >     of the array in byte, and the second element is a cgroup id to
> >     read.  The rest is to save the counter value and timings.
> >
> > This attaches all cgroups in a single syscall and I didn't add the
> > DETACH command deliberately to make the implementation simple.  The
> > attached cgroup nodes would be deleted when the file descriptor of the
> > perf_event is closed.
> >
> > Cc: Tejun Heo <tj@kernel.org>
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > ---
> > include/linux/perf_event.h      |  22 ++
> > include/uapi/linux/perf_event.h |   2 +
> > kernel/events/core.c            | 474 ++++++++++++++++++++++++++++++--
> > 3 files changed, 471 insertions(+), 27 deletions(-)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 3f7f89ea5e51..2760f3b07534 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -771,6 +771,18 @@ struct perf_event {
> >
> > #ifdef CONFIG_CGROUP_PERF
> >       struct perf_cgroup              *cgrp; /* cgroup event is attach to */
> > +
> > +     /* to share an event for multiple cgroups */
> > +     struct hlist_head               *cgrp_node_hash;
> > +     struct perf_cgroup_node         *cgrp_node_entries;
> > +     int                             nr_cgrp_nodes;
> > +     int                             cgrp_node_hash_bits;
> > +
> > +     struct list_head                cgrp_node_entry;
> > +
> > +     u64                             cgrp_node_count;
> > +     u64                             cgrp_node_time_enabled;
> > +     u64                             cgrp_node_time_running;
>
> A comment saying the above values are from previous reading would be helpful.

Sure, will add.

>
> > #endif
> >
> > #ifdef CONFIG_SECURITY
> > @@ -780,6 +792,14 @@ struct perf_event {
> > #endif /* CONFIG_PERF_EVENTS */
> > };
> >
> > +struct perf_cgroup_node {
> > +     struct hlist_node               node;
> > +     u64                             id;
> > +     u64                             count;
> > +     u64                             time_enabled;
> > +     u64                             time_running;
> > +     u64                             padding[2];
>
> Do we really need the padding? For cache line alignment?

Yeah I was thinking about it.  It seems I need to use the
___cacheline_aligned macro instead.

>
> > +};
> >
> > struct perf_event_groups {
> >       struct rb_root  tree;
> > @@ -843,6 +863,8 @@ struct perf_event_context {
> >       int                             pin_count;
> > #ifdef CONFIG_CGROUP_PERF
> >       int                             nr_cgroups;      /* cgroup evts */
> > +     struct list_head                cgrp_node_list;
> > +     struct list_head                cgrp_ctx_entry;
> > #endif
> >       void                            *task_ctx_data; /* pmu specific data */
> >       struct rcu_head                 rcu_head;
> > diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> > index ad15e40d7f5d..06bc7ab13616 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -479,6 +479,8 @@ struct perf_event_query_bpf {
> > #define PERF_EVENT_IOC_PAUSE_OUTPUT           _IOW('$', 9, __u32)
> > #define PERF_EVENT_IOC_QUERY_BPF              _IOWR('$', 10, struct perf_event_query_bpf *)
> > #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES      _IOW('$', 11, struct perf_event_attr *)
> > +#define PERF_EVENT_IOC_ATTACH_CGROUP         _IOW('$', 12, __u64 *)
> > +#define PERF_EVENT_IOC_READ_CGROUP           _IOWR('$', 13, __u64 *)
> >
> > enum perf_event_ioc_flags {
> >       PERF_IOC_FLAG_GROUP             = 1U << 0,
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index f07943183041..38c26a23418a 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -379,6 +379,7 @@ enum event_type_t {
> >  * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
> >  */
> >
> > +static void perf_sched_enable(void);
> > static void perf_sched_delayed(struct work_struct *work);
> > DEFINE_STATIC_KEY_FALSE(perf_sched_events);
> > static DECLARE_DELAYED_WORK(perf_sched_work, perf_sched_delayed);
> > @@ -2124,6 +2125,322 @@ static int perf_get_aux_event(struct perf_event *event,
> >       return 1;
> > }
> >
> > +#ifdef CONFIG_CGROUP_PERF
> > +static DEFINE_PER_CPU(struct list_head, cgroup_ctx_list);
> > +
> > +static bool event_can_attach_cgroup(struct perf_event *event)
> > +{
> > +     if (is_sampling_event(event))
> > +             return false;
> > +     if (event->attach_state & PERF_ATTACH_TASK)
> > +             return false;
> > +     if (is_cgroup_event(event))
> > +             return false;
> > +
> > +     return true;
> > +}
> > +
> > +static bool event_has_cgroup_node(struct perf_event *event)
> > +{
> > +     return event->nr_cgrp_nodes > 0;
> > +}
> > +
> > +static struct perf_cgroup_node *
> > +find_cgroup_node(struct perf_event *event, u64 cgrp_id)
> > +{
> > +     struct perf_cgroup_node *cgrp_node;
> > +     int key = hash_64(cgrp_id, event->cgrp_node_hash_bits);
> > +
> > +     hlist_for_each_entry(cgrp_node, &event->cgrp_node_hash[key], node) {
> > +             if (cgrp_node->id == cgrp_id)
> > +                     return cgrp_node;
> > +     }
> > +
> > +     return NULL;
> > +}
> > +
> > +static void perf_update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> > +{
> > +     u64 delta_count, delta_time_enabled, delta_time_running;
> > +     int i;
> > +
> > +     if (event->cgrp_node_count == 0)
>
> Do you mean to use nr_cgrp_nodes above?

No, this is to calculate the delta, so it needs to be set first.
If it's the first call, it just updates the count and time and
skips the delta accounting.

>
> > +             goto out;
> > +
> > +     delta_count = local64_read(&event->count) - event->cgrp_node_count;
> > +     delta_time_enabled = event->total_time_enabled - event->cgrp_node_time_enabled;
> > +     delta_time_running = event->total_time_running - event->cgrp_node_time_running;
> > +
> > +     /* account delta to all ancestor cgroups */
> > +     for (i = 0; i <= cgrp->level; i++) {
> > +             struct perf_cgroup_node *node;
> > +
> > +             node = find_cgroup_node(event, cgrp->ancestor_ids[i]);
> > +             if (node) {
> > +                     node->count += delta_count;
> > +                     node->time_enabled += delta_time_enabled;
> > +                     node->time_running += delta_time_running;
> > +             }
> > +     }
> > +
> > +out:
> > +     event->cgrp_node_count = local64_read(&event->count);
> > +     event->cgrp_node_time_enabled = event->total_time_enabled;
> > +     event->cgrp_node_time_running = event->total_time_running;
> > +}
> > +
> > +static void update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> > +{
> > +     if (event->state == PERF_EVENT_STATE_ACTIVE)
> > +             event->pmu->read(event);
> > +
> > +     perf_event_update_time(event);
> > +     perf_update_cgroup_node(event, cgrp);
> > +}
> > +
> > +/* this is called from context switch */
> > +static void update_cgroup_node_events(struct perf_event_context *ctx,
> > +                                   struct cgroup *cgrp)
> > +{
> > +     struct perf_event *event;
> > +
> > +     lockdep_assert_held(&ctx->lock);
> > +
> > +     if (ctx->is_active & EVENT_TIME)
> > +             update_context_time(ctx);
> > +
> > +     list_for_each_entry(event, &ctx->cgrp_node_list, cgrp_node_entry)
> > +             update_cgroup_node(event, cgrp);
> > +}
> > +
> > +static void cgroup_node_sched_out(struct task_struct *task)
> > +{
> > +     struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
> > +     struct perf_cgroup *cgrp = perf_cgroup_from_task(task, NULL);
> > +     struct perf_event_context *ctx;
> > +
> > +     list_for_each_entry(ctx, cgrp_ctx_list, cgrp_ctx_entry) {
> > +             raw_spin_lock(&ctx->lock);
> > +             update_cgroup_node_events(ctx, cgrp->css.cgroup);
> > +             raw_spin_unlock(&ctx->lock);
> > +     }
> > +}
> > +
> > +/* this is called from the when event is enabled/disabled */
>
> I don't think we call this when the event is disabled.

Oh, sorry.  I meant 'add' for enable and 'del' for disable.
Maybe I can change it to 'these are called from ...'.

>
> > +static void perf_add_cgrp_node_list(struct perf_event *event,
> > +                                 struct perf_event_context *ctx)
> > +{
> > +     struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
> > +     struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
> > +     bool is_first;
> > +
> > +     lockdep_assert_irqs_disabled();
> > +     lockdep_assert_held(&ctx->lock);
> > +
> > +     is_first = list_empty(&ctx->cgrp_node_list);
> > +     list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);
> > +
> > +     if (is_first)
> > +             list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);
> > +
> > +     update_cgroup_node(event, cgrp->css.cgroup);
>
> Will this add some readings before PERF_EVENT_IOC_ATTACH_CGROUP to the counters?

At this moment, the event is just enabled so the cgrp_node_count
is 0 like I said above.  So it'll update the timestamp and count in
the event but won't update the cgroup nodes.

>
> > +
> > }
> > +
> > +static void perf_del_cgrp_node_list(struct perf_event *event,
> > +                                 struct perf_event_context *ctx)
> > +{
> > +     struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
> > +
> > +     lockdep_assert_irqs_disabled();
> > +     lockdep_assert_held(&ctx->lock);
> > +
> > +     update_cgroup_node(event, cgrp->css.cgroup);
> > +     /* to refresh delta when it's enabled */
> > +     event->cgrp_node_count = 0;
> > +
> > +     list_del(&event->cgrp_node_entry);
> > +
> > +     if (list_empty(&ctx->cgrp_node_list))
> > +             list_del(&ctx->cgrp_ctx_entry);
> > +}
> > +
> [...]
> > +
> > +/* this is called from ioctl */
> > +static int perf_event_attach_cgroup_node(struct perf_event *event, u64 nr_cgrps,
> > +                                      u64 *cgroup_ids)
> > +{
> > +     struct perf_cgroup_node *cgrp_node;
> > +     struct perf_event_context *ctx = event->ctx;
> > +     struct hlist_head *cgrp_node_hash;
> > +     int node = (event->cpu >= 0) ? cpu_to_node(event->cpu) : -1;
> > +     unsigned long flags;
> > +     bool is_first = true;
> > +     bool enabled;
> > +     int i, nr_hash;
> > +     int hash_bits;
> > +
> > +     if (nr_cgrps < MIN_CGRP_NODE_HASH)
> > +             nr_hash = MIN_CGRP_NODE_HASH;
> > +     else
> > +             nr_hash = roundup_pow_of_two(nr_cgrps);
> > +     hash_bits = ilog2(nr_hash);
> > +
> > +     cgrp_node_hash = kcalloc_node(nr_hash, sizeof(*cgrp_node_hash),
> > +                                   GFP_KERNEL, node);
> > +     if (cgrp_node_hash == NULL)
> > +             return -ENOMEM;
> > +
> > +     cgrp_node = kcalloc_node(nr_cgrps, sizeof(*cgrp_node), GFP_KERNEL, node);
> > +     if (cgrp_node == NULL) {
> > +             kfree(cgrp_node_hash);
> > +             return -ENOMEM;
> > +     }
> > +
> > +     for (i = 0; i < (int)nr_cgrps; i++) {
> > +             int key = hash_64(cgroup_ids[i], hash_bits);
> > +
> > +             cgrp_node[i].id = cgroup_ids[i];
> > +             hlist_add_head(&cgrp_node[i].node, &cgrp_node_hash[key]);
> > +     }
> > +
> > +     raw_spin_lock_irqsave(&ctx->lock, flags);
> > +
> > +     enabled = event->state >= PERF_EVENT_STATE_INACTIVE;
> > +
> > +     if (event->nr_cgrp_nodes != 0) {
> > +             kfree(event->cgrp_node_hash);
> > +             kfree(event->cgrp_node_entries);
> > +             is_first = false;
> > +     }
>
> To add another cgroup to the list, we use PERF_EVENT_IOC_ATTACH_CGROUP to
> redo the whole list.  So we may lose some readings during this, right?

So the basic use case is perf stat, which has the list of all cgroups
to measure when it calls the ioctl.  Then it creates all the nodes in
the table at once and sets it up.

If someone wants to measure more cgroups, [s]he can call the ioctl
again with the updated cgroup list (original + new).
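
Just for illustration, here is a minimal userspace sketch of the
attach side, assuming the ioctl number from this patch (the
attach_cgroups() helper is a made-up name, and how the 64-bit cgroup
ids are obtained is left to the tool):

#include <linux/perf_event.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

/* from this patch; not in the released uapi header yet */
#ifndef PERF_EVENT_IOC_ATTACH_CGROUP
#define PERF_EVENT_IOC_ATTACH_CGROUP	_IOW('$', 12, __u64 *)
#endif

/*
 * Attach a list of cgroup ids to an already opened cpu event
 * (perf_event_open() with pid == -1, cpu >= 0).
 * Buffer layout, per the patch description: buf[0] is the number of
 * cgroup ids that follow, buf[1..n] are the 64-bit cgroup ids.
 */
static int attach_cgroups(int event_fd, const __u64 *cgrp_ids, __u64 nr)
{
	__u64 *buf;
	int ret;

	buf = calloc(nr + 1, sizeof(*buf));
	if (buf == NULL)
		return -1;

	buf[0] = nr;
	memcpy(&buf[1], cgrp_ids, nr * sizeof(*cgrp_ids));

	ret = ioctl(event_fd, PERF_EVENT_IOC_ATTACH_CGROUP, buf);
	free(buf);
	return ret;
}

Calling it again later with the original + new ids just rebuilds the
hash table as described above.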

Thanks,
Namhyung

>
> > +
> > +     event->cgrp_node_hash = cgrp_node_hash;
> > +     event->cgrp_node_entries = cgrp_node;
> > +     event->cgrp_node_hash_bits = hash_bits;
> > +     event->nr_cgrp_nodes = nr_cgrps;
> > +
> > +     raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > +
> > +     if (is_first && enabled)
> > +             event_function_call(event, perf_attach_cgroup_node, NULL);
> > +
> > +     return 0;
> > +}
>
> [...]
> >
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups
  2021-03-25  0:55   ` Song Liu
  2021-03-25  2:44     ` Namhyung Kim
@ 2021-03-25 12:57     ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 16+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-25 12:57 UTC (permalink / raw)
  To: Song Liu
  Cc: Namhyung Kim, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers, Tejun Heo

On Thu, Mar 25, 2021 at 12:55:50AM +0000, Song Liu wrote:
> > On Mar 23, 2021, at 9:21 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> > #ifdef CONFIG_SECURITY
> > @@ -780,6 +792,14 @@ struct perf_event {
> > #endif /* CONFIG_PERF_EVENTS */
> > };

> > +struct perf_cgroup_node {
> > +	struct hlist_node		node;
> > +	u64				id;
> > +	u64				count;
> > +	u64				time_enabled;
> > +	u64				time_running;
> > +	u64				padding[2];
> 
> Do we really need the padding? For cache line alignment? 

I guess so, to get it to 64 bytes, then having it as:

struct perf_cgroup_node {
	struct hlist_node		node;
	u64				id;
	u64				count;
	u64				time_enabled;
	u64				time_running;
} ____cacheline_aligned;

Seems better :-)

Testing:

[acme@five c]$ cat cacheline_aligned.c
#ifndef ____cacheline_aligned
#define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))
#endif

// from ../build/v5.12.0-rc4+/include/generated/autoconf.h
#define CONFIG_X86_L1_CACHE_SHIFT 6

#define L1_CACHE_SHIFT  (CONFIG_X86_L1_CACHE_SHIFT)
#define L1_CACHE_BYTES  (1 << L1_CACHE_SHIFT)

#ifndef SMP_CACHE_BYTES
#define SMP_CACHE_BYTES L1_CACHE_BYTES
#endif

typedef long long unsigned int u64;

struct hlist_node {
	struct hlist_node *        next;                 /*     0     8 */
	struct hlist_node * *      pprev;                /*     8     8 */

	/* size: 16, cachelines: 1, members: 2 */
	/* last cacheline: 16 bytes */
};

struct perf_cgroup_node {
        struct hlist_node               node;
        u64                             id;
        u64                             count;
        u64                             time_enabled;
        u64                             time_running;
} ____cacheline_aligned foo;

[acme@five c]$ cc  -g  -c -o cacheline_aligned.o cacheline_aligned.c
[acme@five c]$ pahole cacheline_aligned.o
struct hlist_node {
	struct hlist_node *        next;                 /*     0     8 */
	struct hlist_node * *      pprev;                /*     8     8 */

	/* size: 16, cachelines: 1, members: 2 */
	/* last cacheline: 16 bytes */
};
struct perf_cgroup_node {
	struct hlist_node          node;                 /*     0    16 */
	u64                        id;                   /*    16     8 */
	u64                        count;                /*    24     8 */
	u64                        time_enabled;         /*    32     8 */
	u64                        time_running;         /*    40     8 */

	/* size: 64, cachelines: 1, members: 5 */
	/* padding: 16 */
} __attribute__((__aligned__(64)));
[acme@five c]$

- Arnaldo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups
  2021-03-23 16:21 ` [PATCH 1/2] perf/core: Share an event " Namhyung Kim
  2021-03-24  0:30   ` Song Liu
  2021-03-25  0:55   ` Song Liu
@ 2021-03-28 17:17   ` Song Liu
  2021-03-29 11:33     ` Namhyung Kim
  2 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2021-03-28 17:17 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers, Tejun Heo



> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> 
> As we can run many jobs (in container) on a big machine, we want to
> measure each job's performance during the run.  To do that, the
> perf_event can be associated to a cgroup to measure it only.
> 
> However such cgroup events need to be opened separately and it causes
> significant overhead in event multiplexing during the context switch
> as well as resource consumption like in file descriptors and memory
> footprint.
> 
> As a cgroup event is basically a cpu event, we can share a single cpu
> event for multiple cgroups.  All we need is a separate counter (and
> two timing variables) for each cgroup.  I added a hash table to map
> from cgroup id to the attached cgroups.
> 
> With this change, the cpu event needs to calculate a delta of event
> counter values when the cgroups of current and the next task are
> different.  And it attributes the delta to the current task's cgroup.
> 
> This patch adds two new ioctl commands to perf_event for light-weight
> cgroup event counting (i.e. perf stat).
> 
> * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consisting of a
>     64-bit array to attach the given cgroups.  The first element is
>     the number of cgroups in the buffer, and the rest is a list of
>     cgroup ids to add cgroup info to the given event.
> 
> * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consisting of a
>     64-bit array to get the event counter values.  The first element
>     is the size of the array in bytes, and the second element is the
>     cgroup id to read.  The rest is used to save the counter value
>     and timings.
> 
> This attaches all cgroups in a single syscall, and I deliberately
> didn't add a DETACH command to keep the implementation simple.  The
> attached cgroup nodes are deleted when the file descriptor of the
> perf_event is closed.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> ---
> include/linux/perf_event.h      |  22 ++
> include/uapi/linux/perf_event.h |   2 +
> kernel/events/core.c            | 474 ++++++++++++++++++++++++++++++--
> 3 files changed, 471 insertions(+), 27 deletions(-)

[...]

> @@ -4461,6 +4787,8 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
> 	INIT_LIST_HEAD(&ctx->event_list);
> 	INIT_LIST_HEAD(&ctx->pinned_active);
> 	INIT_LIST_HEAD(&ctx->flexible_active);
> +	INIT_LIST_HEAD(&ctx->cgrp_ctx_entry);
> +	INIT_LIST_HEAD(&ctx->cgrp_node_list);

I guess we need ifdef CONFIG_CGROUP_PERF here?

> 	refcount_set(&ctx->refcount, 1);
> }
> 
> @@ -4851,6 +5179,8 @@ static void _free_event(struct perf_event *event)
> 	if (is_cgroup_event(event))
> 		perf_detach_cgroup(event);
> 
> +	perf_event_destroy_cgroup_nodes(event);
> +
> 	if (!event->parent) {
> 		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
> 			put_callchain_buffers();

[...]

> +static void perf_sched_enable(void)
> +{
> +	/*
> +	 * We need the mutex here because static_branch_enable()
> +	 * must complete *before* the perf_sched_count increment
> +	 * becomes visible.
> +	 */
> +	if (atomic_inc_not_zero(&perf_sched_count))
> +		return;

Why don't we use perf_cgroup_events for the new use case? 

> +
> +	mutex_lock(&perf_sched_mutex);
> +	if (!atomic_read(&perf_sched_count)) {
> +		static_branch_enable(&perf_sched_events);
> +		/*
> +		 * Guarantee that all CPUs observe they key change and
> +		 * call the perf scheduling hooks before proceeding to
> +		 * install events that need them.
> +		 */
> +		synchronize_rcu();
> +	}
> +	/*
> +	 * Now that we have waited for the sync_sched(), allow further
> +	 * increments to by-pass the mutex.
> +	 */
> +	atomic_inc(&perf_sched_count);
> +	mutex_unlock(&perf_sched_mutex);
> +}
> +
> static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> {
> 	struct perf_event *event = file->private_data;
> 	struct perf_event_context *ctx;
> +	bool do_sched_enable = false;
> 	long ret;
> 
> 	/* Treat ioctl like writes as it is likely a mutating operation. */
> @@ -5595,9 +6006,19 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> 		return ret;
> 
> 	ctx = perf_event_ctx_lock(event);
> +	/* ATTACH_CGROUP requires context switch callback */
> +	if (cmd == PERF_EVENT_IOC_ATTACH_CGROUP && !event_has_cgroup_node(event))
> +		do_sched_enable = true;
> 	ret = _perf_ioctl(event, cmd, arg);
> 	perf_event_ctx_unlock(event, ctx);
> 
> +	/*
> +	 * Due to the circular lock dependency, it cannot call
> +	 * static_branch_enable() under the ctx->mutex.
> +	 */
> +	if (do_sched_enable && ret >= 0)
> +		perf_sched_enable();
> +
> 	return ret;
> }
> 
> @@ -11240,33 +11661,8 @@ static void account_event(struct perf_event *event)
> 	if (event->attr.text_poke)
> 		atomic_inc(&nr_text_poke_events);
> 
> -	if (inc) {
> -		/*
> -		 * We need the mutex here because static_branch_enable()
> -		 * must complete *before* the perf_sched_count increment
> -		 * becomes visible.
> -		 */
> -		if (atomic_inc_not_zero(&perf_sched_count))
> -			goto enabled;
> -
> -		mutex_lock(&perf_sched_mutex);
> -		if (!atomic_read(&perf_sched_count)) {
> -			static_branch_enable(&perf_sched_events);
> -			/*
> -			 * Guarantee that all CPUs observe they key change and
> -			 * call the perf scheduling hooks before proceeding to
> -			 * install events that need them.
> -			 */
> -			synchronize_rcu();
> -		}
> -		/*
> -		 * Now that we have waited for the sync_sched(), allow further
> -		 * increments to by-pass the mutex.
> -		 */
> -		atomic_inc(&perf_sched_count);
> -		mutex_unlock(&perf_sched_mutex);
> -	}
> -enabled:
> +	if (inc)
> +		perf_sched_enable();
> 
> 	account_event_cpu(event, event->cpu);
> 
> @@ -13008,6 +13404,7 @@ static void __init perf_event_init_all_cpus(void)
> 
> #ifdef CONFIG_CGROUP_PERF
> 		INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
> +		INIT_LIST_HEAD(&per_cpu(cgroup_ctx_list, cpu));
> #endif
> 		INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
> 	}
> @@ -13218,6 +13615,28 @@ static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
> 	return 0;
> }
> 
> +static int __perf_cgroup_update_node(void *info)
> +{
> +	struct task_struct *task = info;
> +
> +	rcu_read_lock();
> +	cgroup_node_sched_out(task);
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +
> +static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
> +{
> +	struct task_struct *task;
> +	struct cgroup_subsys_state *css;
> +
> +	cgroup_taskset_for_each(task, css, tset)
> +		task_function_call(task, __perf_cgroup_update_node, task);
> +
> +	return 0;
> +}

Could you please explain why we need this logic in can_attach? 

Thanks,
Song

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] perf/core: Support reading group events with shared cgroups
  2021-03-23 16:21 ` [PATCH 2/2] perf/core: Support reading group events with shared cgroups Namhyung Kim
@ 2021-03-28 17:31   ` Song Liu
  2021-03-29 11:36     ` Namhyung Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2021-03-28 17:31 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers



> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> 
> This enables reading event group's counter values together with a
> PERF_EVENT_IOC_READ_CGROUP command like we do in the regular read().
> Users should give a correct size of buffer to be read.
> 
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> ---
> kernel/events/core.c | 119 +++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 116 insertions(+), 3 deletions(-)
> 

[...]

> +}
> +
> +static int perf_event_read_cgrp_node_group(struct perf_event *event, u64 cgrp_id,
> +					   char __user *buf)
> +{
> +	struct perf_cgroup_node *cgrp;
> +	struct perf_event_context *ctx = event->ctx;
> +	struct perf_event *sibling;
> +	u64 read_format = event->attr.read_format;
> +	unsigned long flags;
> +	u64 *values;
> +	int n = 1;
> +	int ret;
> +
> +	values = kzalloc(event->read_size, GFP_KERNEL);
> +	if (!values)
> +		return -ENOMEM;
> +
> +	values[0] = 1 + event->nr_siblings;
> +
> +	/* update event count and times (possibly run on other cpu) */
> +	(void)perf_event_read(event, true);
> +
> +	raw_spin_lock_irqsave(&ctx->lock, flags);
> +
> +	cgrp = find_cgroup_node(event, cgrp_id);
> +	if (cgrp == NULL) {
> +		raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +		kfree(values);
> +		return -ENOENT;
> +	}
> +
> +	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
> +		values[n++] = cgrp->time_enabled;
> +	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> +		values[n++] = cgrp->time_running;
> +
> +	values[n++] = cgrp->count;
> +	if (read_format & PERF_FORMAT_ID)
> +		values[n++] = primary_event_id(event);
> +
> +	for_each_sibling_event(sibling, event) {
> +		n += perf_event_read_cgrp_node_sibling(sibling, read_format,
> +						       cgrp_id, &values[n]);
> +	}
> +
> +	raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +
> +	ret = copy_to_user(buf, values, n * sizeof(u64));
> +	kfree(values);
> +	if (ret)
> +		return -EFAULT;
> +
> +	return n * sizeof(u64);
> +}
> +
> +static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
> +				       u64 cgrp_id, char __user *buf)
> +{
> +	u64 read_format = event->attr.read_format;
> +
> +	if (read_size < event->read_size + 2 * sizeof(u64))

Why do we need read_size + 2 u64 here? 

Thanks,
Song

[...]


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups
  2021-03-28 17:17   ` Song Liu
@ 2021-03-29 11:33     ` Namhyung Kim
  2021-03-30  6:33       ` Song Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Namhyung Kim @ 2021-03-29 11:33 UTC (permalink / raw)
  To: Song Liu
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers, Tejun Heo

On Mon, Mar 29, 2021 at 2:17 AM Song Liu <songliubraving@fb.com> wrote:
> > On Mar 23, 2021, at 9:21 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > As we can run many jobs (in container) on a big machine, we want to
> > measure each job's performance during the run.  To do that, the
> > perf_event can be associated to a cgroup to measure it only.
> >
> > However such cgroup events need to be opened separately and it causes
> > significant overhead in event multiplexing during the context switch
> > as well as resource consumption like in file descriptors and memory
> > footprint.
> >
> > As a cgroup event is basically a cpu event, we can share a single cpu
> > event for multiple cgroups.  All we need is a separate counter (and
> > two timing variables) for each cgroup.  I added a hash table to map
> > from cgroup id to the attached cgroups.
> >
> > With this change, the cpu event needs to calculate a delta of event
> > counter values when the cgroups of current and the next task are
> > different.  And it attributes the delta to the current task's cgroup.
> >
> > This patch adds two new ioctl commands to perf_event for light-weight
> > cgroup event counting (i.e. perf stat).
> >
> > * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consisting of a
> >     64-bit array to attach the given cgroups.  The first element is
> >     the number of cgroups in the buffer, and the rest is a list of
> >     cgroup ids to add cgroup info to the given event.
> >
> > * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consisting of a
> >     64-bit array to get the event counter values.  The first element
> >     is the size of the array in bytes, and the second element is the
> >     cgroup id to read.  The rest is used to save the counter value
> >     and timings.
> >
> > This attaches all cgroups in a single syscall, and I deliberately
> > didn't add a DETACH command to keep the implementation simple.  The
> > attached cgroup nodes are deleted when the file descriptor of the
> > perf_event is closed.
> >
> > Cc: Tejun Heo <tj@kernel.org>
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > ---
> > include/linux/perf_event.h      |  22 ++
> > include/uapi/linux/perf_event.h |   2 +
> > kernel/events/core.c            | 474 ++++++++++++++++++++++++++++++--
> > 3 files changed, 471 insertions(+), 27 deletions(-)
>
> [...]
>
> > @@ -4461,6 +4787,8 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
> >       INIT_LIST_HEAD(&ctx->event_list);
> >       INIT_LIST_HEAD(&ctx->pinned_active);
> >       INIT_LIST_HEAD(&ctx->flexible_active);
> > +     INIT_LIST_HEAD(&ctx->cgrp_ctx_entry);
> > +     INIT_LIST_HEAD(&ctx->cgrp_node_list);
>
> I guess we need ifdef CONFIG_CGROUP_PERF here?

Correct.  Thanks for pointing that out.

>
> >       refcount_set(&ctx->refcount, 1);
> > }
> >
> > @@ -4851,6 +5179,8 @@ static void _free_event(struct perf_event *event)
> >       if (is_cgroup_event(event))
> >               perf_detach_cgroup(event);
> >
> > +     perf_event_destroy_cgroup_nodes(event);
> > +
> >       if (!event->parent) {
> >               if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
> >                       put_callchain_buffers();
>
> [...]
>
> > +static void perf_sched_enable(void)
> > +{
> > +     /*
> > +      * We need the mutex here because static_branch_enable()
> > +      * must complete *before* the perf_sched_count increment
> > +      * becomes visible.
> > +      */
> > +     if (atomic_inc_not_zero(&perf_sched_count))
> > +             return;
>
> Why don't we use perf_cgroup_events for the new use case?

Maybe.  The two methods are mutually exclusive, and I think this one
will be preferred in the future due to its lower overhead.  I'd also
like to keep it separate from the existing code to avoid possible
confusion.

For perf_sched_enable(), the difference between the existing cgroup
events and this approach is when the function above gets called.
Usually it is called during account_event(), which is part of event
initialization.  But this approach calls it after the event has been
created.  That's why I have the do_sched_enable variable in
perf_ioctl() below, to ensure it's called exactly once for each event.


>
> > +
> > +     mutex_lock(&perf_sched_mutex);
> > +     if (!atomic_read(&perf_sched_count)) {
> > +             static_branch_enable(&perf_sched_events);
> > +             /*
> > +              * Guarantee that all CPUs observe they key change and
> > +              * call the perf scheduling hooks before proceeding to
> > +              * install events that need them.
> > +              */
> > +             synchronize_rcu();
> > +     }
> > +     /*
> > +      * Now that we have waited for the sync_sched(), allow further
> > +      * increments to by-pass the mutex.
> > +      */
> > +     atomic_inc(&perf_sched_count);
> > +     mutex_unlock(&perf_sched_mutex);
> > +}
> > +
> > static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> > {
> >       struct perf_event *event = file->private_data;
> >       struct perf_event_context *ctx;
> > +     bool do_sched_enable = false;
> >       long ret;
> >
> >       /* Treat ioctl like writes as it is likely a mutating operation. */
> > @@ -5595,9 +6006,19 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> >               return ret;
> >
> >       ctx = perf_event_ctx_lock(event);
> > +     /* ATTACH_CGROUP requires context switch callback */
> > +     if (cmd == PERF_EVENT_IOC_ATTACH_CGROUP && !event_has_cgroup_node(event))
> > +             do_sched_enable = true;
> >       ret = _perf_ioctl(event, cmd, arg);
> >       perf_event_ctx_unlock(event, ctx);
> >
> > +     /*
> > +      * Due to the circular lock dependency, it cannot call
> > +      * static_branch_enable() under the ctx->mutex.
> > +      */
> > +     if (do_sched_enable && ret >= 0)
> > +             perf_sched_enable();
> > +
> >       return ret;
> > }
> >
> > @@ -11240,33 +11661,8 @@ static void account_event(struct perf_event *event)
> >       if (event->attr.text_poke)
> >               atomic_inc(&nr_text_poke_events);
> >
> > -     if (inc) {
> > -             /*
> > -              * We need the mutex here because static_branch_enable()
> > -              * must complete *before* the perf_sched_count increment
> > -              * becomes visible.
> > -              */
> > -             if (atomic_inc_not_zero(&perf_sched_count))
> > -                     goto enabled;
> > -
> > -             mutex_lock(&perf_sched_mutex);
> > -             if (!atomic_read(&perf_sched_count)) {
> > -                     static_branch_enable(&perf_sched_events);
> > -                     /*
> > -                      * Guarantee that all CPUs observe they key change and
> > -                      * call the perf scheduling hooks before proceeding to
> > -                      * install events that need them.
> > -                      */
> > -                     synchronize_rcu();
> > -             }
> > -             /*
> > -              * Now that we have waited for the sync_sched(), allow further
> > -              * increments to by-pass the mutex.
> > -              */
> > -             atomic_inc(&perf_sched_count);
> > -             mutex_unlock(&perf_sched_mutex);
> > -     }
> > -enabled:
> > +     if (inc)
> > +             perf_sched_enable();
> >
> >       account_event_cpu(event, event->cpu);
> >
> > @@ -13008,6 +13404,7 @@ static void __init perf_event_init_all_cpus(void)
> >
> > #ifdef CONFIG_CGROUP_PERF
> >               INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
> > +             INIT_LIST_HEAD(&per_cpu(cgroup_ctx_list, cpu));
> > #endif
> >               INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
> >       }
> > @@ -13218,6 +13615,28 @@ static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
> >       return 0;
> > }
> >
> > +static int __perf_cgroup_update_node(void *info)
> > +{
> > +     struct task_struct *task = info;
> > +
> > +     rcu_read_lock();
> > +     cgroup_node_sched_out(task);
> > +     rcu_read_unlock();
> > +
> > +     return 0;
> > +}
> > +
> > +static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
> > +{
> > +     struct task_struct *task;
> > +     struct cgroup_subsys_state *css;
> > +
> > +     cgroup_taskset_for_each(task, css, tset)
> > +             task_function_call(task, __perf_cgroup_update_node, task);
> > +
> > +     return 0;
> > +}
>
> Could you please explain why we need this logic in can_attach?

IIUC the ss->attach() is called after a task's cgroup membership
is changed.  But we want to collect the performance numbers for
the old cgroup just before the change.  As the logic merely checks
the current task's cgroup, it should be done in the can_attach()
which is called before the cgroup change.
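
(The hunk that registers the callback isn't quoted in this thread, but
presumably it just adds ->can_attach next to the existing perf_event
cgroup callbacks, roughly like the sketch below; everything except the
.can_attach line is from the current kernel, not from this patch.)

/* kernel/events/core.c -- sketch of the registration, not the actual hunk */
struct cgroup_subsys perf_event_cgrp_subsys = {
	.css_alloc	= perf_cgroup_css_alloc,
	.css_free	= perf_cgroup_css_free,
	.css_online	= perf_cgroup_css_online,
	/* flush counts for the old cgroup before the membership changes */
	.can_attach	= perf_cgroup_can_attach,
	.attach		= perf_cgroup_attach,
	.implicit_on_dfl = true,
	.threaded	= true,
};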

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] perf/core: Support reading group events with shared cgroups
  2021-03-28 17:31   ` Song Liu
@ 2021-03-29 11:36     ` Namhyung Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Namhyung Kim @ 2021-03-29 11:36 UTC (permalink / raw)
  To: Song Liu
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers

On Mon, Mar 29, 2021 at 2:32 AM Song Liu <songliubraving@fb.com> wrote:
> > On Mar 23, 2021, at 9:21 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > This enables reading event group's counter values together with a
> > PERF_EVENT_IOC_READ_CGROUP command like we do in the regular read().
> > Users should give a correct size of buffer to be read.
> >
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > ---
> > kernel/events/core.c | 119 +++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 116 insertions(+), 3 deletions(-)
> >
>
> [...]
>
> > +}
> > +
> > +static int perf_event_read_cgrp_node_group(struct perf_event *event, u64 cgrp_id,
> > +                                        char __user *buf)
> > +{
> > +     struct perf_cgroup_node *cgrp;
> > +     struct perf_event_context *ctx = event->ctx;
> > +     struct perf_event *sibling;
> > +     u64 read_format = event->attr.read_format;
> > +     unsigned long flags;
> > +     u64 *values;
> > +     int n = 1;
> > +     int ret;
> > +
> > +     values = kzalloc(event->read_size, GFP_KERNEL);
> > +     if (!values)
> > +             return -ENOMEM;
> > +
> > +     values[0] = 1 + event->nr_siblings;
> > +
> > +     /* update event count and times (possibly run on other cpu) */
> > +     (void)perf_event_read(event, true);
> > +
> > +     raw_spin_lock_irqsave(&ctx->lock, flags);
> > +
> > +     cgrp = find_cgroup_node(event, cgrp_id);
> > +     if (cgrp == NULL) {
> > +             raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > +             kfree(values);
> > +             return -ENOENT;
> > +     }
> > +
> > +     if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
> > +             values[n++] = cgrp->time_enabled;
> > +     if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> > +             values[n++] = cgrp->time_running;
> > +
> > +     values[n++] = cgrp->count;
> > +     if (read_format & PERF_FORMAT_ID)
> > +             values[n++] = primary_event_id(event);
> > +
> > +     for_each_sibling_event(sibling, event) {
> > +             n += perf_event_read_cgrp_node_sibling(sibling, read_format,
> > +                                                    cgrp_id, &values[n]);
> > +     }
> > +
> > +     raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > +
> > +     ret = copy_to_user(buf, values, n * sizeof(u64));
> > +     kfree(values);
> > +     if (ret)
> > +             return -EFAULT;
> > +
> > +     return n * sizeof(u64);
> > +}
> > +
> > +static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
> > +                                    u64 cgrp_id, char __user *buf)
> > +{
> > +     u64 read_format = event->attr.read_format;
> > +
> > +     if (read_size < event->read_size + 2 * sizeof(u64))
>
> Why do we need read_size + 2 u64 here?

I should've repeated the following description from patch 1.

 * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consisting of a
     64-bit array to get the event counter values.  The first element
     is the size of the array in bytes, and the second element is the
     cgroup id to read.  The rest is used to save the counter value
     and timings.
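
To make that concrete, here is a minimal userspace sketch of the read
side, under the assumptions in the comments (read_cgroup_count() is a
made-up helper; with read_format flags or a group event the values
follow the usual read() layout, as in the group-read code quoted
above):

#include <linux/perf_event.h>
#include <sys/ioctl.h>

/* from patch 1; not in the released uapi header yet */
#ifndef PERF_EVENT_IOC_READ_CGROUP
#define PERF_EVENT_IOC_READ_CGROUP	_IOWR('$', 13, __u64 *)
#endif

/*
 * Read one cgroup's counter from a shared event opened without any
 * read_format flags.  buf[0] is the buffer size in bytes, buf[1] is
 * the cgroup id to read, and the values are written after these two
 * header elements.
 */
static int read_cgroup_count(int event_fd, __u64 cgrp_id, __u64 *count)
{
	__u64 buf[16] = { sizeof(buf), cgrp_id };

	if (ioctl(event_fd, PERF_EVENT_IOC_READ_CGROUP, buf) < 0)
		return -1;

	*count = buf[2];	/* no read_format flags assumed */
	return 0;
}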

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups
  2021-03-29 11:33     ` Namhyung Kim
@ 2021-03-30  6:33       ` Song Liu
  2021-03-30 15:11         ` Namhyung Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2021-03-30  6:33 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers, Tejun Heo



> On Mar 29, 2021, at 4:33 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> 
> On Mon, Mar 29, 2021 at 2:17 AM Song Liu <songliubraving@fb.com> wrote:
>>> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <namhyung@kernel.org> wrote:
>>> 
>>> As we can run many jobs (in container) on a big machine, we want to
>>> measure each job's performance during the run.  To do that, the
>>> perf_event can be associated to a cgroup to measure it only.
>>> 

[...]

>>> +     return 0;
>>> +}
>> 
>> Could you please explain why we need this logic in can_attach?
> 
> IIUC the ss->attach() is called after a task's cgroup membership
> is changed.  But we want to collect the performance numbers for
> the old cgroup just before the change.  As the logic merely checks
> the current task's cgroup, it should be done in the can_attach()
> which is called before the cgroup change.

Thanks for the explanations. 

Overall, I really like the core idea, especially that the overhead on 
context switch is bounded (by the depth of cgroup tree). 

Is it possible to make PERF_EVENT_IOC_ATTACH_CGROUP more flexible? 
Specifically, if we can have
 
  PERF_EVENT_IOC_ADD_CGROUP	add a cgroup to the list 
  PERF_EVENT_IOC_DEL_CGROUP	delete a cgroup from the list

we can probably share these events among multiple processes, and
these processes don't need to know each other's cgroup lists. I think
this will be useful for users to build customized monitoring in
their own containers.

Does this make sense?

Thanks,
Song


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups
  2021-03-30  6:33       ` Song Liu
@ 2021-03-30 15:11         ` Namhyung Kim
  2021-04-01  6:00           ` Stephane Eranian
  2021-04-01  6:19           ` Song Liu
  0 siblings, 2 replies; 16+ messages in thread
From: Namhyung Kim @ 2021-03-30 15:11 UTC (permalink / raw)
  To: Song Liu
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers, Tejun Heo

On Tue, Mar 30, 2021 at 3:33 PM Song Liu <songliubraving@fb.com> wrote:
> > On Mar 29, 2021, at 4:33 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Mon, Mar 29, 2021 at 2:17 AM Song Liu <songliubraving@fb.com> wrote:
> >>> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> >>>
> >>> As we can run many jobs (in container) on a big machine, we want to
> >>> measure each job's performance during the run.  To do that, the
> >>> perf_event can be associated to a cgroup to measure it only.
> >>>
>
> [...]
>
> >>> +     return 0;
> >>> +}
> >>
> >> Could you please explain why we need this logic in can_attach?
> >
> > IIUC the ss->attach() is called after a task's cgroup membership
> > is changed.  But we want to collect the performance numbers for
> > the old cgroup just before the change.  As the logic merely checks
> > the current task's cgroup, it should be done in the can_attach()
> > which is called before the cgroup change.
>
> Thanks for the explanations.
>
> Overall, I really like the core idea, especially that the overhead on
> context switch is bounded (by the depth of cgroup tree).

Thanks!

>
> Is it possible to make PERF_EVENT_IOC_ATTACH_CGROUP more flexible?
> Specifically, if we can have
>
>   PERF_EVENT_IOC_ADD_CGROUP     add a cgroup to the list
>   PERF_EVENT_IOC_DEL_CGROUP     delete a cgroup from the list
>
> we can probably share these events among multiple processes, and
> these processes don't need to know each other's cgroup lists. I think
> this will be useful for users to build customized monitoring in
> their own containers.
>
> Does this make sense?

Maybe we can add an ADD/DEL interface for more flexible monitoring,
but I'm not sure in which use cases it would actually be used.

For your multi-process sharing case, the original events' file
descriptors should be shared first.  Also adding and deleting
(or just reading) arbitrary cgroups from a container can be a
security concern IMHO.

So I just focused on the single-process multi-cgroup case which is
already used (perf stat --for-each-cgroup) and very important in my
company's setup.  In this case we have a list of interested cgroups
from the beginning so it's more efficient to create a properly sized
hash table and all the nodes at once.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups
  2021-03-30 15:11         ` Namhyung Kim
@ 2021-04-01  6:00           ` Stephane Eranian
  2021-04-01  6:19           ` Song Liu
  1 sibling, 0 replies; 16+ messages in thread
From: Stephane Eranian @ 2021-04-01  6:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Song Liu, Namhyung Kim, Ingo Molnar, Arnaldo Carvalho de Melo,
	Jiri Olsa, Mark Rutland, Alexander Shishkin, LKML, Andi Kleen,
	Ian Rogers, Tejun Heo

Hi,

I would like to re-emphasize why this patch is important. As Namhyung
outlined in his cover message, cgroup monitoring builds on top of
per-cpu monitoring and offers maximum flexibility by allowing each
event to be attached to a single cgroup. Although this was fine when
machines were much smaller and the number of simultaneous cgroups was
also small, it does not work anymore with today's machines, and even
less with future ones.  Over the last couple of years, we have tried
to make cgroup monitoring more scalable. Ian Rogers' patch series
addressed the RB-tree handling of events to avoid walking the whole
tree to find the events of the sched-in cgroup. This helped reduce
some of the overhead we are seeing, which is causing serious problems
for our end users, forcing them to tone down monitoring and slice
cgroup collection over time, which is far from ideal.

Namhyung's series goes a lot further, by addressing two key overhead factors:
  1- the file descriptor consumption explosion
  2- the context switch overhead

Again, these are a major cause of problems for us and needed to be
addressed in a way that maintained backward compatibility. We are
interested in the case where the same events are measured across all
cgroups, and I believe this is a common usage model.

1/ File descriptor issue

With the current interface, if you want to monitor 10 events on a 112
CPU server across 200 cgroups, you need:

    num_fds = num_events x num_cpus x num_cgroups
            = 10 x 112 x 200 = 224,000 descriptors

A usual Linux distribution allows around 1024. Although you could
increase the limit if you are root, this has other impacts on the
system: the kernel memory footprint needed to back these file
descriptors and struct perf_event is large (see our presentation at
LPC2019).

2/ Context switch overhead

Each time you have a cgroup switch, i.e., a context switch where you
switch cgroups, you incur a PMU event reschedule. A cgroup sched-in is
more expensive than a per-process sched-in because you have to find
the events which are relevant to the next cgroup, i.e., you may have
to walk more entries in the RB-tree. If the events are identical
across cgroups, you may end up paying that cost to reinstall the same
events (ignoring multiplexing). Furthermore, event scheduling is an
expensive operation because of memory accesses and PMU register
accesses; it is best avoided if it can be. From our experience, this
has caused significant overhead in our systems, to the point where we
have to reduce the interval at which we collect the data and the
number of cgroups we can monitor at once.


3/ Namhyung's solution

I personally like Namhyung's solution to the problem because it fits
within the existing interface and does not break the existing
per-cgroup mode. The implementation is fairly simple and non-invasive.
It provides a very significant reduction of BOTH the file descriptor
pressure and the context switch overhead. It matches perfectly with
the common usage model of monitoring the same events across multiple
cgroups simultaneously. The patch does not disrupt the existing
perf_event_open() or read()/close() syscalls. Everything is handled
via a pair of new ioctls.

It eliminates the file descriptor overhead as follows using the same
example as before:

Before:
    num_fds = num_events x num_cpus x num_cgroups
            = 10 x 112 x 200 = 224,000 descriptors
After:
    num_fds = num_events x num_cpus = 10 x 112 = 1120 descriptors
(200x reduction in fds and the memory savings that go with that in the
kernel)

In other words, it reduces the file descriptor consumption to what is
necessary for plain system-wide monitoring.

On context switch, the kernel computes the event delta and stores it
into a hash table, i.e., a single PMU register access instead of a
full PMU reschedule. The delta is propagated to the proper ancestors
in the cgroup hierarchy if needed.

The change is generic and benefits ALL processor architectures in the
same manner.

We have tested the patch on our servers with large configurations, and
it has delivered significant savings and enabled monitoring of more
cgroups simultaneously, instead of monitoring in batches, which never
yielded a consistent view of the system.

Furthermore, the patches could be extended to add, as Song Liu
suggested, the possibility of deleting cgroups attached to an event,
to allow continuous monitoring without having to restart the
monitoring tool. I believe the extension can be further improved by
allowing a vector read of the counts as well; that would eliminate a
significant number of ioctl(READ) syscalls.

Overall, I think this patch series delivers significant value-add to
the perf_events interface and should be committed ASAP.

Thanks.




On Tue, Mar 30, 2021 at 8:11 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, Mar 30, 2021 at 3:33 PM Song Liu <songliubraving@fb.com> wrote:
> > > On Mar 29, 2021, at 4:33 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Mon, Mar 29, 2021 at 2:17 AM Song Liu <songliubraving@fb.com> wrote:
> > >>> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> > >>>
> > >>> As we can run many jobs (in container) on a big machine, we want to
> > >>> measure each job's performance during the run.  To do that, the
> > >>> perf_event can be associated to a cgroup to measure it only.
> > >>>
> >
> > [...]
> >
> > >>> +     return 0;
> > >>> +}
> > >>
> > >> Could you please explain why we need this logic in can_attach?
> > >
> > > IIUC the ss->attach() is called after a task's cgroup membership
> > > is changed.  But we want to collect the performance numbers for
> > > the old cgroup just before the change.  As the logic merely checks
> > > the current task's cgroup, it should be done in the can_attach()
> > > which is called before the cgroup change.
> >
> > Thanks for the explanations.
> >
> > Overall, I really like the core idea, especially that the overhead on
> > context switch is bounded (by the depth of cgroup tree).
>
> Thanks!
>
> >
> > Is it possible to make PERF_EVENT_IOC_ATTACH_CGROUP more flexible?
> > Specifically, if we can have
> >
> >   PERF_EVENT_IOC_ADD_CGROUP     add a cgroup to the list
> >   PERF_EVENT_IOC_DEL_CGROUP      delete a cgroup from the list
> >
> > we can probably share these events among multiple processes, and
> > these processes don't need to know each other's cgroup lists. I think
> > this will be useful for users to build customized monitoring in
> > their own containers.
> >
> > Does this make sense?
>
> Maybe we can add an ADD/DEL interface for more flexible monitoring,
> but I'm not sure in which use cases it would actually be used.
>
> For your multi-process sharing case, the original events' file
> descriptors should be shared first.  Also adding and deleting
> (or just reading) arbitrary cgroups from a container can be a
> security concern IMHO.
>
> So I just focused on the single-process multi-cgroup case which is
> already used (perf stat --for-each-cgroup) and very important in my
> company's setup.  In this case we have a list of interested cgroups
> from the beginning so it's more efficient to create a properly sized
> hash table and all the nodes at once.
>
> Thanks,
> Namhyung

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups
  2021-03-30 15:11         ` Namhyung Kim
  2021-04-01  6:00           ` Stephane Eranian
@ 2021-04-01  6:19           ` Song Liu
  1 sibling, 0 replies; 16+ messages in thread
From: Song Liu @ 2021-04-01  6:19 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers, Tejun Heo



> On Mar 30, 2021, at 8:11 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> 
> On Tue, Mar 30, 2021 at 3:33 PM Song Liu <songliubraving@fb.com> wrote:
>>> On Mar 29, 2021, at 4:33 AM, Namhyung Kim <namhyung@kernel.org> wrote:
>>> 
>>> On Mon, Mar 29, 2021 at 2:17 AM Song Liu <songliubraving@fb.com> wrote:
>>>>> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <namhyung@kernel.org> wrote:
>>>>> 
>>>>> As we can run many jobs (in container) on a big machine, we want to
>>>>> measure each job's performance during the run.  To do that, the
>>>>> perf_event can be associated to a cgroup to measure it only.
>>>>> 
>> 
>> [...]
>> 
>>>>> +     return 0;
>>>>> +}
>>>> 
>>>> Could you please explain why we need this logic in can_attach?
>>> 
>>> IIUC the ss->attach() is called after a task's cgroup membership
>>> is changed.  But we want to collect the performance numbers for
>>> the old cgroup just before the change.  As the logic merely checks
>>> the current task's cgroup, it should be done in the can_attach()
>>> which is called before the cgroup change.
>> 
>> Thanks for the explanations.
>> 
>> Overall, I really like the core idea, especially that the overhead on
>> context switch is bounded (by the depth of cgroup tree).
> 
> Thanks!
> 
>> 
>> Is it possible to make PERF_EVENT_IOC_ATTACH_CGROUP more flexible?
>> Specifically, if we can have
>> 
>>  PERF_EVENT_IOC_ADD_CGROUP     add a cgroup to the list
>>  PERF_EVENT_IOC_DEL_CGROUP     delete a cgroup from the list
>> 
>> we can probably share these events among multiple processes, and
>> these processes don't need to know each other's cgroup lists. I think
>> this will be useful for users to build customized monitoring in
>> their own containers.
>> 
>> Does this make sense?
> 
> Maybe we can add an ADD/DEL interface for more flexible monitoring,
> but I'm not sure in which use cases it would actually be used.
> 
> For your multi-process sharing case, the original events' file
> descriptors should be shared first.  

Yes, we will need some other work to make the ADD/DEL interface 
work properly. Let's worry about that later. 

For both patches:

Acked-by: Song Liu <songliubraving@fb.com>

Thanks,
Song

[...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-04-01  6:20 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-23 16:21 [RFC 0/2] perf core: Sharing events with multiple cgroups Namhyung Kim
2021-03-23 16:21 ` [PATCH 1/2] perf/core: Share an event " Namhyung Kim
2021-03-24  0:30   ` Song Liu
2021-03-24  1:06     ` Namhyung Kim
2021-03-25  0:55   ` Song Liu
2021-03-25  2:44     ` Namhyung Kim
2021-03-25 12:57     ` Arnaldo Carvalho de Melo
2021-03-28 17:17   ` Song Liu
2021-03-29 11:33     ` Namhyung Kim
2021-03-30  6:33       ` Song Liu
2021-03-30 15:11         ` Namhyung Kim
2021-04-01  6:00           ` Stephane Eranian
2021-04-01  6:19           ` Song Liu
2021-03-23 16:21 ` [PATCH 2/2] perf/core: Support reading group events with shared cgroups Namhyung Kim
2021-03-28 17:31   ` Song Liu
2021-03-29 11:36     ` Namhyung Kim
