[v3,1/2] perf/core: Share an event with multiple cgroups
diff mbox series

Message ID 20210413155337.644993-2-namhyung@kernel.org
State New
Headers show
Series
  • perf core: Sharing events with multiple cgroups
Related show

Commit Message

Namhyung Kim April 13, 2021, 3:53 p.m. UTC
As we can run many jobs (in container) on a big machine, we want to
measure each job's performance during the run.  To do that, the
perf_event can be associated to a cgroup to measure it only.

However such cgroup events need to be opened separately and it causes
significant overhead in event multiplexing during the context switch
as well as resource consumption like in file descriptors and memory
footprint.

As a cgroup event is basically a cpu event, we can share a single cpu
event for multiple cgroups.  All we need is a separate counter (and
two timing variables) for each cgroup.  I added a hash table to map
from cgroup id to the attached cgroups.

With this change, the cpu event needs to calculate a delta of event
counter values when the cgroups of current and the next task are
different.  And it attributes the delta to the current task's cgroup.

This patch adds two new ioctl commands to perf_event for light-weight
cgroup event counting (i.e. perf stat).

 * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
     64-bit array to attach given cgroups.  The first element is a
     number of cgroups in the buffer, and the rest is a list of cgroup
     ids to add a cgroup info to the given event.

 * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
     array to get the event counter values.  The first element is size
     of the array in byte, and the second element is a cgroup id to
     read.  The rest is to save the counter value and timings.

This attaches all cgroups in a single syscall and I didn't add the
DETACH command deliberately to make the implementation simple.  The
attached cgroup nodes would be deleted when the file descriptor of the
perf_event is closed.

Cc: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 include/linux/perf_event.h      |  22 ++
 include/uapi/linux/perf_event.h |   2 +
 kernel/events/core.c            | 480 ++++++++++++++++++++++++++++++--
 3 files changed, 477 insertions(+), 27 deletions(-)

Comments

Peter Zijlstra April 15, 2021, 2:51 p.m. UTC | #1
On Tue, Apr 13, 2021 at 08:53:36AM -0700, Namhyung Kim wrote:
> As we can run many jobs (in container) on a big machine, we want to
> measure each job's performance during the run.  To do that, the
> perf_event can be associated to a cgroup to measure it only.
> 
> However such cgroup events need to be opened separately and it causes
> significant overhead in event multiplexing during the context switch
> as well as resource consumption like in file descriptors and memory
> footprint.
> 
> As a cgroup event is basically a cpu event, we can share a single cpu
> event for multiple cgroups.  All we need is a separate counter (and
> two timing variables) for each cgroup.  I added a hash table to map
> from cgroup id to the attached cgroups.
> 
> With this change, the cpu event needs to calculate a delta of event
> counter values when the cgroups of current and the next task are
> different.  And it attributes the delta to the current task's cgroup.
> 
> This patch adds two new ioctl commands to perf_event for light-weight

git grep "This patch" Documentation/

> cgroup event counting (i.e. perf stat).
> 
>  * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
>      64-bit array to attach given cgroups.  The first element is a
>      number of cgroups in the buffer, and the rest is a list of cgroup
>      ids to add a cgroup info to the given event.

WTH is a cgroup-id? The syscall takes a fd to the path, why have two
different ways?

>  * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
>      array to get the event counter values.  The first element is size
>      of the array in byte, and the second element is a cgroup id to
>      read.  The rest is to save the counter value and timings.

:-(

So basically you're doing a whole seconds cgroup interface, one that
violates the one counter per file premise and lives off of ioctl()s.

*IF* we're going to do something like this, I feel we should explore the
whole vector-per-fd concept before proceeding. Can we make it less yuck
(less special ioctl() and more regular file ops. Can we apply the
concept to more things?

The second patch extends the ioctl() to be more read() like, instead of
doing the sane things and extending read() by adding PERF_FORMAT_VECTOR
or whatever. In fact, this whole second ioctl() doesn't make sense to
have if we do indeed want to do vector-per-fd.

Also, I suppose you can already fake this, by having a
SW_CGROUP_SWITCHES (sorry, I though I picked those up, done now) event
with PERF_SAMPLE_READ|PERF_SAMPLE_CGROUP and PERF_FORMAT_GROUP in a
group with a bunch of events. Then the buffer will fill with the values
you use here.

Yes, I suppose it has higher overhead, but you get the data you want
without having to do terrible things like this.




Lots of random comments below.

> This attaches all cgroups in a single syscall and I didn't add the
> DETACH command deliberately to make the implementation simple.  The
> attached cgroup nodes would be deleted when the file descriptor of the
> perf_event is closed.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Reported-by: kernel test robot <lkp@intel.com>

What, the whole thing?

> Acked-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> ---
>  include/linux/perf_event.h      |  22 ++
>  include/uapi/linux/perf_event.h |   2 +
>  kernel/events/core.c            | 480 ++++++++++++++++++++++++++++++--
>  3 files changed, 477 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 3f7f89ea5e51..4b03cbadf4a0 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -771,6 +771,19 @@ struct perf_event {
>  
>  #ifdef CONFIG_CGROUP_PERF
>  	struct perf_cgroup		*cgrp; /* cgroup event is attach to */
> +
> +	/* to share an event for multiple cgroups */
> +	struct hlist_head		*cgrp_node_hash;
> +	struct perf_cgroup_node		*cgrp_node_entries;
> +	int				nr_cgrp_nodes;
> +	int				cgrp_node_hash_bits;
> +
> +	struct list_head		cgrp_node_entry;

Not related to perf_cgroup_node below, afaict the name is just plain
wrong.

> +
> +	/* snapshot of previous reading (for perf_cgroup_node below) */
> +	u64				cgrp_node_count;
> +	u64				cgrp_node_time_enabled;
> +	u64				cgrp_node_time_running;
>  #endif
>  
>  #ifdef CONFIG_SECURITY
> @@ -780,6 +793,13 @@ struct perf_event {
>  #endif /* CONFIG_PERF_EVENTS */
>  };
>  
> +struct perf_cgroup_node {
> +	struct hlist_node		node;
> +	u64				id;
> +	u64				count;
> +	u64				time_enabled;
> +	u64				time_running;
> +} ____cacheline_aligned;
>  
>  struct perf_event_groups {
>  	struct rb_root	tree;
> @@ -843,6 +863,8 @@ struct perf_event_context {
>  	int				pin_count;
>  #ifdef CONFIG_CGROUP_PERF
>  	int				nr_cgroups;	 /* cgroup evts */
> +	struct list_head		cgrp_node_list;

AFAICT this is actually a list of events, not a list of cgroup_node
thingies, hence the name is wrong.

> +	struct list_head		cgrp_ctx_entry;
>  #endif
>  	void				*task_ctx_data; /* pmu specific data */
>  	struct rcu_head			rcu_head;
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index ad15e40d7f5d..06bc7ab13616 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -479,6 +479,8 @@ struct perf_event_query_bpf {
>  #define PERF_EVENT_IOC_PAUSE_OUTPUT		_IOW('$', 9, __u32)
>  #define PERF_EVENT_IOC_QUERY_BPF		_IOWR('$', 10, struct perf_event_query_bpf *)
>  #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES	_IOW('$', 11, struct perf_event_attr *)
> +#define PERF_EVENT_IOC_ATTACH_CGROUP		_IOW('$', 12, __u64 *)
> +#define PERF_EVENT_IOC_READ_CGROUP		_IOWR('$', 13, __u64 *)
>  
>  enum perf_event_ioc_flags {
>  	PERF_IOC_FLAG_GROUP		= 1U << 0,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f07943183041..bcf51c0b7855 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -379,6 +379,7 @@ enum event_type_t {
>   * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
>   */
>  
> +static void perf_sched_enable(void);
>  static void perf_sched_delayed(struct work_struct *work);
>  DEFINE_STATIC_KEY_FALSE(perf_sched_events);
>  static DECLARE_DELAYED_WORK(perf_sched_work, perf_sched_delayed);
> @@ -2124,6 +2125,323 @@ static int perf_get_aux_event(struct perf_event *event,
>  	return 1;
>  }
>  
> +#ifdef CONFIG_CGROUP_PERF
> +static DEFINE_PER_CPU(struct list_head, cgroup_ctx_list);
> +
> +static bool event_can_attach_cgroup(struct perf_event *event)
> +{
> +	if (is_sampling_event(event))
> +		return false;
> +	if (event->attach_state & PERF_ATTACH_TASK)
> +		return false;
> +	if (is_cgroup_event(event))
> +		return false;

Why? You could be doing a subtree.

> +
> +	return true;
> +}
> +
> +static bool event_has_cgroup_node(struct perf_event *event)
> +{
> +	return event->nr_cgrp_nodes > 0;
> +}
> +
> +static struct perf_cgroup_node *
> +find_cgroup_node(struct perf_event *event, u64 cgrp_id)
> +{
> +	struct perf_cgroup_node *cgrp_node;
> +	int key = hash_64(cgrp_id, event->cgrp_node_hash_bits);
> +
> +	hlist_for_each_entry(cgrp_node, &event->cgrp_node_hash[key], node) {
> +		if (cgrp_node->id == cgrp_id)
> +			return cgrp_node;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void perf_update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> +{
> +	u64 delta_count, delta_time_enabled, delta_time_running;
> +	int i;
> +
> +	if (event->cgrp_node_count == 0)
> +		goto out;
> +
> +	delta_count = local64_read(&event->count) - event->cgrp_node_count;
> +	delta_time_enabled = event->total_time_enabled - event->cgrp_node_time_enabled;
> +	delta_time_running = event->total_time_running - event->cgrp_node_time_running;
> +
> +	/* account delta to all ancestor cgroups */
> +	for (i = 0; i <= cgrp->level; i++) {
> +		struct perf_cgroup_node *node;
> +
> +		node = find_cgroup_node(event, cgrp->ancestor_ids[i]);
> +		if (node) {
> +			node->count += delta_count;
> +			node->time_enabled += delta_time_enabled;
> +			node->time_running += delta_time_running;
> +		}
> +	}
> +
> +out:
> +	event->cgrp_node_count = local64_read(&event->count);
> +	event->cgrp_node_time_enabled = event->total_time_enabled;
> +	event->cgrp_node_time_running = event->total_time_running;

This is wrong; there's no guarantee these are the same values you read
at the begin, IOW you could be loosing events.

> +}
> +
> +static void update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> +{
> +	if (event->state == PERF_EVENT_STATE_ACTIVE)
> +		event->pmu->read(event);
> +
> +	perf_event_update_time(event);
> +	perf_update_cgroup_node(event, cgrp);
> +}
> +
> +/* this is called from context switch */
> +static void update_cgroup_node_events(struct perf_event_context *ctx,
> +				      struct cgroup *cgrp)
> +{
> +	struct perf_event *event;
> +
> +	lockdep_assert_held(&ctx->lock);
> +
> +	if (ctx->is_active & EVENT_TIME)
> +		update_context_time(ctx);
> +
> +	list_for_each_entry(event, &ctx->cgrp_node_list, cgrp_node_entry)
> +		update_cgroup_node(event, cgrp);
> +}
> +
> +static void cgroup_node_sched_out(struct task_struct *task)

Naming seems confused.

> +{
> +	struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
> +	struct perf_cgroup *cgrp = perf_cgroup_from_task(task, NULL);
> +	struct perf_event_context *ctx;
> +
> +	list_for_each_entry(ctx, cgrp_ctx_list, cgrp_ctx_entry) {
> +		raw_spin_lock(&ctx->lock);
> +		update_cgroup_node_events(ctx, cgrp->css.cgroup);
> +		raw_spin_unlock(&ctx->lock);
> +	}
> +}
> +
> +/* these are called from the when an event is enabled/disabled */

That sentence needs help.

> +static void perf_add_cgrp_node_list(struct perf_event *event,
> +				    struct perf_event_context *ctx)
> +{
> +	struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
> +	struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
> +	bool is_first;
> +
> +	lockdep_assert_irqs_disabled();
> +	lockdep_assert_held(&ctx->lock);

The latter very much implies the former, no?

> +
> +	is_first = list_empty(&ctx->cgrp_node_list);
> +	list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);

See the naming being daft.

> +
> +	if (is_first)
> +		list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);

While here it actually makes sense.

> +
> +	update_cgroup_node(event, cgrp->css.cgroup);
> +}
> +
> +static void perf_del_cgrp_node_list(struct perf_event *event,
> +				    struct perf_event_context *ctx)
> +{
> +	struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
> +
> +	lockdep_assert_irqs_disabled();
> +	lockdep_assert_held(&ctx->lock);
> +
> +	update_cgroup_node(event, cgrp->css.cgroup);
> +	/* to refresh delta when it's enabled */
> +	event->cgrp_node_count = 0;
> +
> +	list_del(&event->cgrp_node_entry);
> +
> +	if (list_empty(&ctx->cgrp_node_list))
> +		list_del(&ctx->cgrp_ctx_entry);
> +}
> +
> +static void perf_attach_cgroup_node(struct perf_event *event,
> +				    struct perf_cpu_context *cpuctx,
> +				    struct perf_event_context *ctx,
> +				    void *data)
> +{
> +	if (ctx->is_active & EVENT_TIME)
> +		update_context_time(ctx);
> +
> +	perf_add_cgrp_node_list(event, ctx);
> +}
> +
> +#define MIN_CGRP_NODE_HASH  4
> +#define MAX_CGRP_NODE_HASH  (4 * 1024)

So today you think 200 cgroups is sane, tomorrow you'll complain 4k
cgroups is not enough.

> +
> +/* this is called from ioctl */
> +static int perf_event_attach_cgroup_node(struct perf_event *event, u64 nr_cgrps,
> +					 u64 *cgroup_ids)
> +{
> +	struct perf_cgroup_node *cgrp_node;
> +	struct perf_event_context *ctx = event->ctx;
> +	struct hlist_head *cgrp_node_hash;
> +	int node = (event->cpu >= 0) ? cpu_to_node(event->cpu) : -1;

How many more copies of that do we need?

> +	unsigned long flags;
> +	bool is_first = true;
> +	bool enabled;
> +	int i, nr_hash;
> +	int hash_bits;
> +
> +	if (nr_cgrps < MIN_CGRP_NODE_HASH)
> +		nr_hash = MIN_CGRP_NODE_HASH;
> +	else
> +		nr_hash = roundup_pow_of_two(nr_cgrps);
> +	hash_bits = ilog2(nr_hash);

That's like the complicated version of:

	hash_bits = 1 + ilog2(max(MIN_CGRP_NODE_HASH, nr_cgrps) - 1);

?

> +
> +	cgrp_node_hash = kcalloc_node(nr_hash, sizeof(*cgrp_node_hash),
> +				      GFP_KERNEL, node);
> +	if (cgrp_node_hash == NULL)
> +		return -ENOMEM;
> +
> +	cgrp_node = kcalloc_node(nr_cgrps, sizeof(*cgrp_node), GFP_KERNEL, node);
> +	if (cgrp_node == NULL) {
> +		kfree(cgrp_node_hash);
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < (int)nr_cgrps; i++) {
> +		int key = hash_64(cgroup_ids[i], hash_bits);
> +
> +		cgrp_node[i].id = cgroup_ids[i];
> +		hlist_add_head(&cgrp_node[i].node, &cgrp_node_hash[key]);
> +	}
> +
> +	raw_spin_lock_irqsave(&ctx->lock, flags);
> +
> +	enabled = event->state >= PERF_EVENT_STATE_INACTIVE;
> +
> +	if (event->nr_cgrp_nodes != 0) {
> +		kfree(event->cgrp_node_hash);
> +		kfree(event->cgrp_node_entries);
> +		is_first = false;
> +	}

So if we already had cgroups attached, we just plunk whatever state we
had, without re-hashing? That's hardly sane semantics for something
called 'attach'.

And if you want this behaviour, then you should probably also accept
nr_cgrps==0, but you don't do that either.

> +
> +	event->cgrp_node_hash = cgrp_node_hash;
> +	event->cgrp_node_entries = cgrp_node;
> +	event->cgrp_node_hash_bits = hash_bits;
> +	event->nr_cgrp_nodes = nr_cgrps;
> +
> +	raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +
> +	if (is_first && enabled)
> +		event_function_call(event, perf_attach_cgroup_node, NULL);
> +
> +	return 0;
> +}
> +
> +static void perf_event_destroy_cgroup_nodes(struct perf_event *event)
> +{
> +	struct perf_event_context *ctx = event->ctx;
> +	unsigned long flags;
> +
> +	raw_spin_lock_irqsave(&ctx->lock, flags);
> +
> +	if (event_has_cgroup_node(event)) {
> +		if (!atomic_add_unless(&perf_sched_count, -1, 1))
> +			schedule_delayed_work(&perf_sched_work, HZ);
> +	}

Below you extract perf_sched_enable(), so this is somewhat inconsistent
for not being perf_sched_disable() I'm thinking.

Also, the placement seems weird, do you really want this under
ctx->lock?

> +
> +	kfree(event->cgrp_node_hash);
> +	kfree(event->cgrp_node_entries);
> +	event->nr_cgrp_nodes = 0;
> +
> +	raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +}
> +
> +static int perf_event_read(struct perf_event *event, bool group);
> +
> +static void __perf_read_cgroup_node(struct perf_event *event)
> +{
> +	struct perf_cgroup *cgrp;
> +
> +	if (event_has_cgroup_node(event)) {
> +		cgrp = perf_cgroup_from_task(current, NULL);
> +		perf_update_cgroup_node(event, cgrp->css.cgroup);
> +	}
> +}
> +
> +static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
> +				       u64 cgrp_id, char __user *buf)
> +{
> +	struct perf_cgroup_node *cgrp;
> +	struct perf_event_context *ctx = event->ctx;
> +	unsigned long flags;
> +	u64 read_format = event->attr.read_format;
> +	u64 values[4];
> +	int n = 0;
> +
> +	/* update event count and times (possibly run on other cpu) */
> +	(void)perf_event_read(event, false);
> +
> +	raw_spin_lock_irqsave(&ctx->lock, flags);
> +
> +	cgrp = find_cgroup_node(event, cgrp_id);
> +	if (cgrp == NULL) {
> +		raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +		return -ENOENT;
> +	}
> +
> +	values[n++] = cgrp->count;
> +	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
> +		values[n++] = cgrp->time_enabled;
> +	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> +		values[n++] = cgrp->time_running;
> +	if (read_format & PERF_FORMAT_ID)
> +		values[n++] = primary_event_id(event);
> +
> +	raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +
> +	if (copy_to_user(buf, values, n * sizeof(u64)))
> +		return -EFAULT;
> +
> +	return n * sizeof(u64);
> +}
> +#else  /* !CONFIG_CGROUP_PERF */
> +static inline bool event_can_attach_cgroup(struct perf_event *event)
> +{
> +	return false;
> +}
> +
> +static inline bool event_has_cgroup_node(struct perf_event *event)
> +{
> +	return false;
> +}
> +
> +static inline void cgroup_node_sched_out(struct task_struct *task) {}
> +
> +static inline void perf_add_cgrp_node_list(struct perf_event *event,
> +					   struct perf_event_context *ctx) {}
> +static inline void perf_del_cgrp_node_list(struct perf_event *event,
> +					   struct perf_event_context *ctx) {}
> +
> +#define MAX_CGRP_NODE_HASH  1
> +static inline int perf_event_attach_cgroup_node(struct perf_event *event,
> +						u64 nr_cgrps, u64 *cgrp_ids)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline void perf_event_destroy_cgroup_nodes(struct perf_event *event) {}
> +static inline  void __perf_read_cgroup_node(struct perf_event *event) {}
> +
> +static inline int perf_event_read_cgroup_node(struct perf_event *event,
> +					      u64 read_size, u64 cgrp_id,
> +					      char __user *buf)
> +{
> +	return -EINVAL;
> +}
> +#endif  /* CONFIG_CGROUP_PERF */
> +
>  static inline struct list_head *get_event_list(struct perf_event *event)
>  {
>  	struct perf_event_context *ctx = event->ctx;
> @@ -2407,6 +2725,7 @@ static void __perf_event_disable(struct perf_event *event,
>  
>  	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>  	perf_cgroup_event_disable(event, ctx);
> +	perf_del_cgrp_node_list(event, ctx);
>  }
>  
>  /*
> @@ -2946,6 +3265,7 @@ static void __perf_event_enable(struct perf_event *event,
>  
>  	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>  	perf_cgroup_event_enable(event, ctx);
> +	perf_add_cgrp_node_list(event, ctx);
>  
>  	if (!ctx->is_active)
>  		return;
> @@ -3568,6 +3888,13 @@ void __perf_event_task_sched_out(struct task_struct *task,
>  	 */
>  	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
>  		perf_cgroup_sched_out(task, next);
> +
> +#ifdef CONFIG_CGROUP_PERF
> +	if (!list_empty(this_cpu_ptr(&cgroup_ctx_list)) &&
> +	    perf_cgroup_from_task(task, NULL) !=
> +	    perf_cgroup_from_task(next, NULL))
> +		cgroup_node_sched_out(task);
> +#endif

Please, fold this into that one cgroup branch you already have here.
Don't pullute things further.

>  }
>  
>  /*
> @@ -4268,6 +4595,7 @@ static void __perf_event_read(void *info)
>  
>  	if (!data->group) {
>  		pmu->read(event);
> +		__perf_read_cgroup_node(event);
>  		data->ret = 0;
>  		goto unlock;
>  	}
> @@ -4283,6 +4611,7 @@ static void __perf_event_read(void *info)
>  			 * sibling could be on different (eg: software) PMU.
>  			 */
>  			sub->pmu->read(sub);
> +			__perf_read_cgroup_node(sub);
>  		}
>  	}
>  

Why though; nothing here looks at the new cgroup state.

> @@ -4462,6 +4791,10 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
>  	INIT_LIST_HEAD(&ctx->pinned_active);
>  	INIT_LIST_HEAD(&ctx->flexible_active);
>  	refcount_set(&ctx->refcount, 1);
> +#ifdef CONFIG_CGROUP_PERF
> +	INIT_LIST_HEAD(&ctx->cgrp_ctx_entry);
> +	INIT_LIST_HEAD(&ctx->cgrp_node_list);
> +#endif
>  }
>  
>  static struct perf_event_context *
> @@ -4851,6 +5184,8 @@ static void _free_event(struct perf_event *event)
>  	if (is_cgroup_event(event))
>  		perf_detach_cgroup(event);
>  
> +	perf_event_destroy_cgroup_nodes(event);
> +
>  	if (!event->parent) {
>  		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
>  			put_callchain_buffers();
> @@ -5571,6 +5906,58 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
>  
>  		return perf_event_modify_attr(event,  &new_attr);
>  	}
> +
> +	case PERF_EVENT_IOC_ATTACH_CGROUP: {
> +		u64 nr_cgrps;
> +		u64 *cgrp_buf;
> +		size_t cgrp_bufsz;
> +		int ret;
> +
> +		if (!event_can_attach_cgroup(event))
> +			return -EINVAL;
> +
> +		if (copy_from_user(&nr_cgrps, (u64 __user *)arg,
> +				   sizeof(nr_cgrps)))
> +			return -EFAULT;
> +
> +		if (nr_cgrps == 0 || nr_cgrps > MAX_CGRP_NODE_HASH)
> +			return -EINVAL;
> +
> +		cgrp_bufsz = nr_cgrps * sizeof(*cgrp_buf);
> +
> +		cgrp_buf = kmalloc(cgrp_bufsz, GFP_KERNEL);
> +		if (cgrp_buf == NULL)
> +			return -ENOMEM;
> +
> +		if (copy_from_user(cgrp_buf, (u64 __user *)(arg + 8),
> +				   cgrp_bufsz)) {
> +			kfree(cgrp_buf);
> +			return -EFAULT;
> +		}
> +
> +		ret = perf_event_attach_cgroup_node(event, nr_cgrps, cgrp_buf);
> +
> +		kfree(cgrp_buf);
> +		return ret;
> +	}
> +
> +	case PERF_EVENT_IOC_READ_CGROUP: {
> +		u64 read_size, cgrp_id;
> +
> +		if (!event_can_attach_cgroup(event))
> +			return -EINVAL;
> +
> +		if (copy_from_user(&read_size, (u64 __user *)arg,
> +				   sizeof(read_size)))
> +			return -EFAULT;
> +		if (copy_from_user(&cgrp_id, (u64 __user *)(arg + 8),
> +				   sizeof(cgrp_id)))
> +			return -EFAULT;
> +
> +		return perf_event_read_cgroup_node(event, read_size, cgrp_id,
> +						   (char __user *)(arg + 16));
> +	}
> +
>  	default:
>  		return -ENOTTY;
>  	}
> @@ -5583,10 +5970,39 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
>  	return 0;
>  }
>  
> +static void perf_sched_enable(void)
> +{
> +	/*
> +	 * We need the mutex here because static_branch_enable()
> +	 * must complete *before* the perf_sched_count increment
> +	 * becomes visible.
> +	 */
> +	if (atomic_inc_not_zero(&perf_sched_count))
> +		return;
> +
> +	mutex_lock(&perf_sched_mutex);
> +	if (!atomic_read(&perf_sched_count)) {
> +		static_branch_enable(&perf_sched_events);
> +		/*
> +		 * Guarantee that all CPUs observe they key change and
> +		 * call the perf scheduling hooks before proceeding to
> +		 * install events that need them.
> +		 */
> +		synchronize_rcu();
> +	}
> +	/*
> +	 * Now that we have waited for the sync_sched(), allow further
> +	 * increments to by-pass the mutex.
> +	 */
> +	atomic_inc(&perf_sched_count);
> +	mutex_unlock(&perf_sched_mutex);
> +}

Per the above, this is missing perf_sched_disable(). Also, this should
probably be a separate patch then.

> +
>  static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  {
>  	struct perf_event *event = file->private_data;
>  	struct perf_event_context *ctx;
> +	bool do_sched_enable = false;
>  	long ret;
>  
>  	/* Treat ioctl like writes as it is likely a mutating operation. */
> @@ -5595,9 +6011,19 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  		return ret;
>  
>  	ctx = perf_event_ctx_lock(event);
> +	/* ATTACH_CGROUP requires context switch callback */
> +	if (cmd == PERF_EVENT_IOC_ATTACH_CGROUP && !event_has_cgroup_node(event))
> +		do_sched_enable = true;
>  	ret = _perf_ioctl(event, cmd, arg);
>  	perf_event_ctx_unlock(event, ctx);
>  
> +	/*
> +	 * Due to the circular lock dependency, it cannot call
> +	 * static_branch_enable() under the ctx->mutex.
> +	 */
> +	if (do_sched_enable && ret >= 0)
> +		perf_sched_enable();
> +
>  	return ret;
>  }

Hurmph... not much choice there I suppose.

> @@ -11240,33 +11666,8 @@ static void account_event(struct perf_event *event)
>  	if (event->attr.text_poke)
>  		atomic_inc(&nr_text_poke_events);
>  
> -	if (inc) {
> -		/*
> -		 * We need the mutex here because static_branch_enable()
> -		 * must complete *before* the perf_sched_count increment
> -		 * becomes visible.
> -		 */
> -		if (atomic_inc_not_zero(&perf_sched_count))
> -			goto enabled;
> -
> -		mutex_lock(&perf_sched_mutex);
> -		if (!atomic_read(&perf_sched_count)) {
> -			static_branch_enable(&perf_sched_events);
> -			/*
> -			 * Guarantee that all CPUs observe they key change and
> -			 * call the perf scheduling hooks before proceeding to
> -			 * install events that need them.
> -			 */
> -			synchronize_rcu();
> -		}
> -		/*
> -		 * Now that we have waited for the sync_sched(), allow further
> -		 * increments to by-pass the mutex.
> -		 */
> -		atomic_inc(&perf_sched_count);
> -		mutex_unlock(&perf_sched_mutex);
> -	}
> -enabled:
> +	if (inc)
> +		perf_sched_enable();
>  
>  	account_event_cpu(event, event->cpu);
>  
> @@ -13008,6 +13409,7 @@ static void __init perf_event_init_all_cpus(void)
>  
>  #ifdef CONFIG_CGROUP_PERF
>  		INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
> +		INIT_LIST_HEAD(&per_cpu(cgroup_ctx_list, cpu));
>  #endif
>  		INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
>  	}
> @@ -13218,6 +13620,29 @@ static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
>  	return 0;
>  }
>  
> +static int __perf_cgroup_update_node(void *info)
> +{
> +	struct task_struct *task = info;
> +
> +	rcu_read_lock();
> +	cgroup_node_sched_out(task);
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +
> +/* update cgroup counter BEFORE task's cgroup is changed */
> +static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
> +{
> +	struct task_struct *task;
> +	struct cgroup_subsys_state *css;
> +
> +	cgroup_taskset_for_each(task, css, tset)
> +		task_function_call(task, __perf_cgroup_update_node, task);
> +
> +	return 0;
> +}
> +
>  static int __perf_cgroup_move(void *info)
>  {
>  	struct task_struct *task = info;
> @@ -13240,6 +13665,7 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
>  	.css_alloc	= perf_cgroup_css_alloc,
>  	.css_free	= perf_cgroup_css_free,
>  	.css_online	= perf_cgroup_css_online,
> +	.can_attach	= perf_cgroup_can_attach,
>  	.attach		= perf_cgroup_attach,
>  	/*
>  	 * Implicitly enable on dfl hierarchy so that perf events can
Namhyung Kim April 15, 2021, 11:48 p.m. UTC | #2
Hi Peter,

Thanks for your review!

On Thu, Apr 15, 2021 at 11:51 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Apr 13, 2021 at 08:53:36AM -0700, Namhyung Kim wrote:
> > As we can run many jobs (in container) on a big machine, we want to
> > measure each job's performance during the run.  To do that, the
> > perf_event can be associated to a cgroup to measure it only.
> >
> > However such cgroup events need to be opened separately and it causes
> > significant overhead in event multiplexing during the context switch
> > as well as resource consumption like in file descriptors and memory
> > footprint.
> >
> > As a cgroup event is basically a cpu event, we can share a single cpu
> > event for multiple cgroups.  All we need is a separate counter (and
> > two timing variables) for each cgroup.  I added a hash table to map
> > from cgroup id to the attached cgroups.
> >
> > With this change, the cpu event needs to calculate a delta of event
> > counter values when the cgroups of current and the next task are
> > different.  And it attributes the delta to the current task's cgroup.
> >
> > This patch adds two new ioctl commands to perf_event for light-weight
>
> git grep "This patch" Documentation/

Ok, will change.

>
> > cgroup event counting (i.e. perf stat).
> >
> >  * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> >      64-bit array to attach given cgroups.  The first element is a
> >      number of cgroups in the buffer, and the rest is a list of cgroup
> >      ids to add a cgroup info to the given event.
>
> WTH is a cgroup-id? The syscall takes a fd to the path, why have two
> different ways?

As you know, we already use cgroup-id for sampling.  Yeah we
can do it with the fd but one of the point in this patch is to reduce
the number of file descriptors. :)

Also, having cgroup-id is good to match with the result (from read)
as it contains the cgroup information.


>
> >  * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> >      array to get the event counter values.  The first element is size
> >      of the array in byte, and the second element is a cgroup id to
> >      read.  The rest is to save the counter value and timings.
>
> :-(
>
> So basically you're doing a whole seconds cgroup interface, one that
> violates the one counter per file premise and lives off of ioctl()s.

Right, but I'm not sure that we really want a separate event for each
cgroup if underlying hardware events are all the same.

>
> *IF* we're going to do something like this, I feel we should explore the
> whole vector-per-fd concept before proceeding. Can we make it less yuck
> (less special ioctl() and more regular file ops. Can we apply the
> concept to more things?

Ideally it'd do without keeping file descriptors open.  Maybe we can make
the vector accept various types like vector-per-cgroup_id or so.

>
> The second patch extends the ioctl() to be more read() like, instead of
> doing the sane things and extending read() by adding PERF_FORMAT_VECTOR
> or whatever. In fact, this whole second ioctl() doesn't make sense to
> have if we do indeed want to do vector-per-fd.

One of the upside of the ioctl() is that we can pass cgroup-id to read.
Probably we can keep the index in the vector and set the file offset
with it.  Or else just read the whole vector, and then it has a cgroup-id
in the output like PERF_FORMAT_CGROUP?

>
> Also, I suppose you can already fake this, by having a
> SW_CGROUP_SWITCHES (sorry, I though I picked those up, done now) event

Thanks!

> with PERF_SAMPLE_READ|PERF_SAMPLE_CGROUP and PERF_FORMAT_GROUP in a
> group with a bunch of events. Then the buffer will fill with the values
> you use here.

Right, I'll do an experiment with it.

>
> Yes, I suppose it has higher overhead, but you get the data you want
> without having to do terrible things like this.

That's true.  And we don't need many things in the perf record like
synthesizing task/mmap info.  Also there's a risk we can miss some
samples for some reason.

Another concern is that it'd add huge slow down in the perf event
open as it creates a mixed sw/hw group.  The synchronized_rcu in
the move_cgroup path caused significant problems in my
environment as it adds up in proportion to the number of cpus.

>
>
>
>
> Lots of random comments below.

Thanks for the review, I'll reply in a separate thread.

Namhyung
Peter Zijlstra April 16, 2021, 9:26 a.m. UTC | #3
On Fri, Apr 16, 2021 at 08:48:12AM +0900, Namhyung Kim wrote:
> On Thu, Apr 15, 2021 at 11:51 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Apr 13, 2021 at 08:53:36AM -0700, Namhyung Kim wrote:

> > > cgroup event counting (i.e. perf stat).
> > >
> > >  * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> > >      64-bit array to attach given cgroups.  The first element is a
> > >      number of cgroups in the buffer, and the rest is a list of cgroup
> > >      ids to add a cgroup info to the given event.
> >
> > WTH is a cgroup-id? The syscall takes a fd to the path, why have two
> > different ways?
> 
> As you know, we already use cgroup-id for sampling.  Yeah we
> can do it with the fd but one of the point in this patch is to reduce
> the number of file descriptors. :)

Well, I found those patches again after I wrote that. But I'm still not
sure what a cgroup-id is from userspace.

How does userspace get one given a cgroup? (I actually mounted cgroupfs
in order to see if there's some new 'id' file to read, there is not)
Does having the cgroup-id ensure the cgroup exists? Can the cgroup-id
get re-used?

I really don't konw what the thing is. I don't use cgroups, like ever,
except when I'm forced to due to some regression or bugreport.

> Also, having cgroup-id is good to match with the result (from read)
> as it contains the cgroup information.

What?

> > >  * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> > >      array to get the event counter values.  The first element is size
> > >      of the array in byte, and the second element is a cgroup id to
> > >      read.  The rest is to save the counter value and timings.
> >
> > :-(
> >
> > So basically you're doing a whole seconds cgroup interface, one that
> > violates the one counter per file premise and lives off of ioctl()s.
> 
> Right, but I'm not sure that we really want a separate event for each
> cgroup if underlying hardware events are all the same.

Sure, I see where you're coming from; I just don't much like where it
got you :-)

> > *IF* we're going to do something like this, I feel we should explore the
> > whole vector-per-fd concept before proceeding. Can we make it less yuck
> > (less special ioctl() and more regular file ops. Can we apply the
> > concept to more things?
> 
> Ideally it'd do without keeping file descriptors open.  Maybe we can make
> the vector accept various types like vector-per-cgroup_id or so.

So I think we've had proposals for being able to close fds in the past;
while preserving groups etc. We've always pushed back on that because of
the resource limit issue. By having each counter be a filedesc we get a
natural limit on the amount of resources you can consume. And in that
respect, having to use 400k fds is things working as designed.

Anyway, there might be a way around this..

> > The second patch extends the ioctl() to be more read() like, instead of
> > doing the sane things and extending read() by adding PERF_FORMAT_VECTOR
> > or whatever. In fact, this whole second ioctl() doesn't make sense to
> > have if we do indeed want to do vector-per-fd.
> 
> One of the upside of the ioctl() is that we can pass cgroup-id to read.
> Probably we can keep the index in the vector and set the file offset
> with it.  Or else just read the whole vector, and then it has a cgroup-id
> in the output like PERF_FORMAT_CGROUP?
> 
> >
> > Also, I suppose you can already fake this, by having a
> > SW_CGROUP_SWITCHES (sorry, I though I picked those up, done now) event
> 
> Thanks!
> 
> > with PERF_SAMPLE_READ|PERF_SAMPLE_CGROUP and PERF_FORMAT_GROUP in a
> > group with a bunch of events. Then the buffer will fill with the values
> > you use here.
> 
> Right, I'll do an experiment with it.
> 
> >
> > Yes, I suppose it has higher overhead, but you get the data you want
> > without having to do terrible things like this.
> 
> That's true.  And we don't need many things in the perf record like
> synthesizing task/mmap info.  Also there's a risk we can miss some
> samples for some reason.
> 
> Another concern is that it'd add huge slow down in the perf event
> open as it creates a mixed sw/hw group.  The synchronized_rcu in
> the move_cgroup path caused significant problems in my
> environment as it adds up in proportion to the number of cpus.

Since when is perf_event_open() a performance concern? That thing is
slow in all possible ways.
Peter Zijlstra April 16, 2021, 9:29 a.m. UTC | #4
Duh.. this is a half-finished email I meant to save for later. Anyway,
I'll reply more.

On Fri, Apr 16, 2021 at 11:26:39AM +0200, Peter Zijlstra wrote:
> On Fri, Apr 16, 2021 at 08:48:12AM +0900, Namhyung Kim wrote:
> > On Thu, Apr 15, 2021 at 11:51 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Tue, Apr 13, 2021 at 08:53:36AM -0700, Namhyung Kim wrote:
> 
> > > > cgroup event counting (i.e. perf stat).
> > > >
> > > >  * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> > > >      64-bit array to attach given cgroups.  The first element is a
> > > >      number of cgroups in the buffer, and the rest is a list of cgroup
> > > >      ids to add a cgroup info to the given event.
> > >
> > > WTH is a cgroup-id? The syscall takes a fd to the path, why have two
> > > different ways?
> > 
> > As you know, we already use cgroup-id for sampling.  Yeah we
> > can do it with the fd but one of the point in this patch is to reduce
> > the number of file descriptors. :)
> 
> Well, I found those patches again after I wrote that. But I'm still not
> sure what a cgroup-id is from userspace.
> 
> How does userspace get one given a cgroup? (I actually mounted cgroupfs
> in order to see if there's some new 'id' file to read, there is not)
> Does having the cgroup-id ensure the cgroup exists? Can the cgroup-id
> get re-used?
> 
> I really don't konw what the thing is. I don't use cgroups, like ever,
> except when I'm forced to due to some regression or bugreport.
> 
> > Also, having cgroup-id is good to match with the result (from read)
> > as it contains the cgroup information.
> 
> What?
> 
> > > >  * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> > > >      array to get the event counter values.  The first element is size
> > > >      of the array in byte, and the second element is a cgroup id to
> > > >      read.  The rest is to save the counter value and timings.
> > >
> > > :-(
> > >
> > > So basically you're doing a whole seconds cgroup interface, one that
> > > violates the one counter per file premise and lives off of ioctl()s.
> > 
> > Right, but I'm not sure that we really want a separate event for each
> > cgroup if underlying hardware events are all the same.
> 
> Sure, I see where you're coming from; I just don't much like where it
> got you :-)
> 
> > > *IF* we're going to do something like this, I feel we should explore the
> > > whole vector-per-fd concept before proceeding. Can we make it less yuck
> > > (less special ioctl() and more regular file ops. Can we apply the
> > > concept to more things?
> > 
> > Ideally it'd do without keeping file descriptors open.  Maybe we can make
> > the vector accept various types like vector-per-cgroup_id or so.
> 
> So I think we've had proposals for being able to close fds in the past;
> while preserving groups etc. We've always pushed back on that because of
> the resource limit issue. By having each counter be a filedesc we get a
> natural limit on the amount of resources you can consume. And in that
> respect, having to use 400k fds is things working as designed.
> 
> Anyway, there might be a way around this..
> 
> > > The second patch extends the ioctl() to be more read() like, instead of
> > > doing the sane things and extending read() by adding PERF_FORMAT_VECTOR
> > > or whatever. In fact, this whole second ioctl() doesn't make sense to
> > > have if we do indeed want to do vector-per-fd.
> > 
> > One of the upside of the ioctl() is that we can pass cgroup-id to read.
> > Probably we can keep the index in the vector and set the file offset
> > with it.  Or else just read the whole vector, and then it has a cgroup-id
> > in the output like PERF_FORMAT_CGROUP?
> > 
> > >
> > > Also, I suppose you can already fake this, by having a
> > > SW_CGROUP_SWITCHES (sorry, I though I picked those up, done now) event
> > 
> > Thanks!
> > 
> > > with PERF_SAMPLE_READ|PERF_SAMPLE_CGROUP and PERF_FORMAT_GROUP in a
> > > group with a bunch of events. Then the buffer will fill with the values
> > > you use here.
> > 
> > Right, I'll do an experiment with it.
> > 
> > >
> > > Yes, I suppose it has higher overhead, but you get the data you want
> > > without having to do terrible things like this.
> > 
> > That's true.  And we don't need many things in the perf record like
> > synthesizing task/mmap info.  Also there's a risk we can miss some
> > samples for some reason.
> > 
> > Another concern is that it'd add huge slow down in the perf event
> > open as it creates a mixed sw/hw group.  The synchronized_rcu in
> > the move_cgroup path caused significant problems in my
> > environment as it adds up in proportion to the number of cpus.
> 
> Since when is perf_event_open() a performance concern? That thing is
> slow in all possible ways.
Namhyung Kim April 16, 2021, 9:49 a.m. UTC | #5
On Thu, Apr 15, 2021 at 11:51 PM Peter Zijlstra <peterz@infradead.org> wrote:
> Lots of random comments below.
>
> > This attaches all cgroups in a single syscall and I didn't add the
> > DETACH command deliberately to make the implementation simple.  The
> > attached cgroup nodes would be deleted when the file descriptor of the
> > perf_event is closed.
> >
> > Cc: Tejun Heo <tj@kernel.org>
> > Reported-by: kernel test robot <lkp@intel.com>
>
> What, the whole thing?

Oh, it's just for build issues when !CONFIG_CGROUP_PERF

>
> > Acked-by: Song Liu <songliubraving@fb.com>
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > ---
> >  include/linux/perf_event.h      |  22 ++
> >  include/uapi/linux/perf_event.h |   2 +
> >  kernel/events/core.c            | 480 ++++++++++++++++++++++++++++++--
> >  3 files changed, 477 insertions(+), 27 deletions(-)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 3f7f89ea5e51..4b03cbadf4a0 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -771,6 +771,19 @@ struct perf_event {
> >
> >  #ifdef CONFIG_CGROUP_PERF
> >       struct perf_cgroup              *cgrp; /* cgroup event is attach to */
> > +
> > +     /* to share an event for multiple cgroups */
> > +     struct hlist_head               *cgrp_node_hash;
> > +     struct perf_cgroup_node         *cgrp_node_entries;
> > +     int                             nr_cgrp_nodes;
> > +     int                             cgrp_node_hash_bits;
> > +
> > +     struct list_head                cgrp_node_entry;
>
> Not related to perf_cgroup_node below, afaict the name is just plain
> wrong.

Right, it should be cgrp_event_entry or something, but we
have the notion of "cgroup event" for a different thing.
Maybe cgrp_node_event_entry or cgrp_vec_event_entry
(once we get the vector support)?

>
> > +
> > +     /* snapshot of previous reading (for perf_cgroup_node below) */
> > +     u64                             cgrp_node_count;
> > +     u64                             cgrp_node_time_enabled;
> > +     u64                             cgrp_node_time_running;
> >  #endif
> >
> >  #ifdef CONFIG_SECURITY
> > @@ -780,6 +793,13 @@ struct perf_event {
> >  #endif /* CONFIG_PERF_EVENTS */
> >  };
> >
> > +struct perf_cgroup_node {
> > +     struct hlist_node               node;
> > +     u64                             id;
> > +     u64                             count;
> > +     u64                             time_enabled;
> > +     u64                             time_running;
> > +} ____cacheline_aligned;
> >
> >  struct perf_event_groups {
> >       struct rb_root  tree;
> > @@ -843,6 +863,8 @@ struct perf_event_context {
> >       int                             pin_count;
> >  #ifdef CONFIG_CGROUP_PERF
> >       int                             nr_cgroups;      /* cgroup evts */
> > +     struct list_head                cgrp_node_list;
>
> AFAICT this is actually a list of events, not a list of cgroup_node
> thingies, hence the name is wrong.

Correct, will update.

>
> > +     struct list_head                cgrp_ctx_entry;
> >  #endif
> >       void                            *task_ctx_data; /* pmu specific data */
> >       struct rcu_head                 rcu_head;
> > diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> > index ad15e40d7f5d..06bc7ab13616 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -479,6 +479,8 @@ struct perf_event_query_bpf {
> >  #define PERF_EVENT_IOC_PAUSE_OUTPUT          _IOW('$', 9, __u32)
> >  #define PERF_EVENT_IOC_QUERY_BPF             _IOWR('$', 10, struct perf_event_query_bpf *)
> >  #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES     _IOW('$', 11, struct perf_event_attr *)
> > +#define PERF_EVENT_IOC_ATTACH_CGROUP         _IOW('$', 12, __u64 *)
> > +#define PERF_EVENT_IOC_READ_CGROUP           _IOWR('$', 13, __u64 *)
> >
> >  enum perf_event_ioc_flags {
> >       PERF_IOC_FLAG_GROUP             = 1U << 0,
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index f07943183041..bcf51c0b7855 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -379,6 +379,7 @@ enum event_type_t {
> >   * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
> >   */
> >
> > +static void perf_sched_enable(void);
> >  static void perf_sched_delayed(struct work_struct *work);
> >  DEFINE_STATIC_KEY_FALSE(perf_sched_events);
> >  static DECLARE_DELAYED_WORK(perf_sched_work, perf_sched_delayed);
> > @@ -2124,6 +2125,323 @@ static int perf_get_aux_event(struct perf_event *event,
> >       return 1;
> >  }
> >
> > +#ifdef CONFIG_CGROUP_PERF
> > +static DEFINE_PER_CPU(struct list_head, cgroup_ctx_list);
> > +
> > +static bool event_can_attach_cgroup(struct perf_event *event)
> > +{
> > +     if (is_sampling_event(event))
> > +             return false;
> > +     if (event->attach_state & PERF_ATTACH_TASK)
> > +             return false;
> > +     if (is_cgroup_event(event))
> > +             return false;
>
> Why? You could be doing a subtree.

Well.. I didn't want to make it more complicated.  And we can
do cgroup subtree monitoring with the new interface alone.

>
> > +
> > +     return true;
> > +}
> > +
> > +static bool event_has_cgroup_node(struct perf_event *event)
> > +{
> > +     return event->nr_cgrp_nodes > 0;
> > +}
> > +
> > +static struct perf_cgroup_node *
> > +find_cgroup_node(struct perf_event *event, u64 cgrp_id)
> > +{
> > +     struct perf_cgroup_node *cgrp_node;
> > +     int key = hash_64(cgrp_id, event->cgrp_node_hash_bits);
> > +
> > +     hlist_for_each_entry(cgrp_node, &event->cgrp_node_hash[key], node) {
> > +             if (cgrp_node->id == cgrp_id)
> > +                     return cgrp_node;
> > +     }
> > +
> > +     return NULL;
> > +}
> > +
> > +static void perf_update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> > +{
> > +     u64 delta_count, delta_time_enabled, delta_time_running;
> > +     int i;
> > +
> > +     if (event->cgrp_node_count == 0)
> > +             goto out;
> > +
> > +     delta_count = local64_read(&event->count) - event->cgrp_node_count;
> > +     delta_time_enabled = event->total_time_enabled - event->cgrp_node_time_enabled;
> > +     delta_time_running = event->total_time_running - event->cgrp_node_time_running;
> > +
> > +     /* account delta to all ancestor cgroups */
> > +     for (i = 0; i <= cgrp->level; i++) {
> > +             struct perf_cgroup_node *node;
> > +
> > +             node = find_cgroup_node(event, cgrp->ancestor_ids[i]);
> > +             if (node) {
> > +                     node->count += delta_count;
> > +                     node->time_enabled += delta_time_enabled;
> > +                     node->time_running += delta_time_running;
> > +             }
> > +     }
> > +
> > +out:
> > +     event->cgrp_node_count = local64_read(&event->count);
> > +     event->cgrp_node_time_enabled = event->total_time_enabled;
> > +     event->cgrp_node_time_running = event->total_time_running;
>
> This is wrong; there's no guarantee these are the same values you read
> at the begin, IOW you could be loosing events.

Could you please elaborate?

They are read first during perf_event_enable (perf_add_cgrp_node_list),
and will be updated in cgroup switches.  And it should be fine if the event
was scheduled in/out (due to multiplexing) in the middle as it collects the
enabled/running time as well.

>
> > +}
> > +
> > +static void update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> > +{
> > +     if (event->state == PERF_EVENT_STATE_ACTIVE)
> > +             event->pmu->read(event);
> > +
> > +     perf_event_update_time(event);
> > +     perf_update_cgroup_node(event, cgrp);
> > +}
> > +
> > +/* this is called from context switch */
> > +static void update_cgroup_node_events(struct perf_event_context *ctx,
> > +                                   struct cgroup *cgrp)
> > +{
> > +     struct perf_event *event;
> > +
> > +     lockdep_assert_held(&ctx->lock);
> > +
> > +     if (ctx->is_active & EVENT_TIME)
> > +             update_context_time(ctx);
> > +
> > +     list_for_each_entry(event, &ctx->cgrp_node_list, cgrp_node_entry)
> > +             update_cgroup_node(event, cgrp);
> > +}
> > +
> > +static void cgroup_node_sched_out(struct task_struct *task)
>
> Naming seems confused.

What about cgrp_node_event_sched_out?  (I know I'm terrible at
naming..)

>
> > +{
> > +     struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
> > +     struct perf_cgroup *cgrp = perf_cgroup_from_task(task, NULL);
> > +     struct perf_event_context *ctx;
> > +
> > +     list_for_each_entry(ctx, cgrp_ctx_list, cgrp_ctx_entry) {
> > +             raw_spin_lock(&ctx->lock);
> > +             update_cgroup_node_events(ctx, cgrp->css.cgroup);
> > +             raw_spin_unlock(&ctx->lock);
> > +     }
> > +}
> > +
> > +/* these are called from the when an event is enabled/disabled */
>
> That sentence needs help.

Ok.  When an event is enabled, it puts the event in the context list
and puts the context in the per-cpu list if it was not.  Also updates
snapshot of cgroup node count and timestamps to be used during
(cgroup) context switches.

When the event is disabled, it updates the current cgroup's node
(if exists) and removes the event (and possibly the context) from
the list.

>
> > +static void perf_add_cgrp_node_list(struct perf_event *event,
> > +                                 struct perf_event_context *ctx)
> > +{
> > +     struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
> > +     struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
> > +     bool is_first;
> > +
> > +     lockdep_assert_irqs_disabled();
> > +     lockdep_assert_held(&ctx->lock);
>
> The latter very much implies the former, no?

Right, will remove.

>
> > +
> > +     is_first = list_empty(&ctx->cgrp_node_list);
> > +     list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);
>
> See the naming being daft.

I see.

>
> > +
> > +     if (is_first)
> > +             list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);
>
> While here it actually makes sense.

:)

>
> > +
> > +     update_cgroup_node(event, cgrp->css.cgroup);
> > +}
> > +
> > +static void perf_del_cgrp_node_list(struct perf_event *event,
> > +                                 struct perf_event_context *ctx)
> > +{
> > +     struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
> > +
> > +     lockdep_assert_irqs_disabled();
> > +     lockdep_assert_held(&ctx->lock);
> > +
> > +     update_cgroup_node(event, cgrp->css.cgroup);
> > +     /* to refresh delta when it's enabled */
> > +     event->cgrp_node_count = 0;
> > +
> > +     list_del(&event->cgrp_node_entry);
> > +
> > +     if (list_empty(&ctx->cgrp_node_list))
> > +             list_del(&ctx->cgrp_ctx_entry);
> > +}
> > +
> > +static void perf_attach_cgroup_node(struct perf_event *event,
> > +                                 struct perf_cpu_context *cpuctx,
> > +                                 struct perf_event_context *ctx,
> > +                                 void *data)
> > +{
> > +     if (ctx->is_active & EVENT_TIME)
> > +             update_context_time(ctx);
> > +
> > +     perf_add_cgrp_node_list(event, ctx);
> > +}
> > +
> > +#define MIN_CGRP_NODE_HASH  4
> > +#define MAX_CGRP_NODE_HASH  (4 * 1024)
>
> So today you think 200 cgroups is sane, tomorrow you'll complain 4k
> cgroups is not enough.

Maybe.. I just wanted to prevent too large malicious allocations but
cannot determine what's the proper upper bound.  Hmm.. we can
add a sysctl knob for this if you want.

>
> > +
> > +/* this is called from ioctl */
> > +static int perf_event_attach_cgroup_node(struct perf_event *event, u64 nr_cgrps,
> > +                                      u64 *cgroup_ids)
> > +{
> > +     struct perf_cgroup_node *cgrp_node;
> > +     struct perf_event_context *ctx = event->ctx;
> > +     struct hlist_head *cgrp_node_hash;
> > +     int node = (event->cpu >= 0) ? cpu_to_node(event->cpu) : -1;
>
> How many more copies of that do we need?

Will refactor this bit into a static inline function.

>
> > +     unsigned long flags;
> > +     bool is_first = true;
> > +     bool enabled;
> > +     int i, nr_hash;
> > +     int hash_bits;
> > +
> > +     if (nr_cgrps < MIN_CGRP_NODE_HASH)
> > +             nr_hash = MIN_CGRP_NODE_HASH;
> > +     else
> > +             nr_hash = roundup_pow_of_two(nr_cgrps);
> > +     hash_bits = ilog2(nr_hash);
>
> That's like the complicated version of:
>
>         hash_bits = 1 + ilog2(max(MIN_CGRP_NODE_HASH, nr_cgrps) - 1);
>
> ?

Great, will update.

>
> > +
> > +     cgrp_node_hash = kcalloc_node(nr_hash, sizeof(*cgrp_node_hash),
> > +                                   GFP_KERNEL, node);
> > +     if (cgrp_node_hash == NULL)
> > +             return -ENOMEM;
> > +
> > +     cgrp_node = kcalloc_node(nr_cgrps, sizeof(*cgrp_node), GFP_KERNEL, node);
> > +     if (cgrp_node == NULL) {
> > +             kfree(cgrp_node_hash);
> > +             return -ENOMEM;
> > +     }
> > +
> > +     for (i = 0; i < (int)nr_cgrps; i++) {
> > +             int key = hash_64(cgroup_ids[i], hash_bits);
> > +
> > +             cgrp_node[i].id = cgroup_ids[i];
> > +             hlist_add_head(&cgrp_node[i].node, &cgrp_node_hash[key]);
> > +     }
> > +
> > +     raw_spin_lock_irqsave(&ctx->lock, flags);
> > +
> > +     enabled = event->state >= PERF_EVENT_STATE_INACTIVE;
> > +
> > +     if (event->nr_cgrp_nodes != 0) {
> > +             kfree(event->cgrp_node_hash);
> > +             kfree(event->cgrp_node_entries);
> > +             is_first = false;
> > +     }
>
> So if we already had cgroups attached, we just plunk whatever state we
> had, without re-hashing? That's hardly sane semantics for something
> called 'attach'.

I think that's what Song wanted and we can make it work that way.
Basically I wanted to avoid many small allocations for nodes.

>
> And if you want this behaviour, then you should probably also accept
> nr_cgrps==0, but you don't do that either.

OK, probably I can go with the add and re-hash approach then.

>
> > +
> > +     event->cgrp_node_hash = cgrp_node_hash;
> > +     event->cgrp_node_entries = cgrp_node;
> > +     event->cgrp_node_hash_bits = hash_bits;
> > +     event->nr_cgrp_nodes = nr_cgrps;
> > +
> > +     raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > +
> > +     if (is_first && enabled)
> > +             event_function_call(event, perf_attach_cgroup_node, NULL);
> > +
> > +     return 0;
> > +}
> > +
> > +static void perf_event_destroy_cgroup_nodes(struct perf_event *event)
> > +{
> > +     struct perf_event_context *ctx = event->ctx;
> > +     unsigned long flags;
> > +
> > +     raw_spin_lock_irqsave(&ctx->lock, flags);
> > +
> > +     if (event_has_cgroup_node(event)) {
> > +             if (!atomic_add_unless(&perf_sched_count, -1, 1))
> > +                     schedule_delayed_work(&perf_sched_work, HZ);
> > +     }
>
> Below you extract perf_sched_enable(), so this is somewhat inconsistent
> for not being perf_sched_disable() I'm thinking.

I can add that too.

>
> Also, the placement seems weird, do you really want this under
> ctx->lock?

It's because it needs to check the presence of the cgroup nodes
under the lock.

>
> > +
> > +     kfree(event->cgrp_node_hash);
> > +     kfree(event->cgrp_node_entries);
> > +     event->nr_cgrp_nodes = 0;
> > +
> > +     raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > +}
> > +
> > +static int perf_event_read(struct perf_event *event, bool group);
> > +
> > +static void __perf_read_cgroup_node(struct perf_event *event)
> > +{
> > +     struct perf_cgroup *cgrp;
> > +
> > +     if (event_has_cgroup_node(event)) {
> > +             cgrp = perf_cgroup_from_task(current, NULL);
> > +             perf_update_cgroup_node(event, cgrp->css.cgroup);
> > +     }
> > +}
> > +
> > +static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
> > +                                    u64 cgrp_id, char __user *buf)
> > +{
> > +     struct perf_cgroup_node *cgrp;
> > +     struct perf_event_context *ctx = event->ctx;
> > +     unsigned long flags;
> > +     u64 read_format = event->attr.read_format;
> > +     u64 values[4];
> > +     int n = 0;
> > +
> > +     /* update event count and times (possibly run on other cpu) */
> > +     (void)perf_event_read(event, false);
> > +
> > +     raw_spin_lock_irqsave(&ctx->lock, flags);
> > +
> > +     cgrp = find_cgroup_node(event, cgrp_id);
> > +     if (cgrp == NULL) {
> > +             raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > +             return -ENOENT;
> > +     }
> > +
> > +     values[n++] = cgrp->count;
> > +     if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
> > +             values[n++] = cgrp->time_enabled;
> > +     if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> > +             values[n++] = cgrp->time_running;
> > +     if (read_format & PERF_FORMAT_ID)
> > +             values[n++] = primary_event_id(event);
> > +
> > +     raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > +
> > +     if (copy_to_user(buf, values, n * sizeof(u64)))
> > +             return -EFAULT;
> > +
> > +     return n * sizeof(u64);
> > +}
> > +#else  /* !CONFIG_CGROUP_PERF */
> > +static inline bool event_can_attach_cgroup(struct perf_event *event)
> > +{
> > +     return false;
> > +}
> > +
> > +static inline bool event_has_cgroup_node(struct perf_event *event)
> > +{
> > +     return false;
> > +}
> > +
> > +static inline void cgroup_node_sched_out(struct task_struct *task) {}
> > +
> > +static inline void perf_add_cgrp_node_list(struct perf_event *event,
> > +                                        struct perf_event_context *ctx) {}
> > +static inline void perf_del_cgrp_node_list(struct perf_event *event,
> > +                                        struct perf_event_context *ctx) {}
> > +
> > +#define MAX_CGRP_NODE_HASH  1
> > +static inline int perf_event_attach_cgroup_node(struct perf_event *event,
> > +                                             u64 nr_cgrps, u64 *cgrp_ids)
> > +{
> > +     return -ENODEV;
> > +}
> > +
> > +static inline void perf_event_destroy_cgroup_nodes(struct perf_event *event) {}
> > +static inline  void __perf_read_cgroup_node(struct perf_event *event) {}
> > +
> > +static inline int perf_event_read_cgroup_node(struct perf_event *event,
> > +                                           u64 read_size, u64 cgrp_id,
> > +                                           char __user *buf)
> > +{
> > +     return -EINVAL;
> > +}
> > +#endif  /* CONFIG_CGROUP_PERF */
> > +
> >  static inline struct list_head *get_event_list(struct perf_event *event)
> >  {
> >       struct perf_event_context *ctx = event->ctx;
> > @@ -2407,6 +2725,7 @@ static void __perf_event_disable(struct perf_event *event,
> >
> >       perf_event_set_state(event, PERF_EVENT_STATE_OFF);
> >       perf_cgroup_event_disable(event, ctx);
> > +     perf_del_cgrp_node_list(event, ctx);
> >  }
> >
> >  /*
> > @@ -2946,6 +3265,7 @@ static void __perf_event_enable(struct perf_event *event,
> >
> >       perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
> >       perf_cgroup_event_enable(event, ctx);
> > +     perf_add_cgrp_node_list(event, ctx);
> >
> >       if (!ctx->is_active)
> >               return;
> > @@ -3568,6 +3888,13 @@ void __perf_event_task_sched_out(struct task_struct *task,
> >        */
> >       if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
> >               perf_cgroup_sched_out(task, next);
> > +
> > +#ifdef CONFIG_CGROUP_PERF
> > +     if (!list_empty(this_cpu_ptr(&cgroup_ctx_list)) &&
> > +         perf_cgroup_from_task(task, NULL) !=
> > +         perf_cgroup_from_task(next, NULL))
> > +             cgroup_node_sched_out(task);
> > +#endif
>
> Please, fold this into that one cgroup branch you already have here.
> Don't pullute things further.

Will do.

>
> >  }
> >
> >  /*
> > @@ -4268,6 +4595,7 @@ static void __perf_event_read(void *info)
> >
> >       if (!data->group) {
> >               pmu->read(event);
> > +             __perf_read_cgroup_node(event);
> >               data->ret = 0;
> >               goto unlock;
> >       }
> > @@ -4283,6 +4611,7 @@ static void __perf_event_read(void *info)
> >                        * sibling could be on different (eg: software) PMU.
> >                        */
> >                       sub->pmu->read(sub);
> > +                     __perf_read_cgroup_node(sub);
> >               }
> >       }
> >
>
> Why though; nothing here looks at the new cgroup state.
>
> > @@ -4462,6 +4791,10 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
> >       INIT_LIST_HEAD(&ctx->pinned_active);
> >       INIT_LIST_HEAD(&ctx->flexible_active);
> >       refcount_set(&ctx->refcount, 1);
> > +#ifdef CONFIG_CGROUP_PERF
> > +     INIT_LIST_HEAD(&ctx->cgrp_ctx_entry);
> > +     INIT_LIST_HEAD(&ctx->cgrp_node_list);
> > +#endif
> >  }
> >
> >  static struct perf_event_context *
> > @@ -4851,6 +5184,8 @@ static void _free_event(struct perf_event *event)
> >       if (is_cgroup_event(event))
> >               perf_detach_cgroup(event);
> >
> > +     perf_event_destroy_cgroup_nodes(event);
> > +
> >       if (!event->parent) {
> >               if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
> >                       put_callchain_buffers();
> > @@ -5571,6 +5906,58 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
> >
> >               return perf_event_modify_attr(event,  &new_attr);
> >       }
> > +
> > +     case PERF_EVENT_IOC_ATTACH_CGROUP: {
> > +             u64 nr_cgrps;
> > +             u64 *cgrp_buf;
> > +             size_t cgrp_bufsz;
> > +             int ret;
> > +
> > +             if (!event_can_attach_cgroup(event))
> > +                     return -EINVAL;
> > +
> > +             if (copy_from_user(&nr_cgrps, (u64 __user *)arg,
> > +                                sizeof(nr_cgrps)))
> > +                     return -EFAULT;
> > +
> > +             if (nr_cgrps == 0 || nr_cgrps > MAX_CGRP_NODE_HASH)
> > +                     return -EINVAL;
> > +
> > +             cgrp_bufsz = nr_cgrps * sizeof(*cgrp_buf);
> > +
> > +             cgrp_buf = kmalloc(cgrp_bufsz, GFP_KERNEL);
> > +             if (cgrp_buf == NULL)
> > +                     return -ENOMEM;
> > +
> > +             if (copy_from_user(cgrp_buf, (u64 __user *)(arg + 8),
> > +                                cgrp_bufsz)) {
> > +                     kfree(cgrp_buf);
> > +                     return -EFAULT;
> > +             }
> > +
> > +             ret = perf_event_attach_cgroup_node(event, nr_cgrps, cgrp_buf);
> > +
> > +             kfree(cgrp_buf);
> > +             return ret;
> > +     }
> > +
> > +     case PERF_EVENT_IOC_READ_CGROUP: {
> > +             u64 read_size, cgrp_id;
> > +
> > +             if (!event_can_attach_cgroup(event))
> > +                     return -EINVAL;
> > +
> > +             if (copy_from_user(&read_size, (u64 __user *)arg,
> > +                                sizeof(read_size)))
> > +                     return -EFAULT;
> > +             if (copy_from_user(&cgrp_id, (u64 __user *)(arg + 8),
> > +                                sizeof(cgrp_id)))
> > +                     return -EFAULT;
> > +
> > +             return perf_event_read_cgroup_node(event, read_size, cgrp_id,
> > +                                                (char __user *)(arg + 16));
> > +     }
> > +
> >       default:
> >               return -ENOTTY;
> >       }
> > @@ -5583,10 +5970,39 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
> >       return 0;
> >  }
> >
> > +static void perf_sched_enable(void)
> > +{
> > +     /*
> > +      * We need the mutex here because static_branch_enable()
> > +      * must complete *before* the perf_sched_count increment
> > +      * becomes visible.
> > +      */
> > +     if (atomic_inc_not_zero(&perf_sched_count))
> > +             return;
> > +
> > +     mutex_lock(&perf_sched_mutex);
> > +     if (!atomic_read(&perf_sched_count)) {
> > +             static_branch_enable(&perf_sched_events);
> > +             /*
> > +              * Guarantee that all CPUs observe they key change and
> > +              * call the perf scheduling hooks before proceeding to
> > +              * install events that need them.
> > +              */
> > +             synchronize_rcu();
> > +     }
> > +     /*
> > +      * Now that we have waited for the sync_sched(), allow further
> > +      * increments to by-pass the mutex.
> > +      */
> > +     atomic_inc(&perf_sched_count);
> > +     mutex_unlock(&perf_sched_mutex);
> > +}
>
> Per the above, this is missing perf_sched_disable(). Also, this should
> probably be a separate patch then.

Got it.

>
> > +
> >  static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> >  {
> >       struct perf_event *event = file->private_data;
> >       struct perf_event_context *ctx;
> > +     bool do_sched_enable = false;
> >       long ret;
> >
> >       /* Treat ioctl like writes as it is likely a mutating operation. */
> > @@ -5595,9 +6011,19 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> >               return ret;
> >
> >       ctx = perf_event_ctx_lock(event);
> > +     /* ATTACH_CGROUP requires context switch callback */
> > +     if (cmd == PERF_EVENT_IOC_ATTACH_CGROUP && !event_has_cgroup_node(event))
> > +             do_sched_enable = true;
> >       ret = _perf_ioctl(event, cmd, arg);
> >       perf_event_ctx_unlock(event, ctx);
> >
> > +     /*
> > +      * Due to the circular lock dependency, it cannot call
> > +      * static_branch_enable() under the ctx->mutex.
> > +      */
> > +     if (do_sched_enable && ret >= 0)
> > +             perf_sched_enable();
> > +
> >       return ret;
> >  }
>
> Hurmph... not much choice there I suppose.

Yeah, I couldn't find a better way.

Thanks a lot for your detailed review!
Namhyung

>
> > @@ -11240,33 +11666,8 @@ static void account_event(struct perf_event *event)
> >       if (event->attr.text_poke)
> >               atomic_inc(&nr_text_poke_events);
> >
> > -     if (inc) {
> > -             /*
> > -              * We need the mutex here because static_branch_enable()
> > -              * must complete *before* the perf_sched_count increment
> > -              * becomes visible.
> > -              */
> > -             if (atomic_inc_not_zero(&perf_sched_count))
> > -                     goto enabled;
> > -
> > -             mutex_lock(&perf_sched_mutex);
> > -             if (!atomic_read(&perf_sched_count)) {
> > -                     static_branch_enable(&perf_sched_events);
> > -                     /*
> > -                      * Guarantee that all CPUs observe they key change and
> > -                      * call the perf scheduling hooks before proceeding to
> > -                      * install events that need them.
> > -                      */
> > -                     synchronize_rcu();
> > -             }
> > -             /*
> > -              * Now that we have waited for the sync_sched(), allow further
> > -              * increments to by-pass the mutex.
> > -              */
> > -             atomic_inc(&perf_sched_count);
> > -             mutex_unlock(&perf_sched_mutex);
> > -     }
> > -enabled:
> > +     if (inc)
> > +             perf_sched_enable();
> >
> >       account_event_cpu(event, event->cpu);
> >
> > @@ -13008,6 +13409,7 @@ static void __init perf_event_init_all_cpus(void)
> >
> >  #ifdef CONFIG_CGROUP_PERF
> >               INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
> > +             INIT_LIST_HEAD(&per_cpu(cgroup_ctx_list, cpu));
> >  #endif
> >               INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
> >       }
> > @@ -13218,6 +13620,29 @@ static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
> >       return 0;
> >  }
> >
> > +static int __perf_cgroup_update_node(void *info)
> > +{
> > +     struct task_struct *task = info;
> > +
> > +     rcu_read_lock();
> > +     cgroup_node_sched_out(task);
> > +     rcu_read_unlock();
> > +
> > +     return 0;
> > +}
> > +
> > +/* update cgroup counter BEFORE task's cgroup is changed */
> > +static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
> > +{
> > +     struct task_struct *task;
> > +     struct cgroup_subsys_state *css;
> > +
> > +     cgroup_taskset_for_each(task, css, tset)
> > +             task_function_call(task, __perf_cgroup_update_node, task);
> > +
> > +     return 0;
> > +}
> > +
> >  static int __perf_cgroup_move(void *info)
> >  {
> >       struct task_struct *task = info;
> > @@ -13240,6 +13665,7 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
> >       .css_alloc      = perf_cgroup_css_alloc,
> >       .css_free       = perf_cgroup_css_free,
> >       .css_online     = perf_cgroup_css_online,
> > +     .can_attach     = perf_cgroup_can_attach,
> >       .attach         = perf_cgroup_attach,
> >       /*
> >        * Implicitly enable on dfl hierarchy so that perf events can
Namhyung Kim April 16, 2021, 10:18 a.m. UTC | #6
On Fri, Apr 16, 2021 at 6:27 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Apr 16, 2021 at 08:48:12AM +0900, Namhyung Kim wrote:
> > On Thu, Apr 15, 2021 at 11:51 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Tue, Apr 13, 2021 at 08:53:36AM -0700, Namhyung Kim wrote:
>
> > > > cgroup event counting (i.e. perf stat).
> > > >
> > > >  * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> > > >      64-bit array to attach given cgroups.  The first element is a
> > > >      number of cgroups in the buffer, and the rest is a list of cgroup
> > > >      ids to add a cgroup info to the given event.
> > >
> > > WTH is a cgroup-id? The syscall takes a fd to the path, why have two
> > > different ways?
> >
> > As you know, we already use cgroup-id for sampling.  Yeah we
> > can do it with the fd but one of the point in this patch is to reduce
> > the number of file descriptors. :)
>
> Well, I found those patches again after I wrote that. But I'm still not
> sure what a cgroup-id is from userspace.

It's a file handle that can be get from name_to_handle_at(2).

>
> How does userspace get one given a cgroup? (I actually mounted cgroupfs
> in order to see if there's some new 'id' file to read, there is not)
> Does having the cgroup-id ensure the cgroup exists? Can the cgroup-id
> get re-used?

It doesn't guarantee the existence of the cgroup as far as I know.
The cgroup can go away anytime.  Actually it doesn't matter for
this interface as users will get 0 result for them.  So I didn't check
the validity of the cgroup-id in the code.

And I don't think the cgroup-id is reused without reboot at least
for 64 bit systems.  It's came from a 64 bit integer increased
when a new cgroup is created.  Tejun?

>
> I really don't konw what the thing is. I don't use cgroups, like ever,
> except when I'm forced to due to some regression or bugreport.

I hope I made it clear.

>
> > Also, having cgroup-id is good to match with the result (from read)
> > as it contains the cgroup information.
>
> What?

I mean we need to match the result to a cgroup.  Either by passing
cgroup-id through ioctl or add the info in the read format.

>
> > > >  * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> > > >      array to get the event counter values.  The first element is size
> > > >      of the array in byte, and the second element is a cgroup id to
> > > >      read.  The rest is to save the counter value and timings.
> > >
> > > :-(
> > >
> > > So basically you're doing a whole seconds cgroup interface, one that
> > > violates the one counter per file premise and lives off of ioctl()s.
> >
> > Right, but I'm not sure that we really want a separate event for each
> > cgroup if underlying hardware events are all the same.
>
> Sure, I see where you're coming from; I just don't much like where it
> got you :-)

Ok, let's make it better. :-)

>
> > > *IF* we're going to do something like this, I feel we should explore the
> > > whole vector-per-fd concept before proceeding. Can we make it less yuck
> > > (less special ioctl() and more regular file ops. Can we apply the
> > > concept to more things?
> >
> > Ideally it'd do without keeping file descriptors open.  Maybe we can make
> > the vector accept various types like vector-per-cgroup_id or so.
>
> So I think we've had proposals for being able to close fds in the past;
> while preserving groups etc. We've always pushed back on that because of
> the resource limit issue. By having each counter be a filedesc we get a
> natural limit on the amount of resources you can consume. And in that
> respect, having to use 400k fds is things working as designed.
>
> Anyway, there might be a way around this..

It's not just a file descriptor problem.  By having each counter per cgroup
it should pay the price on multiplexing or event scheduling.  That caused
serious performance problems in production environment so we had
to limit the number of cgroups monitored at a time.

>
> > > The second patch extends the ioctl() to be more read() like, instead of
> > > doing the sane things and extending read() by adding PERF_FORMAT_VECTOR
> > > or whatever. In fact, this whole second ioctl() doesn't make sense to
> > > have if we do indeed want to do vector-per-fd.
> >
> > One of the upside of the ioctl() is that we can pass cgroup-id to read.
> > Probably we can keep the index in the vector and set the file offset
> > with it.  Or else just read the whole vector, and then it has a cgroup-id
> > in the output like PERF_FORMAT_CGROUP?
> >
> > >
> > > Also, I suppose you can already fake this, by having a
> > > SW_CGROUP_SWITCHES (sorry, I though I picked those up, done now) event
> >
> > Thanks!
> >
> > > with PERF_SAMPLE_READ|PERF_SAMPLE_CGROUP and PERF_FORMAT_GROUP in a
> > > group with a bunch of events. Then the buffer will fill with the values
> > > you use here.
> >
> > Right, I'll do an experiment with it.
> >
> > >
> > > Yes, I suppose it has higher overhead, but you get the data you want
> > > without having to do terrible things like this.
> >
> > That's true.  And we don't need many things in the perf record like
> > synthesizing task/mmap info.  Also there's a risk we can miss some
> > samples for some reason.
> >
> > Another concern is that it'd add huge slow down in the perf event
> > open as it creates a mixed sw/hw group.  The synchronized_rcu in
> > the move_cgroup path caused significant problems in my
> > environment as it adds up in proportion to the number of cpus.
>
> Since when is perf_event_open() a performance concern? That thing is
> slow in all possible ways.

I found that perf record is stuck in the perf_event_open for several
*seconds*.  And the worst part is that it'll prevent other users from
moving forward as it holds the ctx->mutex.  So I got some complaints
from random users for their mysterious slow down in the perf (like
in open, enable, disable and close).

Thanks,
Namhyung
Namhyung Kim April 16, 2021, 10:19 a.m. UTC | #7
On Fri, Apr 16, 2021 at 6:29 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> Duh.. this is a half-finished email I meant to save for later. Anyway,
> I'll reply more.

Nevermind, and thanks for your time! :-)

Namhyung
Peter Zijlstra April 16, 2021, 10:27 a.m. UTC | #8
On Fri, Apr 16, 2021 at 11:29:30AM +0200, Peter Zijlstra wrote:

> > So I think we've had proposals for being able to close fds in the past;
> > while preserving groups etc. We've always pushed back on that because of
> > the resource limit issue. By having each counter be a filedesc we get a
> > natural limit on the amount of resources you can consume. And in that
> > respect, having to use 400k fds is things working as designed.
> > 
> > Anyway, there might be a way around this..

So how about we flip the whole thing sideways, instead of doing one
event for multiple cgroups, do an event for multiple-cpus.

Basically, allow:

	perf_event_open(.pid=fd, cpu=-1, .flag=PID_CGROUP);

Which would have the kernel create nr_cpus events [the corrolary is that
we'd probably also allow: (.pid=-1, cpu=-1) ].

Output could be done by adding FORMAT_PERCPU, which takes the current
read() format and writes a copy for each CPU event. (p)read(v)() could
be used to explode or partial read that.

This gets rid of the nasty variadic nature of the
'get-me-these-n-cgroups'. While still getting rid of the n*m fd issue
you're facing.
Namhyung Kim April 16, 2021, 11:22 a.m. UTC | #9
On Fri, Apr 16, 2021 at 7:28 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Apr 16, 2021 at 11:29:30AM +0200, Peter Zijlstra wrote:
>
> > > So I think we've had proposals for being able to close fds in the past;
> > > while preserving groups etc. We've always pushed back on that because of
> > > the resource limit issue. By having each counter be a filedesc we get a
> > > natural limit on the amount of resources you can consume. And in that
> > > respect, having to use 400k fds is things working as designed.
> > >
> > > Anyway, there might be a way around this..
>
> So how about we flip the whole thing sideways, instead of doing one
> event for multiple cgroups, do an event for multiple-cpus.
>
> Basically, allow:
>
>         perf_event_open(.pid=fd, cpu=-1, .flag=PID_CGROUP);
>
> Which would have the kernel create nr_cpus events [the corrolary is that
> we'd probably also allow: (.pid=-1, cpu=-1) ].

Do you mean it'd have separate perf_events per cpu internally?
From a cpu's perspective, there's nothing changed, right?
Then it will have the same performance problem as of now.

>
> Output could be done by adding FORMAT_PERCPU, which takes the current
> read() format and writes a copy for each CPU event. (p)read(v)() could
> be used to explode or partial read that.

Yeah, I think it's good for read.  But what about mmap?
I don't think we can use file offset since it's taken for auxtrace.
Maybe we can simply disallow that..

>
> This gets rid of the nasty variadic nature of the
> 'get-me-these-n-cgroups'. While still getting rid of the n*m fd issue
> you're facing.

As I said, it's not just a file descriptor problem.  In fact, performance
is more concerning.

Thanks,
Namhyung
Peter Zijlstra April 16, 2021, 11:59 a.m. UTC | #10
On Fri, Apr 16, 2021 at 08:22:38PM +0900, Namhyung Kim wrote:
> On Fri, Apr 16, 2021 at 7:28 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Apr 16, 2021 at 11:29:30AM +0200, Peter Zijlstra wrote:
> >
> > > > So I think we've had proposals for being able to close fds in the past;
> > > > while preserving groups etc. We've always pushed back on that because of
> > > > the resource limit issue. By having each counter be a filedesc we get a
> > > > natural limit on the amount of resources you can consume. And in that
> > > > respect, having to use 400k fds is things working as designed.
> > > >
> > > > Anyway, there might be a way around this..
> >
> > So how about we flip the whole thing sideways, instead of doing one
> > event for multiple cgroups, do an event for multiple-cpus.
> >
> > Basically, allow:
> >
> >         perf_event_open(.pid=fd, cpu=-1, .flag=PID_CGROUP);
> >
> > Which would have the kernel create nr_cpus events [the corrolary is that
> > we'd probably also allow: (.pid=-1, cpu=-1) ].
> 
> Do you mean it'd have separate perf_events per cpu internally?
> From a cpu's perspective, there's nothing changed, right?
> Then it will have the same performance problem as of now.

Yes, but we'll not end up in ioctl() hell. The interface is sooo much
better. The performance thing just means we need to think harder.

I thought cgroup scheduling got a lot better with the work Ian did a
while back? What's the actual bottleneck now?

> > Output could be done by adding FORMAT_PERCPU, which takes the current
> > read() format and writes a copy for each CPU event. (p)read(v)() could
> > be used to explode or partial read that.
> 
> Yeah, I think it's good for read.  But what about mmap?
> I don't think we can use file offset since it's taken for auxtrace.
> Maybe we can simply disallow that..

Are you actually using mmap() to read? I had a proposal for FORMAT_GROUP
like thing for mmap(), but I never implemented that (didn't get the
enthousiatic response I thought it would). But yeah, there's nowhere
near enough space in there for PERCPU.

Not sure how to do that, these counters must not be sampling counters
because we can't be sharing a buffer from multiple CPUs, so data/aux
just isn't a concern. But it's weird to have them magically behave
differently.
Namhyung Kim April 16, 2021, 12:19 p.m. UTC | #11
On Fri, Apr 16, 2021 at 8:59 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Apr 16, 2021 at 08:22:38PM +0900, Namhyung Kim wrote:
> > On Fri, Apr 16, 2021 at 7:28 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Apr 16, 2021 at 11:29:30AM +0200, Peter Zijlstra wrote:
> > >
> > > > > So I think we've had proposals for being able to close fds in the past;
> > > > > while preserving groups etc. We've always pushed back on that because of
> > > > > the resource limit issue. By having each counter be a filedesc we get a
> > > > > natural limit on the amount of resources you can consume. And in that
> > > > > respect, having to use 400k fds is things working as designed.
> > > > >
> > > > > Anyway, there might be a way around this..
> > >
> > > So how about we flip the whole thing sideways, instead of doing one
> > > event for multiple cgroups, do an event for multiple-cpus.
> > >
> > > Basically, allow:
> > >
> > >         perf_event_open(.pid=fd, cpu=-1, .flag=PID_CGROUP);
> > >
> > > Which would have the kernel create nr_cpus events [the corrolary is that
> > > we'd probably also allow: (.pid=-1, cpu=-1) ].
> >
> > Do you mean it'd have separate perf_events per cpu internally?
> > From a cpu's perspective, there's nothing changed, right?
> > Then it will have the same performance problem as of now.
>
> Yes, but we'll not end up in ioctl() hell. The interface is sooo much
> better. The performance thing just means we need to think harder.
>
> I thought cgroup scheduling got a lot better with the work Ian did a
> while back? What's the actual bottleneck now?

Yep, that's true but it still comes with a high cost of multiplexing in
context (cgroup) switch.  It's inefficient that it programs the PMU
with exactly the same config just for a different cgroup.  You know
accessing the MSRs is no cheap operation.

>
> > > Output could be done by adding FORMAT_PERCPU, which takes the current
> > > read() format and writes a copy for each CPU event. (p)read(v)() could
> > > be used to explode or partial read that.
> >
> > Yeah, I think it's good for read.  But what about mmap?
> > I don't think we can use file offset since it's taken for auxtrace.
> > Maybe we can simply disallow that..
>
> Are you actually using mmap() to read? I had a proposal for FORMAT_GROUP
> like thing for mmap(), but I never implemented that (didn't get the
> enthousiatic response I thought it would). But yeah, there's nowhere
> near enough space in there for PERCPU.

Recently there's a patch to do it with rdpmc which needs to mmap first.

https://lore.kernel.org/lkml/20210414155412.3697605-1-robh@kernel.org/

>
> Not sure how to do that, these counters must not be sampling counters
> because we can't be sharing a buffer from multiple CPUs, so data/aux
> just isn't a concern. But it's weird to have them magically behave
> differently.

Yeah it's weird, and we should limit the sampling use case.

Thanks,
Namhyung
Peter Zijlstra April 16, 2021, 1:39 p.m. UTC | #12
On Fri, Apr 16, 2021 at 09:19:08PM +0900, Namhyung Kim wrote:
> > Are you actually using mmap() to read? I had a proposal for FORMAT_GROUP
> > like thing for mmap(), but I never implemented that (didn't get the
> > enthousiatic response I thought it would). But yeah, there's nowhere
> > near enough space in there for PERCPU.
> 
> Recently there's a patch to do it with rdpmc which needs to mmap first.
> 
> https://lore.kernel.org/lkml/20210414155412.3697605-1-robh@kernel.org/

Yeah, I'm not sure about that, I've not looked at it recently. The thing
is though; for RDPMC to work, you *NEED* to be on the same CPU as the
counter.

The typical RDPMC use-case is self monitoring.
Stephane Eranian April 20, 2021, 8:34 a.m. UTC | #13
Hi Peter,

On Thu, Apr 15, 2021 at 7:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Apr 13, 2021 at 08:53:36AM -0700, Namhyung Kim wrote:
> > As we can run many jobs (in container) on a big machine, we want to
> > measure each job's performance during the run.  To do that, the
> > perf_event can be associated to a cgroup to measure it only.
> >
> > However such cgroup events need to be opened separately and it causes
> > significant overhead in event multiplexing during the context switch
> > as well as resource consumption like in file descriptors and memory
> > footprint.
> >
> > As a cgroup event is basically a cpu event, we can share a single cpu
> > event for multiple cgroups.  All we need is a separate counter (and
> > two timing variables) for each cgroup.  I added a hash table to map
> > from cgroup id to the attached cgroups.
> >
> > With this change, the cpu event needs to calculate a delta of event
> > counter values when the cgroups of current and the next task are
> > different.  And it attributes the delta to the current task's cgroup.
> >
> > This patch adds two new ioctl commands to perf_event for light-weight
>
> git grep "This patch" Documentation/
>
> > cgroup event counting (i.e. perf stat).
> >
> >  * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> >      64-bit array to attach given cgroups.  The first element is a
> >      number of cgroups in the buffer, and the rest is a list of cgroup
> >      ids to add a cgroup info to the given event.
>
> WTH is a cgroup-id? The syscall takes a fd to the path, why have two
> different ways?
>
> >  * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> >      array to get the event counter values.  The first element is size
> >      of the array in byte, and the second element is a cgroup id to
> >      read.  The rest is to save the counter value and timings.
>
> :-(
>
> So basically you're doing a whole seconds cgroup interface, one that
> violates the one counter per file premise and lives off of ioctl()s.
>
> *IF* we're going to do something like this, I feel we should explore the
> whole vector-per-fd concept before proceeding. Can we make it less yuck
> (less special ioctl() and more regular file ops. Can we apply the
> concept to more things?
>
> The second patch extends the ioctl() to be more read() like, instead of
> doing the sane things and extending read() by adding PERF_FORMAT_VECTOR
> or whatever. In fact, this whole second ioctl() doesn't make sense to
> have if we do indeed want to do vector-per-fd.
>
> Also, I suppose you can already fake this, by having a
> SW_CGROUP_SWITCHES (sorry, I though I picked those up, done now) event
> with PERF_SAMPLE_READ|PERF_SAMPLE_CGROUP and PERF_FORMAT_GROUP in a
> group with a bunch of events. Then the buffer will fill with the values
> you use here.
>
> Yes, I suppose it has higher overhead, but you get the data you want
> without having to do terrible things like this.
>
The sampling approach will certainly incur more overhead and be at
risk of losing the ability to
reconstruct the total counter per-cgroup, unless  you set the period
for SW_CGROUP_SWITCHES to
1. But then, you run the risk of losing samples if the buffer is full
or sampling is throtlled.
In some scenarios, we believe the number of context switches between
cgroup could be quite high (>> 1000/s).
And on top you would have to add the processing of the samples to
extract the counts per cgroup. That would require
a synthesis on cgroup on perf record and some post-processing on perf
report. We are interested in using the data live
to make some policy decisions, so a counting approach with perf stat
will always be best.

The fundamental problem Namhyung is trying to solve is the following:

num_fds = num_cpus x num_events x num_cgroups

On an 256-CPU AMD server running 200 cgroups with 6 events/cgroup (as
an example):

num_fds = 256 x 200 x 6 = 307,200 fds (with all the kernel memory
associated with them).
On each CPU, that implies: 200 x 6 = 1200 events to schedule and 6 to
find on each cgroup switch

This does not scale for us:
   - run against the fd limit, but also memory consumption in the
kernel per struct file, struct inode, struct perf_event ....
   - number of events per-cpu is still also large
   - require event scheduling on cgroup switches, even with RB-tree
improvements, still heavy
   - require event scheduling even if measuring the same events across
all cgroups

One factor in that equation above needs to disappear. The one counter
per file descriptor is respected with
Nahmyung's patch because he is operating a plain per-cpu mode. What
changes is just how and where the
count is accumulated in perf_events. The resulting programming on the
hardware is the same as before.

What is needed is a way to accumulate counts per-cgroup without
incurring all this overhead. That will
inevitably introduce another way of specifying cgroups. The current
mode offers maximum flexibility.
You can specify any event per-cgroup. Cgroup events are programmed
independently of each other. The approach
proposed by Namhyung still allows for that, but provides an
optimization for the common case where all events are the
same across all cgroups because in that case you only create: num_fds
= num_cpus x num_events. The advantage is that
you eliminate a lot of the overhead listed above. In particular the
fds and the context switch scheduling, now you only have
to compute a delta and store in a hash table.

As you point out, the difficulty is how to express the cgroups of
interest and how to read the counts back.
I agree that the ioctl() is not ideal for the latter. For the former,
if you do not want ioctl() then you would have to overload
perf_event_open() with a vector of cgroup fd, for instance. As for the
read, you could, as you suggest, use
the read syscall if you want to read all the cgroups at once using a
new read_format. I don't have a problem with that.
As for cgroup-id vs. cgroup-fd, I think you make a fair point about
consistency with the existing approach. I don't have a problem
with that either

Thanks.


>
>
>
> Lots of random comments below.
>
> > This attaches all cgroups in a single syscall and I didn't add the
> > DETACH command deliberately to make the implementation simple.  The
> > attached cgroup nodes would be deleted when the file descriptor of the
> > perf_event is closed.
> >
> > Cc: Tejun Heo <tj@kernel.org>
> > Reported-by: kernel test robot <lkp@intel.com>
>
> What, the whole thing?
>
> > Acked-by: Song Liu <songliubraving@fb.com>
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > ---
> >  include/linux/perf_event.h      |  22 ++
> >  include/uapi/linux/perf_event.h |   2 +
> >  kernel/events/core.c            | 480 ++++++++++++++++++++++++++++++--
> >  3 files changed, 477 insertions(+), 27 deletions(-)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 3f7f89ea5e51..4b03cbadf4a0 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -771,6 +771,19 @@ struct perf_event {
> >
> >  #ifdef CONFIG_CGROUP_PERF
> >       struct perf_cgroup              *cgrp; /* cgroup event is attach to */
> > +
> > +     /* to share an event for multiple cgroups */
> > +     struct hlist_head               *cgrp_node_hash;
> > +     struct perf_cgroup_node         *cgrp_node_entries;
> > +     int                             nr_cgrp_nodes;
> > +     int                             cgrp_node_hash_bits;
> > +
> > +     struct list_head                cgrp_node_entry;
>
> Not related to perf_cgroup_node below, afaict the name is just plain
> wrong.
>
> > +
> > +     /* snapshot of previous reading (for perf_cgroup_node below) */
> > +     u64                             cgrp_node_count;
> > +     u64                             cgrp_node_time_enabled;
> > +     u64                             cgrp_node_time_running;
> >  #endif
> >
> >  #ifdef CONFIG_SECURITY
> > @@ -780,6 +793,13 @@ struct perf_event {
> >  #endif /* CONFIG_PERF_EVENTS */
> >  };
> >
> > +struct perf_cgroup_node {
> > +     struct hlist_node               node;
> > +     u64                             id;
> > +     u64                             count;
> > +     u64                             time_enabled;
> > +     u64                             time_running;
> > +} ____cacheline_aligned;
> >
> >  struct perf_event_groups {
> >       struct rb_root  tree;
> > @@ -843,6 +863,8 @@ struct perf_event_context {
> >       int                             pin_count;
> >  #ifdef CONFIG_CGROUP_PERF
> >       int                             nr_cgroups;      /* cgroup evts */
> > +     struct list_head                cgrp_node_list;
>
> AFAICT this is actually a list of events, not a list of cgroup_node
> thingies, hence the name is wrong.
>
> > +     struct list_head                cgrp_ctx_entry;
> >  #endif
> >       void                            *task_ctx_data; /* pmu specific data */
> >       struct rcu_head                 rcu_head;
> > diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> > index ad15e40d7f5d..06bc7ab13616 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -479,6 +479,8 @@ struct perf_event_query_bpf {
> >  #define PERF_EVENT_IOC_PAUSE_OUTPUT          _IOW('$', 9, __u32)
> >  #define PERF_EVENT_IOC_QUERY_BPF             _IOWR('$', 10, struct perf_event_query_bpf *)
> >  #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES     _IOW('$', 11, struct perf_event_attr *)
> > +#define PERF_EVENT_IOC_ATTACH_CGROUP         _IOW('$', 12, __u64 *)
> > +#define PERF_EVENT_IOC_READ_CGROUP           _IOWR('$', 13, __u64 *)
> >
> >  enum perf_event_ioc_flags {
> >       PERF_IOC_FLAG_GROUP             = 1U << 0,
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index f07943183041..bcf51c0b7855 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -379,6 +379,7 @@ enum event_type_t {
> >   * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
> >   */
> >
> > +static void perf_sched_enable(void);
> >  static void perf_sched_delayed(struct work_struct *work);
> >  DEFINE_STATIC_KEY_FALSE(perf_sched_events);
> >  static DECLARE_DELAYED_WORK(perf_sched_work, perf_sched_delayed);
> > @@ -2124,6 +2125,323 @@ static int perf_get_aux_event(struct perf_event *event,
> >       return 1;
> >  }
> >
> > +#ifdef CONFIG_CGROUP_PERF
> > +static DEFINE_PER_CPU(struct list_head, cgroup_ctx_list);
> > +
> > +static bool event_can_attach_cgroup(struct perf_event *event)
> > +{
> > +     if (is_sampling_event(event))
> > +             return false;
> > +     if (event->attach_state & PERF_ATTACH_TASK)
> > +             return false;
> > +     if (is_cgroup_event(event))
> > +             return false;
>
> Why? You could be doing a subtree.
>
> > +
> > +     return true;
> > +}
> > +
> > +static bool event_has_cgroup_node(struct perf_event *event)
> > +{
> > +     return event->nr_cgrp_nodes > 0;
> > +}
> > +
> > +static struct perf_cgroup_node *
> > +find_cgroup_node(struct perf_event *event, u64 cgrp_id)
> > +{
> > +     struct perf_cgroup_node *cgrp_node;
> > +     int key = hash_64(cgrp_id, event->cgrp_node_hash_bits);
> > +
> > +     hlist_for_each_entry(cgrp_node, &event->cgrp_node_hash[key], node) {
> > +             if (cgrp_node->id == cgrp_id)
> > +                     return cgrp_node;
> > +     }
> > +
> > +     return NULL;
> > +}
> > +
> > +static void perf_update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> > +{
> > +     u64 delta_count, delta_time_enabled, delta_time_running;
> > +     int i;
> > +
> > +     if (event->cgrp_node_count == 0)
> > +             goto out;
> > +
> > +     delta_count = local64_read(&event->count) - event->cgrp_node_count;
> > +     delta_time_enabled = event->total_time_enabled - event->cgrp_node_time_enabled;
> > +     delta_time_running = event->total_time_running - event->cgrp_node_time_running;
> > +
> > +     /* account delta to all ancestor cgroups */
> > +     for (i = 0; i <= cgrp->level; i++) {
> > +             struct perf_cgroup_node *node;
> > +
> > +             node = find_cgroup_node(event, cgrp->ancestor_ids[i]);
> > +             if (node) {
> > +                     node->count += delta_count;
> > +                     node->time_enabled += delta_time_enabled;
> > +                     node->time_running += delta_time_running;
> > +             }
> > +     }
> > +
> > +out:
> > +     event->cgrp_node_count = local64_read(&event->count);
> > +     event->cgrp_node_time_enabled = event->total_time_enabled;
> > +     event->cgrp_node_time_running = event->total_time_running;
>
> This is wrong; there's no guarantee these are the same values you read
> at the begin, IOW you could be loosing events.
>
> > +}
> > +
> > +static void update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> > +{
> > +     if (event->state == PERF_EVENT_STATE_ACTIVE)
> > +             event->pmu->read(event);
> > +
> > +     perf_event_update_time(event);
> > +     perf_update_cgroup_node(event, cgrp);
> > +}
> > +
> > +/* this is called from context switch */
> > +static void update_cgroup_node_events(struct perf_event_context *ctx,
> > +                                   struct cgroup *cgrp)
> > +{
> > +     struct perf_event *event;
> > +
> > +     lockdep_assert_held(&ctx->lock);
> > +
> > +     if (ctx->is_active & EVENT_TIME)
> > +             update_context_time(ctx);
> > +
> > +     list_for_each_entry(event, &ctx->cgrp_node_list, cgrp_node_entry)
> > +             update_cgroup_node(event, cgrp);
> > +}
> > +
> > +static void cgroup_node_sched_out(struct task_struct *task)
>
> Naming seems confused.
>
> > +{
> > +     struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
> > +     struct perf_cgroup *cgrp = perf_cgroup_from_task(task, NULL);
> > +     struct perf_event_context *ctx;
> > +
> > +     list_for_each_entry(ctx, cgrp_ctx_list, cgrp_ctx_entry) {
> > +             raw_spin_lock(&ctx->lock);
> > +             update_cgroup_node_events(ctx, cgrp->css.cgroup);
> > +             raw_spin_unlock(&ctx->lock);
> > +     }
> > +}
> > +
> > +/* these are called from the when an event is enabled/disabled */
>
> That sentence needs help.
>
> > +static void perf_add_cgrp_node_list(struct perf_event *event,
> > +                                 struct perf_event_context *ctx)
> > +{
> > +     struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
> > +     struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
> > +     bool is_first;
> > +
> > +     lockdep_assert_irqs_disabled();
> > +     lockdep_assert_held(&ctx->lock);
>
> The latter very much implies the former, no?
>
> > +
> > +     is_first = list_empty(&ctx->cgrp_node_list);
> > +     list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);
>
> See the naming being daft.
>
> > +
> > +     if (is_first)
> > +             list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);
>
> While here it actually makes sense.
>
> > +
> > +     update_cgroup_node(event, cgrp->css.cgroup);
> > +}
> > +
> > +static void perf_del_cgrp_node_list(struct perf_event *event,
> > +                                 struct perf_event_context *ctx)
> > +{
> > +     struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
> > +
> > +     lockdep_assert_irqs_disabled();
> > +     lockdep_assert_held(&ctx->lock);
> > +
> > +     update_cgroup_node(event, cgrp->css.cgroup);
> > +     /* to refresh delta when it's enabled */
> > +     event->cgrp_node_count = 0;
> > +
> > +     list_del(&event->cgrp_node_entry);
> > +
> > +     if (list_empty(&ctx->cgrp_node_list))
> > +             list_del(&ctx->cgrp_ctx_entry);
> > +}
> > +
> > +static void perf_attach_cgroup_node(struct perf_event *event,
> > +                                 struct perf_cpu_context *cpuctx,
> > +                                 struct perf_event_context *ctx,
> > +                                 void *data)
> > +{
> > +     if (ctx->is_active & EVENT_TIME)
> > +             update_context_time(ctx);
> > +
> > +     perf_add_cgrp_node_list(event, ctx);
> > +}
> > +
> > +#define MIN_CGRP_NODE_HASH  4
> > +#define MAX_CGRP_NODE_HASH  (4 * 1024)
>
> So today you think 200 cgroups is sane, tomorrow you'll complain 4k
> cgroups is not enough.
>
> > +
> > +/* this is called from ioctl */
> > +static int perf_event_attach_cgroup_node(struct perf_event *event, u64 nr_cgrps,
> > +                                      u64 *cgroup_ids)
> > +{
> > +     struct perf_cgroup_node *cgrp_node;
> > +     struct perf_event_context *ctx = event->ctx;
> > +     struct hlist_head *cgrp_node_hash;
> > +     int node = (event->cpu >= 0) ? cpu_to_node(event->cpu) : -1;
>
> How many more copies of that do we need?
>
> > +     unsigned long flags;
> > +     bool is_first = true;
> > +     bool enabled;
> > +     int i, nr_hash;
> > +     int hash_bits;
> > +
> > +     if (nr_cgrps < MIN_CGRP_NODE_HASH)
> > +             nr_hash = MIN_CGRP_NODE_HASH;
> > +     else
> > +             nr_hash = roundup_pow_of_two(nr_cgrps);
> > +     hash_bits = ilog2(nr_hash);
>
> That's like the complicated version of:
>
>         hash_bits = 1 + ilog2(max(MIN_CGRP_NODE_HASH, nr_cgrps) - 1);
>
> ?
>
> > +
> > +     cgrp_node_hash = kcalloc_node(nr_hash, sizeof(*cgrp_node_hash),
> > +                                   GFP_KERNEL, node);
> > +     if (cgrp_node_hash == NULL)
> > +             return -ENOMEM;
> > +
> > +     cgrp_node = kcalloc_node(nr_cgrps, sizeof(*cgrp_node), GFP_KERNEL, node);
> > +     if (cgrp_node == NULL) {
> > +             kfree(cgrp_node_hash);
> > +             return -ENOMEM;
> > +     }
> > +
> > +     for (i = 0; i < (int)nr_cgrps; i++) {
> > +             int key = hash_64(cgroup_ids[i], hash_bits);
> > +
> > +             cgrp_node[i].id = cgroup_ids[i];
> > +             hlist_add_head(&cgrp_node[i].node, &cgrp_node_hash[key]);
> > +     }
> > +
> > +     raw_spin_lock_irqsave(&ctx->lock, flags);
> > +
> > +     enabled = event->state >= PERF_EVENT_STATE_INACTIVE;
> > +
> > +     if (event->nr_cgrp_nodes != 0) {
> > +             kfree(event->cgrp_node_hash);
> > +             kfree(event->cgrp_node_entries);
> > +             is_first = false;
> > +     }
>
> So if we already had cgroups attached, we just plunk whatever state we
> had, without re-hashing? That's hardly sane semantics for something
> called 'attach'.
>
> And if you want this behaviour, then you should probably also accept
> nr_cgrps==0, but you don't do that either.
>
> > +
> > +     event->cgrp_node_hash = cgrp_node_hash;
> > +     event->cgrp_node_entries = cgrp_node;
> > +     event->cgrp_node_hash_bits = hash_bits;
> > +     event->nr_cgrp_nodes = nr_cgrps;
> > +
> > +     raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > +
> > +     if (is_first && enabled)
> > +             event_function_call(event, perf_attach_cgroup_node, NULL);
> > +
> > +     return 0;
> > +}
> > +
> > +static void perf_event_destroy_cgroup_nodes(struct perf_event *event)
> > +{
> > +     struct perf_event_context *ctx = event->ctx;
> > +     unsigned long flags;
> > +
> > +     raw_spin_lock_irqsave(&ctx->lock, flags);
> > +
> > +     if (event_has_cgroup_node(event)) {
> > +             if (!atomic_add_unless(&perf_sched_count, -1, 1))
> > +                     schedule_delayed_work(&perf_sched_work, HZ);
> > +     }
>
> Below you extract perf_sched_enable(), so this is somewhat inconsistent
> for not being perf_sched_disable() I'm thinking.
>
> Also, the placement seems weird, do you really want this under
> ctx->lock?
>
> > +
> > +     kfree(event->cgrp_node_hash);
> > +     kfree(event->cgrp_node_entries);
> > +     event->nr_cgrp_nodes = 0;
> > +
> > +     raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > +}
> > +
> > +static int perf_event_read(struct perf_event *event, bool group);
> > +
> > +static void __perf_read_cgroup_node(struct perf_event *event)
> > +{
> > +     struct perf_cgroup *cgrp;
> > +
> > +     if (event_has_cgroup_node(event)) {
> > +             cgrp = perf_cgroup_from_task(current, NULL);
> > +             perf_update_cgroup_node(event, cgrp->css.cgroup);
> > +     }
> > +}
> > +
> > +static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
> > +                                    u64 cgrp_id, char __user *buf)
> > +{
> > +     struct perf_cgroup_node *cgrp;
> > +     struct perf_event_context *ctx = event->ctx;
> > +     unsigned long flags;
> > +     u64 read_format = event->attr.read_format;
> > +     u64 values[4];
> > +     int n = 0;
> > +
> > +     /* update event count and times (possibly run on other cpu) */
> > +     (void)perf_event_read(event, false);
> > +
> > +     raw_spin_lock_irqsave(&ctx->lock, flags);
> > +
> > +     cgrp = find_cgroup_node(event, cgrp_id);
> > +     if (cgrp == NULL) {
> > +             raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > +             return -ENOENT;
> > +     }
> > +
> > +     values[n++] = cgrp->count;
> > +     if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
> > +             values[n++] = cgrp->time_enabled;
> > +     if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> > +             values[n++] = cgrp->time_running;
> > +     if (read_format & PERF_FORMAT_ID)
> > +             values[n++] = primary_event_id(event);
> > +
> > +     raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > +
> > +     if (copy_to_user(buf, values, n * sizeof(u64)))
> > +             return -EFAULT;
> > +
> > +     return n * sizeof(u64);
> > +}
> > +#else  /* !CONFIG_CGROUP_PERF */
> > +static inline bool event_can_attach_cgroup(struct perf_event *event)
> > +{
> > +     return false;
> > +}
> > +
> > +static inline bool event_has_cgroup_node(struct perf_event *event)
> > +{
> > +     return false;
> > +}
> > +
> > +static inline void cgroup_node_sched_out(struct task_struct *task) {}
> > +
> > +static inline void perf_add_cgrp_node_list(struct perf_event *event,
> > +                                        struct perf_event_context *ctx) {}
> > +static inline void perf_del_cgrp_node_list(struct perf_event *event,
> > +                                        struct perf_event_context *ctx) {}
> > +
> > +#define MAX_CGRP_NODE_HASH  1
> > +static inline int perf_event_attach_cgroup_node(struct perf_event *event,
> > +                                             u64 nr_cgrps, u64 *cgrp_ids)
> > +{
> > +     return -ENODEV;
> > +}
> > +
> > +static inline void perf_event_destroy_cgroup_nodes(struct perf_event *event) {}
> > +static inline  void __perf_read_cgroup_node(struct perf_event *event) {}
> > +
> > +static inline int perf_event_read_cgroup_node(struct perf_event *event,
> > +                                           u64 read_size, u64 cgrp_id,
> > +                                           char __user *buf)
> > +{
> > +     return -EINVAL;
> > +}
> > +#endif  /* CONFIG_CGROUP_PERF */
> > +
> >  static inline struct list_head *get_event_list(struct perf_event *event)
> >  {
> >       struct perf_event_context *ctx = event->ctx;
> > @@ -2407,6 +2725,7 @@ static void __perf_event_disable(struct perf_event *event,
> >
> >       perf_event_set_state(event, PERF_EVENT_STATE_OFF);
> >       perf_cgroup_event_disable(event, ctx);
> > +     perf_del_cgrp_node_list(event, ctx);
> >  }
> >
> >  /*
> > @@ -2946,6 +3265,7 @@ static void __perf_event_enable(struct perf_event *event,
> >
> >       perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
> >       perf_cgroup_event_enable(event, ctx);
> > +     perf_add_cgrp_node_list(event, ctx);
> >
> >       if (!ctx->is_active)
> >               return;
> > @@ -3568,6 +3888,13 @@ void __perf_event_task_sched_out(struct task_struct *task,
> >        */
> >       if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
> >               perf_cgroup_sched_out(task, next);
> > +
> > +#ifdef CONFIG_CGROUP_PERF
> > +     if (!list_empty(this_cpu_ptr(&cgroup_ctx_list)) &&
> > +         perf_cgroup_from_task(task, NULL) !=
> > +         perf_cgroup_from_task(next, NULL))
> > +             cgroup_node_sched_out(task);
> > +#endif
>
> Please, fold this into that one cgroup branch you already have here.
> Don't pullute things further.
>
> >  }
> >
> >  /*
> > @@ -4268,6 +4595,7 @@ static void __perf_event_read(void *info)
> >
> >       if (!data->group) {
> >               pmu->read(event);
> > +             __perf_read_cgroup_node(event);
> >               data->ret = 0;
> >               goto unlock;
> >       }
> > @@ -4283,6 +4611,7 @@ static void __perf_event_read(void *info)
> >                        * sibling could be on different (eg: software) PMU.
> >                        */
> >                       sub->pmu->read(sub);
> > +                     __perf_read_cgroup_node(sub);
> >               }
> >       }
> >
>
> Why though; nothing here looks at the new cgroup state.
>
> > @@ -4462,6 +4791,10 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
> >       INIT_LIST_HEAD(&ctx->pinned_active);
> >       INIT_LIST_HEAD(&ctx->flexible_active);
> >       refcount_set(&ctx->refcount, 1);
> > +#ifdef CONFIG_CGROUP_PERF
> > +     INIT_LIST_HEAD(&ctx->cgrp_ctx_entry);
> > +     INIT_LIST_HEAD(&ctx->cgrp_node_list);
> > +#endif
> >  }
> >
> >  static struct perf_event_context *
> > @@ -4851,6 +5184,8 @@ static void _free_event(struct perf_event *event)
> >       if (is_cgroup_event(event))
> >               perf_detach_cgroup(event);
> >
> > +     perf_event_destroy_cgroup_nodes(event);
> > +
> >       if (!event->parent) {
> >               if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
> >                       put_callchain_buffers();
> > @@ -5571,6 +5906,58 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
> >
> >               return perf_event_modify_attr(event,  &new_attr);
> >       }
> > +
> > +     case PERF_EVENT_IOC_ATTACH_CGROUP: {
> > +             u64 nr_cgrps;
> > +             u64 *cgrp_buf;
> > +             size_t cgrp_bufsz;
> > +             int ret;
> > +
> > +             if (!event_can_attach_cgroup(event))
> > +                     return -EINVAL;
> > +
> > +             if (copy_from_user(&nr_cgrps, (u64 __user *)arg,
> > +                                sizeof(nr_cgrps)))
> > +                     return -EFAULT;
> > +
> > +             if (nr_cgrps == 0 || nr_cgrps > MAX_CGRP_NODE_HASH)
> > +                     return -EINVAL;
> > +
> > +             cgrp_bufsz = nr_cgrps * sizeof(*cgrp_buf);
> > +
> > +             cgrp_buf = kmalloc(cgrp_bufsz, GFP_KERNEL);
> > +             if (cgrp_buf == NULL)
> > +                     return -ENOMEM;
> > +
> > +             if (copy_from_user(cgrp_buf, (u64 __user *)(arg + 8),
> > +                                cgrp_bufsz)) {
> > +                     kfree(cgrp_buf);
> > +                     return -EFAULT;
> > +             }
> > +
> > +             ret = perf_event_attach_cgroup_node(event, nr_cgrps, cgrp_buf);
> > +
> > +             kfree(cgrp_buf);
> > +             return ret;
> > +     }
> > +
> > +     case PERF_EVENT_IOC_READ_CGROUP: {
> > +             u64 read_size, cgrp_id;
> > +
> > +             if (!event_can_attach_cgroup(event))
> > +                     return -EINVAL;
> > +
> > +             if (copy_from_user(&read_size, (u64 __user *)arg,
> > +                                sizeof(read_size)))
> > +                     return -EFAULT;
> > +             if (copy_from_user(&cgrp_id, (u64 __user *)(arg + 8),
> > +                                sizeof(cgrp_id)))
> > +                     return -EFAULT;
> > +
> > +             return perf_event_read_cgroup_node(event, read_size, cgrp_id,
> > +                                                (char __user *)(arg + 16));
> > +     }
> > +
> >       default:
> >               return -ENOTTY;
> >       }
> > @@ -5583,10 +5970,39 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
> >       return 0;
> >  }
> >
> > +static void perf_sched_enable(void)
> > +{
> > +     /*
> > +      * We need the mutex here because static_branch_enable()
> > +      * must complete *before* the perf_sched_count increment
> > +      * becomes visible.
> > +      */
> > +     if (atomic_inc_not_zero(&perf_sched_count))
> > +             return;
> > +
> > +     mutex_lock(&perf_sched_mutex);
> > +     if (!atomic_read(&perf_sched_count)) {
> > +             static_branch_enable(&perf_sched_events);
> > +             /*
> > +              * Guarantee that all CPUs observe they key change and
> > +              * call the perf scheduling hooks before proceeding to
> > +              * install events that need them.
> > +              */
> > +             synchronize_rcu();
> > +     }
> > +     /*
> > +      * Now that we have waited for the sync_sched(), allow further
> > +      * increments to by-pass the mutex.
> > +      */
> > +     atomic_inc(&perf_sched_count);
> > +     mutex_unlock(&perf_sched_mutex);
> > +}
>
> Per the above, this is missing perf_sched_disable(). Also, this should
> probably be a separate patch then.
>
> > +
> >  static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> >  {
> >       struct perf_event *event = file->private_data;
> >       struct perf_event_context *ctx;
> > +     bool do_sched_enable = false;
> >       long ret;
> >
> >       /* Treat ioctl like writes as it is likely a mutating operation. */
> > @@ -5595,9 +6011,19 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> >               return ret;
> >
> >       ctx = perf_event_ctx_lock(event);
> > +     /* ATTACH_CGROUP requires context switch callback */
> > +     if (cmd == PERF_EVENT_IOC_ATTACH_CGROUP && !event_has_cgroup_node(event))
> > +             do_sched_enable = true;
> >       ret = _perf_ioctl(event, cmd, arg);
> >       perf_event_ctx_unlock(event, ctx);
> >
> > +     /*
> > +      * Due to the circular lock dependency, it cannot call
> > +      * static_branch_enable() under the ctx->mutex.
> > +      */
> > +     if (do_sched_enable && ret >= 0)
> > +             perf_sched_enable();
> > +
> >       return ret;
> >  }
>
> Hurmph... not much choice there I suppose.
>
> > @@ -11240,33 +11666,8 @@ static void account_event(struct perf_event *event)
> >       if (event->attr.text_poke)
> >               atomic_inc(&nr_text_poke_events);
> >
> > -     if (inc) {
> > -             /*
> > -              * We need the mutex here because static_branch_enable()
> > -              * must complete *before* the perf_sched_count increment
> > -              * becomes visible.
> > -              */
> > -             if (atomic_inc_not_zero(&perf_sched_count))
> > -                     goto enabled;
> > -
> > -             mutex_lock(&perf_sched_mutex);
> > -             if (!atomic_read(&perf_sched_count)) {
> > -                     static_branch_enable(&perf_sched_events);
> > -                     /*
> > -                      * Guarantee that all CPUs observe they key change and
> > -                      * call the perf scheduling hooks before proceeding to
> > -                      * install events that need them.
> > -                      */
> > -                     synchronize_rcu();
> > -             }
> > -             /*
> > -              * Now that we have waited for the sync_sched(), allow further
> > -              * increments to by-pass the mutex.
> > -              */
> > -             atomic_inc(&perf_sched_count);
> > -             mutex_unlock(&perf_sched_mutex);
> > -     }
> > -enabled:
> > +     if (inc)
> > +             perf_sched_enable();
> >
> >       account_event_cpu(event, event->cpu);
> >
> > @@ -13008,6 +13409,7 @@ static void __init perf_event_init_all_cpus(void)
> >
> >  #ifdef CONFIG_CGROUP_PERF
> >               INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
> > +             INIT_LIST_HEAD(&per_cpu(cgroup_ctx_list, cpu));
> >  #endif
> >               INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
> >       }
> > @@ -13218,6 +13620,29 @@ static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
> >       return 0;
> >  }
> >
> > +static int __perf_cgroup_update_node(void *info)
> > +{
> > +     struct task_struct *task = info;
> > +
> > +     rcu_read_lock();
> > +     cgroup_node_sched_out(task);
> > +     rcu_read_unlock();
> > +
> > +     return 0;
> > +}
> > +
> > +/* update cgroup counter BEFORE task's cgroup is changed */
> > +static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
> > +{
> > +     struct task_struct *task;
> > +     struct cgroup_subsys_state *css;
> > +
> > +     cgroup_taskset_for_each(task, css, tset)
> > +             task_function_call(task, __perf_cgroup_update_node, task);
> > +
> > +     return 0;
> > +}
> > +
> >  static int __perf_cgroup_move(void *info)
> >  {
> >       struct task_struct *task = info;
> > @@ -13240,6 +13665,7 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
> >       .css_alloc      = perf_cgroup_css_alloc,
> >       .css_free       = perf_cgroup_css_free,
> >       .css_online     = perf_cgroup_css_online,
> > +     .can_attach     = perf_cgroup_can_attach,
> >       .attach         = perf_cgroup_attach,
> >       /*
> >        * Implicitly enable on dfl hierarchy so that perf events can
Peter Zijlstra April 20, 2021, 9:48 a.m. UTC | #14
On Tue, Apr 20, 2021 at 01:34:40AM -0700, Stephane Eranian wrote:
> The sampling approach will certainly incur more overhead and be at
> risk of losing the ability to
> reconstruct the total counter per-cgroup, unless  you set the period
> for SW_CGROUP_SWITCHES to
> 1. But then, you run the risk of losing samples if the buffer is full
> or sampling is throtlled.
> In some scenarios, we believe the number of context switches between
> cgroup could be quite high (>> 1000/s).
> And on top you would have to add the processing of the samples to
> extract the counts per cgroup. That would require
> a synthesis on cgroup on perf record and some post-processing on perf
> report. We are interested in using the data live
> to make some policy decisions, so a counting approach with perf stat
> will always be best.

Can you please configure your MUA to sanely (re)flow text? The above
random line-breaks are *so* painful to read.
Peter Zijlstra April 20, 2021, 10:28 a.m. UTC | #15
On Fri, Apr 16, 2021 at 06:49:09PM +0900, Namhyung Kim wrote:
> On Thu, Apr 15, 2021 at 11:51 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > +static void perf_update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> > > +{
> > > +     u64 delta_count, delta_time_enabled, delta_time_running;
> > > +     int i;
> > > +
> > > +     if (event->cgrp_node_count == 0)
> > > +             goto out;
> > > +
> > > +     delta_count = local64_read(&event->count) - event->cgrp_node_count;

From here...

> > > +     delta_time_enabled = event->total_time_enabled - event->cgrp_node_time_enabled;
> > > +     delta_time_running = event->total_time_running - event->cgrp_node_time_running;
> > > +
> > > +     /* account delta to all ancestor cgroups */
> > > +     for (i = 0; i <= cgrp->level; i++) {
> > > +             struct perf_cgroup_node *node;
> > > +
> > > +             node = find_cgroup_node(event, cgrp->ancestor_ids[i]);
> > > +             if (node) {
> > > +                     node->count += delta_count;
> > > +                     node->time_enabled += delta_time_enabled;
> > > +                     node->time_running += delta_time_running;
> > > +             }
> > > +     }

... till here, NMI could hit and increment event->count, which then
means that:

> > > +
> > > +out:
> > > +     event->cgrp_node_count = local64_read(&event->count);

This load doesn't match the delta_count load and events will go missing.

Obviously correct solution is:

	event->cgrp_node_count += delta_count;


> > > +     event->cgrp_node_time_enabled = event->total_time_enabled;
> > > +     event->cgrp_node_time_running = event->total_time_running;

And while total_time doesn't have that problem, consistency would then
have you do:

	event->cgrp_node_time_foo += delta_time_foo;

> >
> > This is wrong; there's no guarantee these are the same values you read
> > at the begin, IOW you could be loosing events.
> 
> Could you please elaborate?

You forgot NMI.
Peter Zijlstra April 20, 2021, 11:28 a.m. UTC | #16
On Tue, Apr 20, 2021 at 01:34:40AM -0700, Stephane Eranian wrote:
> The sampling approach will certainly incur more overhead and be at
> risk of losing the ability to reconstruct the total counter
> per-cgroup, unless  you set the period for SW_CGROUP_SWITCHES to 1.
> But then, you run the risk of losing samples if the buffer is full or
> sampling is throtlled.  In some scenarios, we believe the number of
> context switches between cgroup could be quite high (>> 1000/s).  And
> on top you would have to add the processing of the samples to extract
> the counts per cgroup. That would require a synthesis on cgroup on
> perf record and some post-processing on perf report. We are interested
> in using the data live to make some policy decisions, so a counting
> approach with perf stat will always be best.
> 
> The fundamental problem Namhyung is trying to solve is the following:
> 
> num_fds = num_cpus x num_events x num_cgroups
> 
> On an 256-CPU AMD server running 200 cgroups with 6 events/cgroup (as
> an example):
> 
> num_fds = 256 x 200 x 6 = 307,200 fds (with all the kernel memory
> associated with them).

So the PERCPU proposal would reduce that to 200 * 6 = 1200 fds, which is
a definite win.

> On each CPU, that implies: 200 x 6 = 1200
> events to schedule and 6 to find on each cgroup switch

Right, so here we could optimize; if we find the event-groups are
identical in composition we can probably frob something that swizzles
the counts around without touching the PMU. That would be very similar
to what we already have in perf_event_context_sched_out().

This gets a wee bit tricky when you consider cgroup hierarchy though;
suppose you have:

           R
	  / \
	 A   B
	    / \
	   C   D

And are monitoring both B and D, then you'll end up with 12 counters
active instead of the 6. I'm not sure how to make that go away. 'Don't
do that then' might be good enough.

> This does not scale for us:
>    - run against the fd limit, but also memory consumption in the
>    kernel per struct file, struct inode, struct perf_event ....
>    - number of events per-cpu is still also large
>    - require event scheduling on cgroup switches, even with RB-tree
>    improvements, still heavy
>    - require event scheduling even if measuring the same events across
>    all cgroups
> 
> One factor in that equation above needs to disappear. The one counter
> per file descriptor is respected with Nahmyung's patch because he is
> operating a plain per-cpu mode. What changes is just how and where the
> count is accumulated in perf_events. The resulting programming on the
> hardware is the same as before.

Yes, you're aggregating differently. And that's exactly the problem. The
aggregation is a variable one with fairly poor semantics. Suppose you
create a new cgroup, then you have to tear down and recreate the whole
thing, which is pretty crap.

Ftrace had a similar issue; where people wanted aggregation, and that
resulted in the event histogram, which, quite frankla,y is a scary
monster that I've no intention of duplicating. That's half a programming
language implemented.

> As you point out, the difficulty is how to express the cgroups of
> interest and how to read the counts back.  I agree that the ioctl() is
> not ideal for the latter. For the former, if you do not want ioctl()
> then you would have to overload perf_event_open() with a vector of
> cgroup fd, for instance. As for the read, you could, as you suggest,
> use the read syscall if you want to read all the cgroups at once using
> a new read_format. I don't have a problem with that.  As for cgroup-id
> vs. cgroup-fd, I think you make a fair point about consistency with
> the existing approach. I don't have a problem with that either

So that is a problem of aggregation; which is basically a
programmability problem. You're asking for a variadic-fixed-function
now, but tomorrow someone else will come and want another one.
Namhyung Kim April 20, 2021, 6:37 p.m. UTC | #17
Hi Peter,

On Tue, Apr 20, 2021 at 7:28 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Apr 16, 2021 at 06:49:09PM +0900, Namhyung Kim wrote:
> > On Thu, Apr 15, 2021 at 11:51 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > +static void perf_update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> > > > +{
> > > > +     u64 delta_count, delta_time_enabled, delta_time_running;
> > > > +     int i;
> > > > +
> > > > +     if (event->cgrp_node_count == 0)
> > > > +             goto out;
> > > > +
> > > > +     delta_count = local64_read(&event->count) - event->cgrp_node_count;
>
> From here...
>
> > > > +     delta_time_enabled = event->total_time_enabled - event->cgrp_node_time_enabled;
> > > > +     delta_time_running = event->total_time_running - event->cgrp_node_time_running;
> > > > +
> > > > +     /* account delta to all ancestor cgroups */
> > > > +     for (i = 0; i <= cgrp->level; i++) {
> > > > +             struct perf_cgroup_node *node;
> > > > +
> > > > +             node = find_cgroup_node(event, cgrp->ancestor_ids[i]);
> > > > +             if (node) {
> > > > +                     node->count += delta_count;
> > > > +                     node->time_enabled += delta_time_enabled;
> > > > +                     node->time_running += delta_time_running;
> > > > +             }
> > > > +     }
>
> ... till here, NMI could hit and increment event->count, which then
> means that:
>
> > > > +
> > > > +out:
> > > > +     event->cgrp_node_count = local64_read(&event->count);
>
> This load doesn't match the delta_count load and events will go missing.
>
> Obviously correct solution is:
>
>         event->cgrp_node_count += delta_count;
>
>
> > > > +     event->cgrp_node_time_enabled = event->total_time_enabled;
> > > > +     event->cgrp_node_time_running = event->total_time_running;
>
> And while total_time doesn't have that problem, consistency would then
> have you do:
>
>         event->cgrp_node_time_foo += delta_time_foo;
>
> > >
> > > This is wrong; there's no guarantee these are the same values you read
> > > at the begin, IOW you could be loosing events.
> >
> > Could you please elaborate?
>
> You forgot NMI.

Thanks for your explanation.  Maybe I'm missing something but
this event is basically for counting and doesn't allow sampling.
Do you say it's affected by other sampling events?  Note that
it's not reading from the PMU here, what it reads is a snapshot
of last pmu->read(event) afaik.

Thanks,
Namhyung
Peter Zijlstra April 20, 2021, 6:43 p.m. UTC | #18
On Wed, Apr 21, 2021 at 03:37:11AM +0900, Namhyung Kim wrote:
> On Tue, Apr 20, 2021 at 7:28 PM Peter Zijlstra <peterz@infradead.org> wrote:

> > You forgot NMI.
> 
> Thanks for your explanation.  Maybe I'm missing something but
> this event is basically for counting and doesn't allow sampling.
> Do you say it's affected by other sampling events?  Note that
> it's not reading from the PMU here, what it reads is a snapshot
> of last pmu->read(event) afaik.

Even !sampling events will trigger NMI to deal with short hardware
counters rolling over. But yes, !sampling can also be updated from NMI
by other events if they're in a group etc..

Basically, always assume NMI/PMI can happen.
Namhyung Kim April 21, 2021, 7:37 p.m. UTC | #19
On Tue, Apr 20, 2021 at 8:29 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Apr 20, 2021 at 01:34:40AM -0700, Stephane Eranian wrote:
> > The sampling approach will certainly incur more overhead and be at
> > risk of losing the ability to reconstruct the total counter
> > per-cgroup, unless  you set the period for SW_CGROUP_SWITCHES to 1.
> > But then, you run the risk of losing samples if the buffer is full or
> > sampling is throtlled.  In some scenarios, we believe the number of
> > context switches between cgroup could be quite high (>> 1000/s).  And
> > on top you would have to add the processing of the samples to extract
> > the counts per cgroup. That would require a synthesis on cgroup on
> > perf record and some post-processing on perf report. We are interested
> > in using the data live to make some policy decisions, so a counting
> > approach with perf stat will always be best.
> >
> > The fundamental problem Namhyung is trying to solve is the following:
> >
> > num_fds = num_cpus x num_events x num_cgroups
> >
> > On an 256-CPU AMD server running 200 cgroups with 6 events/cgroup (as
> > an example):
> >
> > num_fds = 256 x 200 x 6 = 307,200 fds (with all the kernel memory
> > associated with them).
>
> So the PERCPU proposal would reduce that to 200 * 6 = 1200 fds, which is
> a definite win.

Sure.  It's good for the fd reduction.  But it won't help event scheduling
on a cpu which is a more important problem for us.

>
> > On each CPU, that implies: 200 x 6 = 1200
> > events to schedule and 6 to find on each cgroup switch
>
> Right, so here we could optimize; if we find the event-groups are
> identical in composition we can probably frob something that swizzles
> the counts around without touching the PMU. That would be very similar
> to what we already have in perf_event_context_sched_out().

Right, that's what we want.

>
> This gets a wee bit tricky when you consider cgroup hierarchy though;
> suppose you have:
>
>            R
>           / \
>          A   B
>             / \
>            C   D
>
> And are monitoring both B and D, then you'll end up with 12 counters
> active instead of the 6. I'm not sure how to make that go away. 'Don't
> do that then' might be good enough.

In my approach, it propagates the delta to the parents (if exist)
all the way to the root cgroup.

>
> > This does not scale for us:
> >    - run against the fd limit, but also memory consumption in the
> >    kernel per struct file, struct inode, struct perf_event ....
> >    - number of events per-cpu is still also large
> >    - require event scheduling on cgroup switches, even with RB-tree
> >    improvements, still heavy
> >    - require event scheduling even if measuring the same events across
> >    all cgroups
> >
> > One factor in that equation above needs to disappear. The one counter
> > per file descriptor is respected with Nahmyung's patch because he is
> > operating a plain per-cpu mode. What changes is just how and where the
> > count is accumulated in perf_events. The resulting programming on the
> > hardware is the same as before.
>
> Yes, you're aggregating differently. And that's exactly the problem. The
> aggregation is a variable one with fairly poor semantics. Suppose you
> create a new cgroup, then you have to tear down and recreate the whole
> thing, which is pretty crap.

Yep, but I think cgroup aggregation is an important use case and
we'd better support it efficiently.

Tracking all cgroups (including new one) can be difficult, that's why
I suggested passing a list of interested cgroups and counting them
only.  I can change it to allow adding new cgroups without tearing
down the existing list.  Is that ok to you?

>
> Ftrace had a similar issue; where people wanted aggregation, and that
> resulted in the event histogram, which, quite frankla,y is a scary
> monster that I've no intention of duplicating. That's half a programming
> language implemented.

The ftrace event histogram supports generic aggregation.  IOW users
can specify which key and data field to aggregate.  That surely would
complicate the things.

>
> > As you point out, the difficulty is how to express the cgroups of
> > interest and how to read the counts back.  I agree that the ioctl() is
> > not ideal for the latter. For the former, if you do not want ioctl()
> > then you would have to overload perf_event_open() with a vector of
> > cgroup fd, for instance. As for the read, you could, as you suggest,
> > use the read syscall if you want to read all the cgroups at once using
> > a new read_format. I don't have a problem with that.  As for cgroup-id
> > vs. cgroup-fd, I think you make a fair point about consistency with
> > the existing approach. I don't have a problem with that either
>
> So that is a problem of aggregation; which is basically a
> programmability problem. You're asking for a variadic-fixed-function
> now, but tomorrow someone else will come and want another one.

Well.. maybe we can add more stuff later if it's really needed.
But BPF also can handle many aggregations these days. :)

Thanks,
Namhyung
Namhyung Kim May 3, 2021, 9:53 p.m. UTC | #20
Hi Peter,

On Wed, Apr 21, 2021 at 12:37 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, Apr 20, 2021 at 8:29 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Apr 20, 2021 at 01:34:40AM -0700, Stephane Eranian wrote:
> > > This does not scale for us:
> > >    - run against the fd limit, but also memory consumption in the
> > >    kernel per struct file, struct inode, struct perf_event ....
> > >    - number of events per-cpu is still also large
> > >    - require event scheduling on cgroup switches, even with RB-tree
> > >    improvements, still heavy
> > >    - require event scheduling even if measuring the same events across
> > >    all cgroups
> > >
> > > One factor in that equation above needs to disappear. The one counter
> > > per file descriptor is respected with Nahmyung's patch because he is
> > > operating a plain per-cpu mode. What changes is just how and where the
> > > count is accumulated in perf_events. The resulting programming on the
> > > hardware is the same as before.
> >
> > Yes, you're aggregating differently. And that's exactly the problem. The
> > aggregation is a variable one with fairly poor semantics. Suppose you
> > create a new cgroup, then you have to tear down and recreate the whole
> > thing, which is pretty crap.
>
> Yep, but I think cgroup aggregation is an important use case and
> we'd better support it efficiently.
>
> Tracking all cgroups (including new one) can be difficult, that's why
> I suggested passing a list of interested cgroups and counting them
> only.  I can change it to allow adding new cgroups without tearing
> down the existing list.  Is that ok to you?

Trying to move it forward..  I'll post v4 if you don't object to adding
new cgroup nodes while keeping the existing ones.

Thanks,
Namhyung


>
> >
> > Ftrace had a similar issue; where people wanted aggregation, and that
> > resulted in the event histogram, which, quite frankla,y is a scary
> > monster that I've no intention of duplicating. That's half a programming
> > language implemented.
>
> The ftrace event histogram supports generic aggregation.  IOW users
> can specify which key and data field to aggregate.  That surely would
> complicate the things.
>
> >
> > > As you point out, the difficulty is how to express the cgroups of
> > > interest and how to read the counts back.  I agree that the ioctl() is
> > > not ideal for the latter. For the former, if you do not want ioctl()
> > > then you would have to overload perf_event_open() with a vector of
> > > cgroup fd, for instance. As for the read, you could, as you suggest,
> > > use the read syscall if you want to read all the cgroups at once using
> > > a new read_format. I don't have a problem with that.  As for cgroup-id
> > > vs. cgroup-fd, I think you make a fair point about consistency with
> > > the existing approach. I don't have a problem with that either
> >
> > So that is a problem of aggregation; which is basically a
> > programmability problem. You're asking for a variadic-fixed-function
> > now, but tomorrow someone else will come and want another one.
>
> Well.. maybe we can add more stuff later if it's really needed.
> But BPF also can handle many aggregations these days. :)
>
> Thanks,
> Namhyung
Namhyung Kim May 9, 2021, 7:13 a.m. UTC | #21
Hi Peter,

Thinking about the interface a bit more...

On Fri, Apr 16, 2021 at 4:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Apr 16, 2021 at 08:22:38PM +0900, Namhyung Kim wrote:
> > On Fri, Apr 16, 2021 at 7:28 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Apr 16, 2021 at 11:29:30AM +0200, Peter Zijlstra wrote:
> > >
> > > > > So I think we've had proposals for being able to close fds in the past;
> > > > > while preserving groups etc. We've always pushed back on that because of
> > > > > the resource limit issue. By having each counter be a filedesc we get a
> > > > > natural limit on the amount of resources you can consume. And in that
> > > > > respect, having to use 400k fds is things working as designed.
> > > > >
> > > > > Anyway, there might be a way around this..
> > >
> > > So how about we flip the whole thing sideways, instead of doing one
> > > event for multiple cgroups, do an event for multiple-cpus.
> > >
> > > Basically, allow:
> > >
> > >         perf_event_open(.pid=fd, cpu=-1, .flag=PID_CGROUP);
> > >
> > > Which would have the kernel create nr_cpus events [the corrolary is that
> > > we'd probably also allow: (.pid=-1, cpu=-1) ].
> >
> > Do you mean it'd have separate perf_events per cpu internally?
> > From a cpu's perspective, there's nothing changed, right?
> > Then it will have the same performance problem as of now.
>
> Yes, but we'll not end up in ioctl() hell. The interface is sooo much
> better. The performance thing just means we need to think harder.

So I'd like to have vector support for cgroups but it could be
extended later.  So open with a flag that it'd accept a vector

    fd = perf_event_open(.pid=-1, .cpu=N, .flag=VECTOR);

Then it'd still need an additional interface (probably ioctl) to
set (or append) the vector.

    ioctl(fd, ADD_VECTOR, { .type = VEC_CGROUP, .nr = N, ... });

Maybe we also need to add FORMAT_VECTOR and use read(v)
or friends to read the contents for each entry.  It'd be nice
if it can have a vector-specific info like cgroup-id in this case.

What do you think?

Thanks,
Namhyung

Patch
diff mbox series

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3f7f89ea5e51..4b03cbadf4a0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -771,6 +771,19 @@  struct perf_event {
 
 #ifdef CONFIG_CGROUP_PERF
 	struct perf_cgroup		*cgrp; /* cgroup event is attach to */
+
+	/* to share an event for multiple cgroups */
+	struct hlist_head		*cgrp_node_hash;
+	struct perf_cgroup_node		*cgrp_node_entries;
+	int				nr_cgrp_nodes;
+	int				cgrp_node_hash_bits;
+
+	struct list_head		cgrp_node_entry;
+
+	/* snapshot of previous reading (for perf_cgroup_node below) */
+	u64				cgrp_node_count;
+	u64				cgrp_node_time_enabled;
+	u64				cgrp_node_time_running;
 #endif
 
 #ifdef CONFIG_SECURITY
@@ -780,6 +793,13 @@  struct perf_event {
 #endif /* CONFIG_PERF_EVENTS */
 };
 
+struct perf_cgroup_node {
+	struct hlist_node		node;
+	u64				id;
+	u64				count;
+	u64				time_enabled;
+	u64				time_running;
+} ____cacheline_aligned;
 
 struct perf_event_groups {
 	struct rb_root	tree;
@@ -843,6 +863,8 @@  struct perf_event_context {
 	int				pin_count;
 #ifdef CONFIG_CGROUP_PERF
 	int				nr_cgroups;	 /* cgroup evts */
+	struct list_head		cgrp_node_list;
+	struct list_head		cgrp_ctx_entry;
 #endif
 	void				*task_ctx_data; /* pmu specific data */
 	struct rcu_head			rcu_head;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ad15e40d7f5d..06bc7ab13616 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -479,6 +479,8 @@  struct perf_event_query_bpf {
 #define PERF_EVENT_IOC_PAUSE_OUTPUT		_IOW('$', 9, __u32)
 #define PERF_EVENT_IOC_QUERY_BPF		_IOWR('$', 10, struct perf_event_query_bpf *)
 #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES	_IOW('$', 11, struct perf_event_attr *)
+#define PERF_EVENT_IOC_ATTACH_CGROUP		_IOW('$', 12, __u64 *)
+#define PERF_EVENT_IOC_READ_CGROUP		_IOWR('$', 13, __u64 *)
 
 enum perf_event_ioc_flags {
 	PERF_IOC_FLAG_GROUP		= 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f07943183041..bcf51c0b7855 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -379,6 +379,7 @@  enum event_type_t {
  * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
  */
 
+static void perf_sched_enable(void);
 static void perf_sched_delayed(struct work_struct *work);
 DEFINE_STATIC_KEY_FALSE(perf_sched_events);
 static DECLARE_DELAYED_WORK(perf_sched_work, perf_sched_delayed);
@@ -2124,6 +2125,323 @@  static int perf_get_aux_event(struct perf_event *event,
 	return 1;
 }
 
+#ifdef CONFIG_CGROUP_PERF
+static DEFINE_PER_CPU(struct list_head, cgroup_ctx_list);
+
+static bool event_can_attach_cgroup(struct perf_event *event)
+{
+	if (is_sampling_event(event))
+		return false;
+	if (event->attach_state & PERF_ATTACH_TASK)
+		return false;
+	if (is_cgroup_event(event))
+		return false;
+
+	return true;
+}
+
+static bool event_has_cgroup_node(struct perf_event *event)
+{
+	return event->nr_cgrp_nodes > 0;
+}
+
+static struct perf_cgroup_node *
+find_cgroup_node(struct perf_event *event, u64 cgrp_id)
+{
+	struct perf_cgroup_node *cgrp_node;
+	int key = hash_64(cgrp_id, event->cgrp_node_hash_bits);
+
+	hlist_for_each_entry(cgrp_node, &event->cgrp_node_hash[key], node) {
+		if (cgrp_node->id == cgrp_id)
+			return cgrp_node;
+	}
+
+	return NULL;
+}
+
+static void perf_update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
+{
+	u64 delta_count, delta_time_enabled, delta_time_running;
+	int i;
+
+	if (event->cgrp_node_count == 0)
+		goto out;
+
+	delta_count = local64_read(&event->count) - event->cgrp_node_count;
+	delta_time_enabled = event->total_time_enabled - event->cgrp_node_time_enabled;
+	delta_time_running = event->total_time_running - event->cgrp_node_time_running;
+
+	/* account delta to all ancestor cgroups */
+	for (i = 0; i <= cgrp->level; i++) {
+		struct perf_cgroup_node *node;
+
+		node = find_cgroup_node(event, cgrp->ancestor_ids[i]);
+		if (node) {
+			node->count += delta_count;
+			node->time_enabled += delta_time_enabled;
+			node->time_running += delta_time_running;
+		}
+	}
+
+out:
+	event->cgrp_node_count = local64_read(&event->count);
+	event->cgrp_node_time_enabled = event->total_time_enabled;
+	event->cgrp_node_time_running = event->total_time_running;
+}
+
+static void update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
+{
+	if (event->state == PERF_EVENT_STATE_ACTIVE)
+		event->pmu->read(event);
+
+	perf_event_update_time(event);
+	perf_update_cgroup_node(event, cgrp);
+}
+
+/* this is called from context switch */
+static void update_cgroup_node_events(struct perf_event_context *ctx,
+				      struct cgroup *cgrp)
+{
+	struct perf_event *event;
+
+	lockdep_assert_held(&ctx->lock);
+
+	if (ctx->is_active & EVENT_TIME)
+		update_context_time(ctx);
+
+	list_for_each_entry(event, &ctx->cgrp_node_list, cgrp_node_entry)
+		update_cgroup_node(event, cgrp);
+}
+
+static void cgroup_node_sched_out(struct task_struct *task)
+{
+	struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
+	struct perf_cgroup *cgrp = perf_cgroup_from_task(task, NULL);
+	struct perf_event_context *ctx;
+
+	list_for_each_entry(ctx, cgrp_ctx_list, cgrp_ctx_entry) {
+		raw_spin_lock(&ctx->lock);
+		update_cgroup_node_events(ctx, cgrp->css.cgroup);
+		raw_spin_unlock(&ctx->lock);
+	}
+}
+
+/* these are called from the when an event is enabled/disabled */
+static void perf_add_cgrp_node_list(struct perf_event *event,
+				    struct perf_event_context *ctx)
+{
+	struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
+	struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+	bool is_first;
+
+	lockdep_assert_irqs_disabled();
+	lockdep_assert_held(&ctx->lock);
+
+	is_first = list_empty(&ctx->cgrp_node_list);
+	list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);
+
+	if (is_first)
+		list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);
+
+	update_cgroup_node(event, cgrp->css.cgroup);
+}
+
+static void perf_del_cgrp_node_list(struct perf_event *event,
+				    struct perf_event_context *ctx)
+{
+	struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+
+	lockdep_assert_irqs_disabled();
+	lockdep_assert_held(&ctx->lock);
+
+	update_cgroup_node(event, cgrp->css.cgroup);
+	/* to refresh delta when it's enabled */
+	event->cgrp_node_count = 0;
+
+	list_del(&event->cgrp_node_entry);
+
+	if (list_empty(&ctx->cgrp_node_list))
+		list_del(&ctx->cgrp_ctx_entry);
+}
+
+static void perf_attach_cgroup_node(struct perf_event *event,
+				    struct perf_cpu_context *cpuctx,
+				    struct perf_event_context *ctx,
+				    void *data)
+{
+	if (ctx->is_active & EVENT_TIME)
+		update_context_time(ctx);
+
+	perf_add_cgrp_node_list(event, ctx);
+}
+
+#define MIN_CGRP_NODE_HASH  4
+#define MAX_CGRP_NODE_HASH  (4 * 1024)
+
+/* this is called from ioctl */
+static int perf_event_attach_cgroup_node(struct perf_event *event, u64 nr_cgrps,
+					 u64 *cgroup_ids)
+{
+	struct perf_cgroup_node *cgrp_node;
+	struct perf_event_context *ctx = event->ctx;
+	struct hlist_head *cgrp_node_hash;
+	int node = (event->cpu >= 0) ? cpu_to_node(event->cpu) : -1;
+	unsigned long flags;
+	bool is_first = true;
+	bool enabled;
+	int i, nr_hash;
+	int hash_bits;
+
+	if (nr_cgrps < MIN_CGRP_NODE_HASH)
+		nr_hash = MIN_CGRP_NODE_HASH;
+	else
+		nr_hash = roundup_pow_of_two(nr_cgrps);
+	hash_bits = ilog2(nr_hash);
+
+	cgrp_node_hash = kcalloc_node(nr_hash, sizeof(*cgrp_node_hash),
+				      GFP_KERNEL, node);
+	if (cgrp_node_hash == NULL)
+		return -ENOMEM;
+
+	cgrp_node = kcalloc_node(nr_cgrps, sizeof(*cgrp_node), GFP_KERNEL, node);
+	if (cgrp_node == NULL) {
+		kfree(cgrp_node_hash);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < (int)nr_cgrps; i++) {
+		int key = hash_64(cgroup_ids[i], hash_bits);
+
+		cgrp_node[i].id = cgroup_ids[i];
+		hlist_add_head(&cgrp_node[i].node, &cgrp_node_hash[key]);
+	}
+
+	raw_spin_lock_irqsave(&ctx->lock, flags);
+
+	enabled = event->state >= PERF_EVENT_STATE_INACTIVE;
+
+	if (event->nr_cgrp_nodes != 0) {
+		kfree(event->cgrp_node_hash);
+		kfree(event->cgrp_node_entries);
+		is_first = false;
+	}
+
+	event->cgrp_node_hash = cgrp_node_hash;
+	event->cgrp_node_entries = cgrp_node;
+	event->cgrp_node_hash_bits = hash_bits;
+	event->nr_cgrp_nodes = nr_cgrps;
+
+	raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+	if (is_first && enabled)
+		event_function_call(event, perf_attach_cgroup_node, NULL);
+
+	return 0;
+}
+
+static void perf_event_destroy_cgroup_nodes(struct perf_event *event)
+{
+	struct perf_event_context *ctx = event->ctx;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&ctx->lock, flags);
+
+	if (event_has_cgroup_node(event)) {
+		if (!atomic_add_unless(&perf_sched_count, -1, 1))
+			schedule_delayed_work(&perf_sched_work, HZ);
+	}
+
+	kfree(event->cgrp_node_hash);
+	kfree(event->cgrp_node_entries);
+	event->nr_cgrp_nodes = 0;
+
+	raw_spin_unlock_irqrestore(&ctx->lock, flags);
+}
+
+static int perf_event_read(struct perf_event *event, bool group);
+
+static void __perf_read_cgroup_node(struct perf_event *event)
+{
+	struct perf_cgroup *cgrp;
+
+	if (event_has_cgroup_node(event)) {
+		cgrp = perf_cgroup_from_task(current, NULL);
+		perf_update_cgroup_node(event, cgrp->css.cgroup);
+	}
+}
+
+static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
+				       u64 cgrp_id, char __user *buf)
+{
+	struct perf_cgroup_node *cgrp;
+	struct perf_event_context *ctx = event->ctx;
+	unsigned long flags;
+	u64 read_format = event->attr.read_format;
+	u64 values[4];
+	int n = 0;
+
+	/* update event count and times (possibly run on other cpu) */
+	(void)perf_event_read(event, false);
+
+	raw_spin_lock_irqsave(&ctx->lock, flags);
+
+	cgrp = find_cgroup_node(event, cgrp_id);
+	if (cgrp == NULL) {
+		raw_spin_unlock_irqrestore(&ctx->lock, flags);
+		return -ENOENT;
+	}
+
+	values[n++] = cgrp->count;
+	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
+		values[n++] = cgrp->time_enabled;
+	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
+		values[n++] = cgrp->time_running;
+	if (read_format & PERF_FORMAT_ID)
+		values[n++] = primary_event_id(event);
+
+	raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+	if (copy_to_user(buf, values, n * sizeof(u64)))
+		return -EFAULT;
+
+	return n * sizeof(u64);
+}
+#else  /* !CONFIG_CGROUP_PERF */
+static inline bool event_can_attach_cgroup(struct perf_event *event)
+{
+	return false;
+}
+
+static inline bool event_has_cgroup_node(struct perf_event *event)
+{
+	return false;
+}
+
+static inline void cgroup_node_sched_out(struct task_struct *task) {}
+
+static inline void perf_add_cgrp_node_list(struct perf_event *event,
+					   struct perf_event_context *ctx) {}
+static inline void perf_del_cgrp_node_list(struct perf_event *event,
+					   struct perf_event_context *ctx) {}
+
+#define MAX_CGRP_NODE_HASH  1
+static inline int perf_event_attach_cgroup_node(struct perf_event *event,
+						u64 nr_cgrps, u64 *cgrp_ids)
+{
+	return -ENODEV;
+}
+
+static inline void perf_event_destroy_cgroup_nodes(struct perf_event *event) {}
+static inline  void __perf_read_cgroup_node(struct perf_event *event) {}
+
+static inline int perf_event_read_cgroup_node(struct perf_event *event,
+					      u64 read_size, u64 cgrp_id,
+					      char __user *buf)
+{
+	return -EINVAL;
+}
+#endif  /* CONFIG_CGROUP_PERF */
+
 static inline struct list_head *get_event_list(struct perf_event *event)
 {
 	struct perf_event_context *ctx = event->ctx;
@@ -2407,6 +2725,7 @@  static void __perf_event_disable(struct perf_event *event,
 
 	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
 	perf_cgroup_event_disable(event, ctx);
+	perf_del_cgrp_node_list(event, ctx);
 }
 
 /*
@@ -2946,6 +3265,7 @@  static void __perf_event_enable(struct perf_event *event,
 
 	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
 	perf_cgroup_event_enable(event, ctx);
+	perf_add_cgrp_node_list(event, ctx);
 
 	if (!ctx->is_active)
 		return;
@@ -3568,6 +3888,13 @@  void __perf_event_task_sched_out(struct task_struct *task,
 	 */
 	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
 		perf_cgroup_sched_out(task, next);
+
+#ifdef CONFIG_CGROUP_PERF
+	if (!list_empty(this_cpu_ptr(&cgroup_ctx_list)) &&
+	    perf_cgroup_from_task(task, NULL) !=
+	    perf_cgroup_from_task(next, NULL))
+		cgroup_node_sched_out(task);
+#endif
 }
 
 /*
@@ -4268,6 +4595,7 @@  static void __perf_event_read(void *info)
 
 	if (!data->group) {
 		pmu->read(event);
+		__perf_read_cgroup_node(event);
 		data->ret = 0;
 		goto unlock;
 	}
@@ -4283,6 +4611,7 @@  static void __perf_event_read(void *info)
 			 * sibling could be on different (eg: software) PMU.
 			 */
 			sub->pmu->read(sub);
+			__perf_read_cgroup_node(sub);
 		}
 	}
 
@@ -4462,6 +4791,10 @@  static void __perf_event_init_context(struct perf_event_context *ctx)
 	INIT_LIST_HEAD(&ctx->pinned_active);
 	INIT_LIST_HEAD(&ctx->flexible_active);
 	refcount_set(&ctx->refcount, 1);
+#ifdef CONFIG_CGROUP_PERF
+	INIT_LIST_HEAD(&ctx->cgrp_ctx_entry);
+	INIT_LIST_HEAD(&ctx->cgrp_node_list);
+#endif
 }
 
 static struct perf_event_context *
@@ -4851,6 +5184,8 @@  static void _free_event(struct perf_event *event)
 	if (is_cgroup_event(event))
 		perf_detach_cgroup(event);
 
+	perf_event_destroy_cgroup_nodes(event);
+
 	if (!event->parent) {
 		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
 			put_callchain_buffers();
@@ -5571,6 +5906,58 @@  static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
 
 		return perf_event_modify_attr(event,  &new_attr);
 	}
+
+	case PERF_EVENT_IOC_ATTACH_CGROUP: {
+		u64 nr_cgrps;
+		u64 *cgrp_buf;
+		size_t cgrp_bufsz;
+		int ret;
+
+		if (!event_can_attach_cgroup(event))
+			return -EINVAL;
+
+		if (copy_from_user(&nr_cgrps, (u64 __user *)arg,
+				   sizeof(nr_cgrps)))
+			return -EFAULT;
+
+		if (nr_cgrps == 0 || nr_cgrps > MAX_CGRP_NODE_HASH)
+			return -EINVAL;
+
+		cgrp_bufsz = nr_cgrps * sizeof(*cgrp_buf);
+
+		cgrp_buf = kmalloc(cgrp_bufsz, GFP_KERNEL);
+		if (cgrp_buf == NULL)
+			return -ENOMEM;
+
+		if (copy_from_user(cgrp_buf, (u64 __user *)(arg + 8),
+				   cgrp_bufsz)) {
+			kfree(cgrp_buf);
+			return -EFAULT;
+		}
+
+		ret = perf_event_attach_cgroup_node(event, nr_cgrps, cgrp_buf);
+
+		kfree(cgrp_buf);
+		return ret;
+	}
+
+	case PERF_EVENT_IOC_READ_CGROUP: {
+		u64 read_size, cgrp_id;
+
+		if (!event_can_attach_cgroup(event))
+			return -EINVAL;
+
+		if (copy_from_user(&read_size, (u64 __user *)arg,
+				   sizeof(read_size)))
+			return -EFAULT;
+		if (copy_from_user(&cgrp_id, (u64 __user *)(arg + 8),
+				   sizeof(cgrp_id)))
+			return -EFAULT;
+
+		return perf_event_read_cgroup_node(event, read_size, cgrp_id,
+						   (char __user *)(arg + 16));
+	}
+
 	default:
 		return -ENOTTY;
 	}
@@ -5583,10 +5970,39 @@  static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
 	return 0;
 }
 
+static void perf_sched_enable(void)
+{
+	/*
+	 * We need the mutex here because static_branch_enable()
+	 * must complete *before* the perf_sched_count increment
+	 * becomes visible.
+	 */
+	if (atomic_inc_not_zero(&perf_sched_count))
+		return;
+
+	mutex_lock(&perf_sched_mutex);
+	if (!atomic_read(&perf_sched_count)) {
+		static_branch_enable(&perf_sched_events);
+		/*
+		 * Guarantee that all CPUs observe they key change and
+		 * call the perf scheduling hooks before proceeding to
+		 * install events that need them.
+		 */
+		synchronize_rcu();
+	}
+	/*
+	 * Now that we have waited for the sync_sched(), allow further
+	 * increments to by-pass the mutex.
+	 */
+	atomic_inc(&perf_sched_count);
+	mutex_unlock(&perf_sched_mutex);
+}
+
 static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct perf_event *event = file->private_data;
 	struct perf_event_context *ctx;
+	bool do_sched_enable = false;
 	long ret;
 
 	/* Treat ioctl like writes as it is likely a mutating operation. */
@@ -5595,9 +6011,19 @@  static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		return ret;
 
 	ctx = perf_event_ctx_lock(event);
+	/* ATTACH_CGROUP requires context switch callback */
+	if (cmd == PERF_EVENT_IOC_ATTACH_CGROUP && !event_has_cgroup_node(event))
+		do_sched_enable = true;
 	ret = _perf_ioctl(event, cmd, arg);
 	perf_event_ctx_unlock(event, ctx);
 
+	/*
+	 * Due to the circular lock dependency, it cannot call
+	 * static_branch_enable() under the ctx->mutex.
+	 */
+	if (do_sched_enable && ret >= 0)
+		perf_sched_enable();
+
 	return ret;
 }
 
@@ -11240,33 +11666,8 @@  static void account_event(struct perf_event *event)
 	if (event->attr.text_poke)
 		atomic_inc(&nr_text_poke_events);
 
-	if (inc) {
-		/*
-		 * We need the mutex here because static_branch_enable()
-		 * must complete *before* the perf_sched_count increment
-		 * becomes visible.
-		 */
-		if (atomic_inc_not_zero(&perf_sched_count))
-			goto enabled;
-
-		mutex_lock(&perf_sched_mutex);
-		if (!atomic_read(&perf_sched_count)) {
-			static_branch_enable(&perf_sched_events);
-			/*
-			 * Guarantee that all CPUs observe they key change and
-			 * call the perf scheduling hooks before proceeding to
-			 * install events that need them.
-			 */
-			synchronize_rcu();
-		}
-		/*
-		 * Now that we have waited for the sync_sched(), allow further
-		 * increments to by-pass the mutex.
-		 */
-		atomic_inc(&perf_sched_count);
-		mutex_unlock(&perf_sched_mutex);
-	}
-enabled:
+	if (inc)
+		perf_sched_enable();
 
 	account_event_cpu(event, event->cpu);
 
@@ -13008,6 +13409,7 @@  static void __init perf_event_init_all_cpus(void)
 
 #ifdef CONFIG_CGROUP_PERF
 		INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
+		INIT_LIST_HEAD(&per_cpu(cgroup_ctx_list, cpu));
 #endif
 		INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
 	}
@@ -13218,6 +13620,29 @@  static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 
+static int __perf_cgroup_update_node(void *info)
+{
+	struct task_struct *task = info;
+
+	rcu_read_lock();
+	cgroup_node_sched_out(task);
+	rcu_read_unlock();
+
+	return 0;
+}
+
+/* update cgroup counter BEFORE task's cgroup is changed */
+static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct cgroup_subsys_state *css;
+
+	cgroup_taskset_for_each(task, css, tset)
+		task_function_call(task, __perf_cgroup_update_node, task);
+
+	return 0;
+}
+
 static int __perf_cgroup_move(void *info)
 {
 	struct task_struct *task = info;
@@ -13240,6 +13665,7 @@  struct cgroup_subsys perf_event_cgrp_subsys = {
 	.css_alloc	= perf_cgroup_css_alloc,
 	.css_free	= perf_cgroup_css_free,
 	.css_online	= perf_cgroup_css_online,
+	.can_attach	= perf_cgroup_can_attach,
 	.attach		= perf_cgroup_attach,
 	/*
 	 * Implicitly enable on dfl hierarchy so that perf events can