From: Thomas Gleixner <tglx@linutronix.de>
To: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: vikas.shivappa@intel.com, davidcc@google.com, eranian@google.com,
	linux-kernel@vger.kernel.org, x86@kernel.org, hpa@zytor.com,
	mingo@kernel.org, peterz@infradead.org, ravi.v.shankar@intel.com,
	tony.luck@intel.com, fenghua.yu@intel.com, andi.kleen@intel.com,
	h.peter.anvin@intel.com
Subject: Re: [PATCH 08/12] x86/cqm: Add support for monitoring task and cgroup together
Date: Tue, 17 Jan 2017 17:11:55 +0100 (CET)
Message-ID: <alpine.DEB.2.20.1701171652270.3495@nanos>
In-Reply-To: <1483740005-23499-9-git-send-email-vikas.shivappa@linux.intel.com>

On Fri, 6 Jan 2017, Vikas Shivappa wrote:

> From: Vikas Shivappa<vikas.shivappa@intel.com>
> 
> This patch adds support to monitor a cgroup x and a task p1
> when p1 is part of cgroup x. Since we cannot write two RMIDs during
> sched in, the driver handles this.

Again you explain WHAT not WHY....

> This patch introduces a u32 *rmid in the task_struct which keeps track
> of the RMIDs associated with the task.  There is also a list in the
> arch_info of perf_cgroup called taskmon_list which keeps track of tasks
> in the cgroup that are monitored.
> 
> The taskmon_list is modified in 2 scenarios.
> - at event_init of task p1 which is part of a cgroup, add p1 to the
> cgroup->tskmon_list. At event_destroy delete the task from the list.
> - at the time of task move from cgrp x to cgrp y, if the task was monitored,
> remove the task from the cgrp x tskmon_list and add it to the cgrp y
> tskmon_list.
> 
> sched in: When the task p1 is scheduled in, we write the task RMID into
> the PQR_ASSOC MSR.

Great information.
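
For readers following along, the sched-in write described above presumably
amounts to the PQR_ASSOC update the existing CQM/RDT code already does.
A minimal sketch, assuming the RMID goes into the low word and the CLOSID
into the high word of the MSR; the helper name is made up:

#include <linux/types.h>
#include <asm/msr.h>

#ifndef MSR_IA32_PQR_ASSOC
#define MSR_IA32_PQR_ASSOC	0x0c8f
#endif

/*
 * Sketch only: on sched in, write the task's RMID into PQR_ASSOC.
 * wrmsr() takes the low and high 32 bits separately, so the RMID lands
 * in the low word and the CLOSID in the high word.
 */
static inline void __cqm_write_rmid(u32 rmid, u32 closid)
{
	wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
}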

> read(for task p1): As any other cqm task event
> 
> read(for the cgroup x): When counting for cgroup, the taskmon list is
> traversed and the corresponding RMID counts are added.
> 
> Tests: Monitoring a cgroup x and a task within the cgroup x should
> work.

Emphasis on should.
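
For reference, a minimal sketch of the bookkeeping and the cgroup read path
the changelog describes. struct tsk_rmid_entry, tskmon_rlist and the RMID
pointer come from this patch; read_rmid_count() is a hypothetical stand-in
for whatever reads the occupancy counter of one RMID:

#include <linux/types.h>
#include <linux/list.h>

/* Per-task entry on the cgroup's tskmon_rlist, mirroring the patch below. */
struct tsk_rmid_entry {
	struct list_head	list;
	u32			*rmid;
};

/* Hypothetical helper: read the occupancy count for one (per-pkg) RMID. */
u64 read_rmid_count(u32 *rmid);

/*
 * Sketch only: a cgroup read walks the monitored-task list and sums the
 * per-task RMID counts, as described above.
 */
static u64 cgrp_read_tskmon(struct list_head *tskmon_rlist)
{
	struct tsk_rmid_entry *entry;
	u64 count = 0;

	list_for_each_entry(entry, tskmon_rlist, list)
		count += read_rmid_count(entry->rmid);

	return count;
}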

> +static inline int add_cgrp_tskmon_entry(u32 *rmid, struct list_head *l)
> +{
> +	struct tsk_rmid_entry *entry;
> +
> +	entry = kzalloc(sizeof(struct tsk_rmid_entry), GFP_KERNEL);
> +	if (!entry)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&entry->list);
> +	entry->rmid = rmid;
> +
> +	list_add_tail(&entry->list, l);
> +
> +	return 0;

And why does this function have a return value? The return value is never
evaluated at any call site.

> +}
> +
> +static inline void del_cgrp_tskmon_entry(u32 *rmid, struct list_head *l)
> +{
> +	struct tsk_rmid_entry *entry = NULL, *tmp1;

And where is *tmp2? What is wrong with simply *tmp?

> +
> +	list_for_each_entry_safe(entry, tmp1, l, list) {
> +		if (entry->rmid == rmid) {
> +
> +			list_del(&entry->list);
> +			kfree(entry);
> +			break;
> +		}
> +	}
> +}
> +
>  #ifdef CONFIG_CGROUP_PERF
>  struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
>  {
> @@ -380,6 +410,49 @@ struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
>  }
>  #endif
>  
> +static inline void
> +	cgrp_tskmon_update(struct task_struct *tsk, u32 *rmid, bool ena)

Sigh

> +{
> +	struct cgrp_cqm_info *ccinfo = NULL;
> +
> +#ifdef CONFIG_CGROUP_PERF
> +	ccinfo = cqminfo_from_tsk(tsk);
> +#endif
> +	if (!ccinfo)
> +		return;
> +
> +	if (ena)
> +		add_cgrp_tskmon_entry(rmid, &ccinfo->tskmon_rlist);
> +	else
> +		del_cgrp_tskmon_entry(rmid, &ccinfo->tskmon_rlist);
> +}
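
For illustration, a minimal sketch of the same helper with the return value
of add_cgrp_tskmon_entry() actually consumed, which is one way to address the
unevaluated-return-value point above (the other is to make
add_cgrp_tskmon_entry() return void); callers of cgrp_tskmon_update() would
then have to check its result as well:

/* Sketch only: propagate the -ENOMEM instead of dropping it on the floor. */
static int cgrp_tskmon_update(struct task_struct *tsk, u32 *rmid, bool ena)
{
	struct cgrp_cqm_info *ccinfo = NULL;

#ifdef CONFIG_CGROUP_PERF
	ccinfo = cqminfo_from_tsk(tsk);
#endif
	if (!ccinfo)
		return 0;

	if (ena)
		return add_cgrp_tskmon_entry(rmid, &ccinfo->tskmon_rlist);

	del_cgrp_tskmon_entry(rmid, &ccinfo->tskmon_rlist);
	return 0;
}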
> +
> +static int cqm_assign_task_rmid(struct perf_event *event, u32 *rmid)
> +{
> +	struct task_struct *tsk;
> +	int ret = 0;
> +
> +	rcu_read_lock();
> +	tsk = event->hw.target;
> +	if (pid_alive(tsk)) {
> +		get_task_struct(tsk);

This works because, after the pid_alive() check, the task cannot be
released before issuing get_task_struct(), right?

That's voodoo protection. How would a non-alive task end up as an event
target?

> +
> +		if (rmid != NULL)
> +			cgrp_tskmon_update(tsk, rmid, true);
> +		else
> +			cgrp_tskmon_update(tsk, tsk->rmid, false);
> +
> +		tsk->rmid = rmid;
> +
> +		put_task_struct(tsk);
> +	} else {
> +		ret = -EINVAL;
> +	}
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
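
For comparison, a sketch of the assignment without the pid_alive()/refcount
dance, on the assumption implied above that the perf core keeps the target
task pinned for the lifetime of the event:

/*
 * Sketch only: if event->hw.target cannot go away while the event
 * exists, the RMID assignment reduces to this.
 */
static int cqm_assign_task_rmid(struct perf_event *event, u32 *rmid)
{
	struct task_struct *tsk = event->hw.target;

	if (rmid)
		cgrp_tskmon_update(tsk, rmid, true);
	else
		cgrp_tskmon_update(tsk, tsk->rmid, false);

	tsk->rmid = rmid;

	return 0;
}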
> +
>  static inline void cqm_enable_mon(struct cgrp_cqm_info *cqm_info, u32 *rmid)
>  {
>  	if (rmid != NULL) {
> @@ -429,8 +502,12 @@ static void cqm_assign_hier_rmid(struct cgroup_subsys_state *rcss, u32 *rmid)
>  
>  static int cqm_assign_rmid(struct perf_event *event, u32 *rmid)
>  {
> +	if (is_task_event(event)) {
> +		if (cqm_assign_task_rmid(event, rmid))
> +			return -EINVAL;
> +	}
>  #ifdef CONFIG_CGROUP_PERF
> -	if (is_cgroup_event(event)) {
> +	else if (is_cgroup_event(event)) {
>  		cqm_assign_hier_rmid(&event->cgrp->css, rmid);
>  	}

So you keep adding stuff to cqm_assign_rmid(), which handles both enable and
disable. But the only call site is in cqm_event_free_rmid(), which calls
that function with rmid = NULL, i.e. disable.

Can you finally explain how this is supposed to work and how all of this
has been tested and validated?

If you had added the already known 'Tests: Same as before' line to the
changelog, then we would have known that it's as broken as before w/o looking
at the patch.

So the new variant of 'broken' is: Bla should work ....

Thanks,

	tglx
