From: Chengming Zhou <zhouchengming@bytedance.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: tj@kernel.org, corbet@lwn.net, surenb@google.com,
	mingo@redhat.com, peterz@infradead.org,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	songmuchun@bytedance.com
Subject: Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Date: Tue, 16 Aug 2022 21:06:21 +0800
Message-ID: <904851a7-7b01-8689-3ec1-2a61f8244841@bytedance.com>
In-Reply-To: <YvprI6ZL8dVWGyBO@cmpxchg.org>

On 2022/8/15 23:49, Johannes Weiner wrote:
> On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou wrote:
>> +static ssize_t cgroup_psi_write(struct kernfs_open_file *of,
>> +				char *buf, size_t nbytes, loff_t off)
>> +{
>> +	ssize_t ret;
>> +	int enable;
>> +	struct cgroup *cgrp;
>> +	struct psi_group *psi;
>> +
>> +	ret = kstrtoint(strstrip(buf), 0, &enable);
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (enable < 0 || enable > 1)
>> +		return -ERANGE;
>> +
>> +	cgrp = cgroup_kn_lock_live(of->kn, false);
>> +	if (!cgrp)
>> +		return -ENOENT;
>> +
>> +	psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
>> +	psi_cgroup_enable(psi, enable);
> 
> I think it should also add/remove the pressure files when enabling and
> disabling the aggregation, since their contents would be stale and
> misleading.
> 
> Take a look at cgroup_add_dfl_cftypes() and cgroup_rm_cftypes()

Ok, I will look.

> 
>> @@ -5115,6 +5152,12 @@ static struct cftype cgroup_base_files[] = {
>>  		.release = cgroup_pressure_release,
>>  	},
>>  #endif
>> +	{
>> +		.name = "cgroup.psi",
>> +		.flags = CFTYPE_PRESSURE,
>> +		.seq_show = cgroup_psi_show,
>> +		.write = cgroup_psi_write,
>> +	},
>>  #endif /* CONFIG_PSI */
>>  	{ }	/* terminate */
>>  };
>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>> index 58f8092c938f..9df1686ee02d 100644
>> --- a/kernel/sched/psi.c
>> +++ b/kernel/sched/psi.c
>> @@ -181,6 +181,7 @@ static void group_init(struct psi_group *group)
>>  {
>>  	int cpu;
>>  
>> +	group->enabled = true;
>>  	for_each_possible_cpu(cpu)
>>  		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
>>  	group->avg_last_update = sched_clock();
>> @@ -700,17 +701,16 @@ static void psi_group_change(struct psi_group *group, int cpu,
>>  	groupc = per_cpu_ptr(group->pcpu, cpu);
>>  
>>  	/*
>> -	 * First we assess the aggregate resource states this CPU's
>> -	 * tasks have been in since the last change, and account any
>> -	 * SOME and FULL time these may have resulted in.
>> -	 *
>> -	 * Then we update the task counts according to the state
>> +	 * First we update the task counts according to the state
>>  	 * change requested through the @clear and @set bits.
>> +	 *
>> +	 * Then if the cgroup PSI stats accounting enabled, we
>> +	 * assess the aggregate resource states this CPU's tasks
>> +	 * have been in since the last change, and account any
>> +	 * SOME and FULL time these may have resulted in.
>>  	 */
>>  	write_seqcount_begin(&groupc->seq);
>>  
>> -	record_times(groupc, now);
>> -
>>  	/*
>>  	 * Start with TSK_ONCPU, which doesn't have a corresponding
>>  	 * task count - it's just a boolean flag directly encoded in
>> @@ -750,6 +750,14 @@ static void psi_group_change(struct psi_group *group, int cpu,
>>  		if (set & (1 << t))
>>  			groupc->tasks[t]++;
>>  
>> +	if (!group->enabled) {
>> +		if (groupc->state_mask & (1 << PSI_NONIDLE))
>> +			record_times(groupc, now);
> 
> Why record the nonidle time? It's only used for aggregation, which is
> stopped as well.

I'm considering this situation: disable at t2 and re-enable at t3.

state1(t1) --> state2(t2) --> state3(t3)

If an aggregator has called get_recent_times() at some time t in [t1, t2],
groupc->times_prev[aggregator] will already include the delta (t - t1).

Then, when we re-enable at t3, the delta (t3 - t1) is discarded, which may make
that aggregator see times < groupc->times_prev[aggregator]?

Maybe I missed something; I'm not sure whether this is a problem.
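
To make the concern concrete, the following is a minimal user-space sketch of
one PSI state on one CPU. It is not kernel code: the struct and all model_*
helpers are hypothetical names invented for illustration, and the timestamps
(100, 150, 200, 300) are arbitrary. The only behavior borrowed from psi.c is
the assumption that an aggregator's reading includes the still-open interval
and that the delta against times_prev is computed in unsigned 32-bit
arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for one state of a per-CPU PSI group (hypothetical). */
struct model_groupc {
	uint32_t times;       /* time accumulated by closed intervals */
	uint32_t times_prev;  /* total the aggregator saw on its last read */
	uint64_t state_start; /* when the currently open interval began */
	bool active;          /* is the state currently set? */
};

/* Fold the open interval into times, in the spirit of record_times(). */
static void model_record(struct model_groupc *g, uint64_t now)
{
	assert(now >= g->state_start);
	if (g->active)
		g->times += (uint32_t)(now - g->state_start);
	g->state_start = now;
}

/*
 * Aggregator read, in the spirit of get_recent_times(): include the open
 * interval in the total, then return the delta since the previous read.
 */
static uint32_t model_read(struct model_groupc *g, uint64_t now)
{
	uint32_t total = g->times;
	uint32_t delta;

	if (g->active)
		total += (uint32_t)(now - g->state_start);
	delta = total - g->times_prev; /* wraps if total went backwards */
	g->times_prev = total;
	return delta;
}

/* Stop accounting; optionally record the open interval first. */
static void model_disable(struct model_groupc *g, uint64_t now, bool record)
{
	if (record)
		model_record(g, now);
	g->active = false;
}

/* Resume accounting with a fresh interval starting at now. */
static void model_enable(struct model_groupc *g, uint64_t now)
{
	g->state_start = now;
	g->active = true;
}

/*
 * State set at t1 = 100, aggregator reads at t = 150, disable at t2 = 200,
 * re-enable at t3 = 300, aggregator reads again at t3.
 */
static uint32_t model_scenario(bool record_on_disable)
{
	struct model_groupc g = {
		.times = 0, .times_prev = 0, .state_start = 100, .active = true,
	};

	model_read(&g, 150);
	model_disable(&g, 200, record_on_disable);
	model_enable(&g, 300);
	return model_read(&g, 300);
}
```

In this model, model_scenario(true) returns a delta of 50, while
model_scenario(false) returns a wrapped value of 2^32 - 50, because the total
seen by the aggregator went backwards. Whether the real code can reach that
state depends on details the model leaves out.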


> 
>> @@ -1088,6 +1097,23 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
>>  
>>  	task_rq_unlock(rq, task, &rf);
>>  }
>> +
>> +void psi_cgroup_enable(struct psi_group *group, bool enable)
>> +{
>> +	struct psi_group_cpu *groupc;
>> +	int cpu;
>> +	u64 now;
>> +
>> +	if (group->enabled == enable)
>> +		return;
>> +	group->enabled = enable;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		groupc = per_cpu_ptr(group->pcpu, cpu);
>> +		now = cpu_clock(cpu);
>> +		psi_group_change(group, cpu, 0, 0, now, true);
> 
> This loop deserves a comment, IMO.

I added a comment as below; could you take a look?

+
+void psi_cgroup_enable(struct psi_group *group, bool enable)
+{
+       int cpu;
+       u64 now;
+
+       if (group->enabled == enable)
+               return;
+       group->enabled = enable;
+
+       /*
+        * Use psi_group_change() to disable or re-enable
+        * record_times(), the test_state() loop and the averaging
+        * worker for each psi_group_cpu of the psi_group. Pass
+        * .clear = 0 and .set = 0 since no task state really changes.
+        */
+       for_each_possible_cpu(cpu) {
+               now = cpu_clock(cpu);
+               psi_group_change(group, cpu, 0, 0, now, true);
+       }
+}

Thanks!

