From: Chengming Zhou <zhouchengming@bytedance.com>
To: Suren Baghdasaryan <surenb@google.com>
Cc: hannes@cmpxchg.org, quic_pkondeti@quicinc.com,
	peterz@infradead.org, quic_charante@quicinc.com,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/psi: Fix avgs_work re-arm in psi_avgs_work()
Date: Thu, 13 Oct 2022 19:06:55 +0800	[thread overview]
Message-ID: <46c6e1cc-77d3-eac1-fa18-deed2bac4a0e@bytedance.com> (raw)
In-Reply-To: <CAJuCfpFTDyR1V+JYOY_uN6Xg1Nip5b=9dzkwm-CNd8vMWaQQFQ@mail.gmail.com>

On 2022/10/13 02:24, Suren Baghdasaryan wrote:
> On Tue, Oct 11, 2022 at 7:11 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> On 2022/10/12 01:00, Suren Baghdasaryan wrote:
>>> On Mon, Oct 10, 2022 at 5:07 PM Chengming Zhou
>>> <zhouchengming@bytedance.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> On 2022/10/11 05:21, Suren Baghdasaryan wrote:
>>>>> On Mon, Oct 10, 2022 at 3:42 AM Chengming Zhou
>>>>> <zhouchengming@bytedance.com> wrote:
>>>>>>
>>>>>> Pavan reported that PSI avgs_work idle shutoff is not working at
>>>>>> all: the PSI_NONIDLE condition is observed in
>>>>>> psi_avgs_work()->collect_percpu_times()->get_recent_times() even if
>>>>>> the kworker running avgs_work is the only task on the CPU.
>>>>>>
>>>>>> Although commit 1b69ac6b40eb ("psi: fix aggregation idle shut-off")
>>>>>> avoided the ping-pong wake problem when the worker sleeps,
>>>>>> psi_avgs_work() will still always re-arm the avgs_work, so the
>>>>>> shutoff never takes effect.
>>>>>>
>>>>>> This patch changes get_recent_times() to consider the current CPU's
>>>>>> groupc as IDLE when the kworker running avgs_work is the only running
>>>>>> task and there are no IOWAIT or MEMSTALL sleeping tasks, in which case
>>>>>> we can shut off the avgs_work once the other CPUs' groupc are also IDLE.
>>>>>>
>>>>>> One potential problem is that the brief period of non-idle time
>>>>>> incurred between the aggregation run and the kworker's dequeue will
>>>>>> be stranded in the per-cpu buckets until avgs_work runs next. The
>>>>>> buckets are u32 nanosecond counters and can hold about 4s worth of
>>>>>> time, and future activity will wake the avgs_work with a 2s delay,
>>>>>> giving us 2s worth of data we can leave behind when shutting off the
>>>>>> avgs_work. If the kworker runs other work after avgs_work is shut off
>>>>>> and there is no scheduler activity for 2s, this may be a problem.
>>>>>>
>>>>>> Reported-by: Pavan Kondeti <quic_pkondeti@quicinc.com>
>>>>>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>>>>>
>>>>> Copying my comments from
>>>>> https://lore.kernel.org/all/CAJuCfpHyYMak-mfVmtEN9Z-hGYQ6Wko57TLjukz9HaN26EDAuA@mail.gmail.com/
>>>>> in case you want to continue the discussion here...
>>>>>
>>>>>> ---
>>>>>>  kernel/sched/psi.c | 15 +++++++++++++++
>>>>>>  1 file changed, 15 insertions(+)
>>>>>>
>>>>>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>>>>>> index ee2ecc081422..f4cdf6f184ba 100644
>>>>>> --- a/kernel/sched/psi.c
>>>>>> +++ b/kernel/sched/psi.c
>>>>>> @@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>>>>                              u32 *pchanged_states)
>>>>>>  {
>>>>>>         struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
>>>>>> +       int current_cpu = raw_smp_processor_id();
>>>>>> +       bool only_avgs_work = false;
>>>>>>         u64 now, state_start;
>>>>>>         enum psi_states s;
>>>>>>         unsigned int seq;
>>>>>> @@ -256,6 +258,15 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>>>>                 memcpy(times, groupc->times, sizeof(groupc->times));
>>>>>>                 state_mask = groupc->state_mask;
>>>>>>                 state_start = groupc->state_start;
>>>>>> +               /*
>>>>>> +                * This CPU has only the avgs_work kworker running:
>>>>>> +                * snapshot the newest times and skip re-arming for this
>>>>>> +                * groupc. Normally the kworker will sleep soon and
>>>>>> +                * won't wake avgs_work back up in psi_group_change().
>>>>>> +                */
>>>>>> +               if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
>>>>>> +                   !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
>>>>>> +                       only_avgs_work = true;
>>>>>
>>>>> Why do you determine only_avgs_work while taking the snapshot? The
>>>>> read_seqcount_retry() might fail and the loop gets retried, which
>>>>> might lead to a wrong only_avgs_work value if the state changes
>>>>> between retries. I think it's safer to do this after the snapshot has
>>>>> been taken, and to use tasks[NR_RUNNING] instead of groupc->tasks.
>>>>
>>>> Ah, you are right; copying groupc->tasks[NR_RUNNING], tasks[NR_IOWAIT]
>>>> and tasks[NR_MEMSTALL] is OK with me. (Maybe we only need to copy
>>>> groupc->tasks[NR_RUNNING]?)
>>>>
>>>> Another way is to add an else branch:
>>>>
>>>>                 if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
>>>>                     !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
>>>>                         only_avgs_work = true;
>>>>                 else
>>>>                         only_avgs_work = false;
>>>>
>>>> Both are ok for me.
>>>
>>> Let's use the simple way and use the tasks[] after the snapshot is taken.
>>
>> Ok, I changed like this:
>>
>>         struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
>> +       int current_cpu = raw_smp_processor_id();
>> +       unsigned int tasks[NR_PSI_TASK_COUNTS];
>>         u64 now, state_start;
>>         enum psi_states s;
>>         unsigned int seq;
>> @@ -256,6 +258,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>                 memcpy(times, groupc->times, sizeof(groupc->times));
>>                 state_mask = groupc->state_mask;
>>                 state_start = groupc->state_start;
>> +               if (cpu == current_cpu)
>> +                       memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
>>         } while (read_seqcount_retry(&groupc->seq, seq));
>>
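>> and then, after the loop, do the only-avgs_work check on the snapshotted
>> counts; a sketch (reusing the only_avgs_work flag from the original patch):
>>
>> +       if (cpu == current_cpu && tasks[NR_RUNNING] == 1 &&
>> +           !tasks[NR_IOWAIT] && !tasks[NR_MEMSTALL])
>> +               only_avgs_work = true;
>>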
>>>
>>>>
>>>>>
>>>>>>         } while (read_seqcount_retry(&groupc->seq, seq));
>>>>>>
>>>>>>         /* Calculate state time deltas against the previous snapshot */
>>>>>> @@ -280,6 +291,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>>>>                 if (delta)
>>>>>>                         *pchanged_states |= (1 << s);
>>>>>>         }
>>>>>> +
>>>>>> +       /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
>>>>>> +       if (only_avgs_work)
>>>>>> +               *pchanged_states &= ~(1 << PSI_NONIDLE);
>>>>>
>>>>> This seems to be safe, because changed_states & (1 << PSI_NONIDLE) is
>>>>> used only for re-arming psi_avgs_work; however, semantically this is
>>>>> incorrect: the CPU was not idle when it was executing psi_avgs_work.
>>>>> IMO a separate flag would avoid this confusion.
>>>>
>>>> Yes, it's safe, but it carries that confusion. Using a separate flag
>>>> such as PSI_ONLY_AVGS_WORK looks better. But then in
>>>> collect_percpu_times() we still have to clear PSI_NONIDLE for this CPU
>>>> if PSI_ONLY_AVGS_WORK has been set.
>>>>
>>>>         for_each_possible_cpu(cpu) {
>>>>                 u32 times[NR_PSI_STATES];
>>>>                 u32 nonidle;
>>>>                 u32 cpu_changed_states;
>>>>
>>>>                 get_recent_times(group, cpu, aggregator, times,
>>>>                                 &cpu_changed_states);
>>>>                 changed_states |= cpu_changed_states;
>>>>
>>>> cpu_changed_states should have PSI_NONIDLE cleared if PSI_ONLY_AVGS_WORK
>>>> is already set.
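>>>>
>>>> A sketch of that clearing, just before the OR into changed_states
>>>> (using the hypothetical PSI_ONLY_AVGS_WORK bit):
>>>>
>>>>                 if (cpu_changed_states & (1 << PSI_ONLY_AVGS_WORK))
>>>>                         cpu_changed_states &= ~(1 << PSI_NONIDLE);
>>>>                 changed_states |= cpu_changed_states;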
>>>
>>> No, PSI_NONIDLE should not be affected by PSI_ONLY_AVGS_WORK. These
>>> flags should be independent and aggregated into changed_states without
>>> affecting each other, similar to what I suggested with
>>> PSI_STATE_WAKE_CLOCK in
>>> https://lore.kernel.org/all/CAJuCfpFr3JfwkWbDqkU=NUJbCYuCWGySwNusMCdmS3z95WD2AQ@mail.gmail.com/#t.
>>
>> If we don't clear PSI_NONIDLE for this CPU, then after
>> changed_states |= cpu_changed_states; changed_states has PSI_NONIDLE
>> set, and in psi_avgs_work() we can't tell whether the other CPUs are
>> IDLE or not.
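>>
>> For reference, the re-arm decision in psi_avgs_work() only looks at the
>> aggregated bit, roughly like this (trimmed from kernel/sched/psi.c):
>>
>>         collect_percpu_times(group, PSI_AVGS, &changed_states);
>>         nonidle = changed_states & (1 << PSI_NONIDLE);
>>         ...
>>         if (nonidle)
>>                 schedule_delayed_work(dwork, nsecs_to_jiffies(
>>                                 group->avg_next_update - now) + 1);
>>
>> So once this CPU's PSI_NONIDLE leaks into changed_states, nonidle is
>> always true and the work never shuts off.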
> 
> I was thinking something like this:
> 
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -244,6 +244,7 @@ static void get_recent_times(struct psi_group *group, int cpu,
>         enum psi_states s;
>         unsigned int seq;
>         u32 state_mask;
> +       bool reschedule;
> 
>         *pchanged_states = 0;
> 
> @@ -254,6 +255,14 @@ static void get_recent_times(struct psi_group *group, int cpu,
>                memcpy(times, groupc->times, sizeof(groupc->times));
>                state_mask = groupc->state_mask;
>                state_start = groupc->state_start;
> +              if (current_cpu == cpu)
> +                     reschedule = groupc->tasks[NR_RUNNING] +
> +                            groupc->tasks[NR_IOWAIT] +
> +                            groupc->tasks[NR_MEMSTALL] > 1;
> +              else
> +                     reschedule = times[PSI_NONIDLE] >
> +                            groupc->times_prev[aggregator][PSI_NONIDLE];
> +
>         } while (read_seqcount_retry(&groupc->seq, seq));
We can't use times[] here, because it will incorporate the currently
active states on the CPU (see the abridged loop below).
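
For context: after the snapshot loop, get_recent_times() folds the still
active states into times[] before computing deltas, roughly (abridged
from kernel/sched/psi.c):

	/* Calculate state time deltas against the previous snapshot */
	for (s = 0; s < NR_PSI_STATES; s++) {
		u32 delta;

		/* Incorporate currently active states on the CPU */
		if (state_mask & (1 << s))
			times[s] += now - state_start;

		delta = times[s] - groupc->times_prev[aggregator][s];
		groupc->times_prev[aggregator][s] = times[s];

		times[s] = delta;
		if (delta)
			*pchanged_states |= (1 << s);
	}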

Do I still need to copy groupc->tasks[] out for the current_cpu, as you
suggested before?

> 
>         /* Calculate state time deltas against the previous snapshot */
> @@ -278,6 +287,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
>                if (delta)
>                       *pchanged_states |= (1 << s);
>         }
> +       if (reschedule)
> +              *pchanged_states |= PSI_STATE_RESCHEDULE;

Is this ok?

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 4bd3913dd176..0cbcbf89a934 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
                             u32 *pchanged_states)
 {
        struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+       int current_cpu = raw_smp_processor_id();
+       bool reschedule;
        u64 now, state_start;
        enum psi_states s;
        unsigned int seq;
@@ -256,6 +258,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
                memcpy(times, groupc->times, sizeof(groupc->times));
                state_mask = groupc->state_mask;
                state_start = groupc->state_start;
+               if (cpu == current_cpu)
+                       reschedule = groupc->tasks[NR_RUNNING] +
+                               groupc->tasks[NR_IOWAIT] +
+                               groupc->tasks[NR_MEMSTALL] > 1;
        } while (read_seqcount_retry(&groupc->seq, seq));

        /* Calculate state time deltas against the previous snapshot */
@@ -280,6 +286,12 @@ static void get_recent_times(struct psi_group *group, int cpu,
                if (delta)
                        *pchanged_states |= (1 << s);
        }
+
+       if (cpu != current_cpu)
+               reschedule = *pchanged_states & (1 << PSI_NONIDLE);
+
+       if (reschedule)
+               *pchanged_states |= (1 << PSI_STATE_RESCHEDULE);
 }
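
If we go this way, I would define PSI_STATE_RESCHEDULE as a bit index
just past the regular states and key the re-arm in psi_avgs_work() on it
alone. A sketch, assuming the bit above NR_PSI_STATES is otherwise unused:

/* include/linux/psi_types.h: flag whether to re-arm avgs_work */
#define PSI_STATE_RESCHEDULE	NR_PSI_STATES

/* kernel/sched/psi.c, in psi_avgs_work(), after collect_percpu_times():
 * re-arm only when some CPU saw activity besides the kworker itself */
	if (changed_states & (1 << PSI_STATE_RESCHEDULE))
		schedule_delayed_work(dwork, nsecs_to_jiffies(
				group->avg_next_update - now) + 1);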


