From: Suren Baghdasaryan <surenb@google.com>
To: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Charan Teja Kalla <quic_charante@quicinc.com>
Subject: Re: PSI idle-shutoff
Date: Mon, 10 Oct 2022 13:59:12 -0700
Message-ID: <CAJuCfpHyYMak-mfVmtEN9Z-hGYQ6Wko57TLjukz9HaN26EDAuA@mail.gmail.com>
In-Reply-To: <ff2addac-5a6c-1aa5-5f1c-d62b0444ae4c@bytedance.com>

On Sun, Oct 9, 2022 at 6:17 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2022/10/9 20:41, Chengming Zhou wrote:
> > Hello,
> >
> > I just saw these emails, sorry for late.
> >
> > On 2022/10/6 00:32, Suren Baghdasaryan wrote:
> >> On Sun, Oct 2, 2022 at 11:11 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>>
> >>> On Fri, Sep 16, 2022 at 10:45 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>>>
> >>>> On Wed, Sep 14, 2022 at 11:20 PM Pavan Kondeti
> >>>> <quic_pkondeti@quicinc.com> wrote:
> >>>>>
> >>>>> On Tue, Sep 13, 2022 at 07:38:17PM +0530, Pavan Kondeti wrote:
> >>>>>> Hi
> >>>>>>
> >>>>>> Since psi_avgs_work()->collect_percpu_times()->get_recent_times()
> >>>>>> runs from a kworker thread, the PSI_NONIDLE condition is always
> >>>>>> observed, as there is a RUNNING task. So we always end up re-arming
> >>>>>> the work.
> >>>>>>
> >>>>>> If the work is re-armed from psi_avgs_work() itself, the back-off
> >>>>>> logic in psi_task_change() (which will be moved to psi_task_switch
> >>>>>> soon) can't help: the work is already scheduled, so we don't do
> >>>>>> anything there.
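
(For reference, the re-arm path being described above looks roughly like
this; paraphrased from kernel/sched/psi.c rather than quoted verbatim, so
details may differ:)

	/* inside psi_avgs_work() */
	collect_percpu_times(group, PSI_AVGS, &changed_states);
	nonidle = changed_states & (1 << PSI_NONIDLE);

	if (now >= group->avg_next_update)
		group->avg_next_update = update_averages(group, now);

	if (nonidle) {
		/*
		 * The kworker executing this function is itself a RUNNING
		 * task, so its CPU is never sampled as idle and the work
		 * keeps re-arming itself.
		 */
		schedule_delayed_work(dwork, nsecs_to_jiffies(
				group->avg_next_update - now) + 1);
	}
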
> >>>>
> >>>> Hi Pavan,
> >>>> Thanks for reporting the issue. IIRC [1] was meant to fix exactly this
> >>>> issue. At the time it was written I tested it and it seemed to work.
> >>>> Maybe I missed something or some other change introduced afterwards
> >>>> affected the shutoff logic. I'll take a closer look next week when I'm
> >>>> back at my computer and will consult with Johannes.
> >>>
> >>> Sorry for the delay. I had some time to look into this and test psi
> >>> shutoff on my device and I think you are right. The patch I mentioned
> >>> prevents a new psi_avgs_work from being scheduled when the only non-idle
> >>> task is psi_avgs_work itself; however, the regular 2-second averaging
> >>> work will still go on. I think we could record the fact that the only
> >>> active task is psi_avgs_work in record_times(), using a new
> >>> psi_group_cpu.state_mask flag, and then prevent psi_avgs_work() from
> >>> rescheduling itself if that flag is set for all non-idle CPUs. I'll
> >>> test this approach and will post a patch for review if that works.
> >>
> >> Hi Pavan,
> >> Testing PSI shutoff on Android proved more difficult than I expected:
> >> there are lots of tasks to silence, and I keep encountering new ones.
> >> The approach I was thinking about is something like this:
> >>
> >> ---
> >>  include/linux/psi_types.h |  3 +++
> >>  kernel/sched/psi.c        | 12 +++++++++---
> >>  2 files changed, 12 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> >> index c7fe7c089718..8d936f22cb5b 100644
> >> --- a/include/linux/psi_types.h
> >> +++ b/include/linux/psi_types.h
> >> @@ -68,6 +68,9 @@ enum psi_states {
> >>          NR_PSI_STATES = 7,
> >>  };
> >>
> >> +/* state_mask flag to keep re-arming averaging work */
> >> +#define PSI_STATE_WAKE_CLOCK        (1 << NR_PSI_STATES)
> >> +
> >>  enum psi_aggregators {
> >>          PSI_AVGS = 0,
> >>          PSI_POLL,
> >> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> >> index ecb4b4ff4ce0..dd62ad28bacd 100644
> >> --- a/kernel/sched/psi.c
> >> +++ b/kernel/sched/psi.c
> >> @@ -278,6 +278,7 @@ static void get_recent_times(struct psi_group *group, int cpu,
> >>                  if (delta)
> >>                          *pchanged_states |= (1 << s);
> >>          }
> >> +        *pchanged_states |= (state_mask & PSI_STATE_WAKE_CLOCK);
> >
> > If the avgs_work kworker is running on this CPU, won't it still see
> > PSI_STATE_WAKE_CLOCK set in state_mask? So the work will still be re-armed?
> >
> > Maybe I missed something... but I have a different idea, which I'd like
> > to implement later; I'm posting it below only for discussion.
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index ee2ecc081422..f322e8fd8d41 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -241,11 +241,13 @@ static void get_recent_times(struct psi_group *group, int cpu,
>                              enum psi_aggregators aggregator, u32 *times,
>                              u32 *pchanged_states)
>  {
> +       int current_cpu = raw_smp_processor_id();
>         struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
>         u64 now, state_start;
>         enum psi_states s;
>         unsigned int seq;
>         u32 state_mask;
> +       bool only_avgs_work = false;
>
>         *pchanged_states = 0;
>
> @@ -256,6 +258,14 @@ static void get_recent_times(struct psi_group *group, int cpu,
>                 memcpy(times, groupc->times, sizeof(groupc->times));
>                 state_mask = groupc->state_mask;
>                 state_start = groupc->state_start;
> +               /*
> +                * This CPU has only avgs_work kworker running, snapshot the
> +                * newest times then don't need to re-arm work for this groupc.
> +                * Normally this kworker will sleep soon and won't
> +                * wake_clock in psi_group_change().
> +                */
> +               if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1)
> +                       only_avgs_work = true;

Why do you determine only_avgs_work while taking the snapshot? The
read_seqcount_retry() might fail and the loop gets retried, which could
leave only_avgs_work with a value from a discarded snapshot if the state
changes between retries. I think it's safer to make this decision after
the snapshot has been taken, and to use a snapshotted tasks[NR_RUNNING]
instead of reading groupc->tasks directly.
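
Something along these lines, i.e. snapshot the task counts under the same
seqcount read and decide afterwards (untested sketch; the local tasks[]
array is new and only meant to illustrate the ordering I have in mind):

	unsigned int tasks[NR_PSI_TASK_COUNTS];

	do {
		seq = read_seqcount_begin(&groupc->seq);
		memcpy(times, groupc->times, sizeof(groupc->times));
		state_mask = groupc->state_mask;
		state_start = groupc->state_start;
		/* snapshot the task counts together with the rest */
		memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
	} while (read_seqcount_retry(&groupc->seq, seq));

	/* decide only on a consistent snapshot, after all retries are done */
	only_avgs_work = (current_cpu == cpu && tasks[NR_RUNNING] == 1);

That way a value from a discarded snapshot can never leak into the
decision.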

>         } while (read_seqcount_retry(&groupc->seq, seq));
>
>         /* Calculate state time deltas against the previous snapshot */
> @@ -280,6 +290,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
>                 if (delta)
>                         *pchanged_states |= (1 << s);
>         }
> +
> +       /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
> +       if (only_avgs_work)
> +               *pchanged_states &= ~(1 << PSI_NONIDLE);

This seems to be safe because changed_states & (1 << PSI_NONIDLE) is used
only for re-arming psi_avgs_work; semantically, however, it is incorrect:
the CPU was not idle while it was executing psi_avgs_work. IMO a separate
flag would avoid this confusion.
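
Something like the following, for example (rough sketch, not tested; the
PSI_STATE_RESCHEDULE name is just a placeholder chosen to sit above the
real PSI states):

	/* in psi_types.h: a re-arm hint bit above the real PSI states */
	#define PSI_STATE_RESCHEDULE	(1 << NR_PSI_STATES)

	/* in get_recent_times(), instead of clearing PSI_NONIDLE: */
	if ((*pchanged_states & (1 << PSI_NONIDLE)) && !only_avgs_work)
		*pchanged_states |= PSI_STATE_RESCHEDULE;

	/* in psi_avgs_work(), re-arm on the new bit instead of PSI_NONIDLE: */
	if (changed_states & PSI_STATE_RESCHEDULE)
		schedule_delayed_work(dwork, nsecs_to_jiffies(
				group->avg_next_update - now) + 1);

That keeps PSI_NONIDLE truthful and confines the "should we re-arm"
decision to a bit that only psi_avgs_work() looks at.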

>  }
>
>  static void calc_avgs(unsigned long avg[3], int missed_periods,
>
>
> >
> > Thanks.
> >
> >>  }
> >>
> >>  static void calc_avgs(unsigned long avg[3], int missed_periods,
> >> @@ -413,7 +414,7 @@ static void psi_avgs_work(struct work_struct *work)
> >>          struct delayed_work *dwork;
> >>          struct psi_group *group;
> >>          u32 changed_states;
> >> -        bool nonidle;
> >> +        bool wake_clock;
> >>          u64 now;
> >>
> >>          dwork = to_delayed_work(work);
> >> @@ -424,7 +425,7 @@ static void psi_avgs_work(struct work_struct *work)
> >>          now = sched_clock();
> >>
> >>          collect_percpu_times(group, PSI_AVGS, &changed_states);
> >> -        nonidle = changed_states & (1 << PSI_NONIDLE);
> >> +        wake_clock = changed_states & PSI_STATE_WAKE_CLOCK;
> >>          /*
> >>           * If there is task activity, periodically fold the per-cpu
> >>           * times and feed samples into the running averages. If things
> >> @@ -435,7 +436,7 @@ static void psi_avgs_work(struct work_struct *work)
> >>          if (now >= group->avg_next_update)
> >>                  group->avg_next_update = update_averages(group, now);
> >>
> >> -        if (nonidle) {
> >> +        if (wake_clock) {
> >>                  schedule_delayed_work(dwork, nsecs_to_jiffies(
> >>                                  group->avg_next_update - now) + 1);
> >>          }
> >> @@ -742,6 +743,11 @@ static void psi_group_change(struct psi_group *group, int cpu,
> >>          if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
> >>                  state_mask |= (1 << PSI_MEM_FULL);
> >>
> >> +        if (wake_clock || test_state(groupc->tasks, PSI_NONIDLE)) {
> >> +                /* psi_avgs_work was not the only task on the CPU */
> >> +                state_mask |= PSI_STATE_WAKE_CLOCK;
> >> +        }
> >> +
> >>          groupc->state_mask = state_mask;
> >>
> >>          write_seqcount_end(&groupc->seq);


Thread overview: 44+ messages
2022-09-13 14:08 PSI idle-shutoff Pavan Kondeti
2022-09-15  6:20 ` Pavan Kondeti
2022-09-17  5:45   ` Suren Baghdasaryan
2022-10-03  6:11     ` Suren Baghdasaryan
2022-10-05 16:32       ` Suren Baghdasaryan
2022-10-09 12:41         ` Chengming Zhou
2022-10-09 13:17           ` Chengming Zhou
2022-10-10  6:18             ` Pavan Kondeti
2022-10-10  6:43               ` Pavan Kondeti
2022-10-10  6:57                 ` [External] " Chengming Zhou
2022-10-10  8:30                   ` Chengming Zhou
2022-10-10  9:09                     ` Pavan Kondeti
2022-10-10  9:22                       ` Chengming Zhou
2022-10-10 20:59             ` Suren Baghdasaryan [this message]
2022-10-10 20:33           ` Suren Baghdasaryan
2022-10-10  5:57         ` Pavan Kondeti
2022-10-10  9:01           ` Pavan Kondeti
2022-10-10  6:25         ` Pavan Kondeti
2022-10-10 10:42 ` [PATCH] sched/psi: Fix avgs_work re-arm in psi_avgs_work() Chengming Zhou
2022-10-10 21:21   ` Suren Baghdasaryan
2022-10-11  0:07     ` Chengming Zhou
2022-10-11 17:00       ` Suren Baghdasaryan
2022-10-12  2:10         ` Chengming Zhou
2022-10-12 18:24           ` Suren Baghdasaryan
2022-10-13  2:23             ` Chengming Zhou
2022-10-13 11:06             ` Chengming Zhou
2022-10-13 15:52               ` Johannes Weiner
2022-10-13 16:10                 ` Suren Baghdasaryan
2022-10-14  2:03                   ` Chengming Zhou
2022-10-14  2:02                 ` Chengming Zhou
2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Chengming Zhou
2022-10-28  6:50     ` [External] " Chengming Zhou
2022-10-28 15:58       ` Suren Baghdasaryan
2022-10-28 16:05         ` Chengming Zhou
2022-10-28 19:53         ` [External] " Peter Zijlstra
2022-10-29 11:55           ` Peter Zijlstra
2022-10-29 12:40             ` Chengming Zhou
2022-10-29 18:46               ` Suren Baghdasaryan
2022-10-10 10:57 ` PSI idle-shutoff Hillf Danton
2022-10-10 21:16   ` Suren Baghdasaryan
2022-10-11 11:38     ` Hillf Danton
2022-10-11 17:11       ` Suren Baghdasaryan
2022-10-12  6:20         ` Hillf Danton
2022-10-12 15:40           ` Suren Baghdasaryan
