From: Suren Baghdasaryan <surenb@google.com>
To: Pavan Kondeti <quic_pkondeti@quicinc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Charan Teja Kalla <quic_charante@quicinc.com>
Subject: Re: PSI idle-shutoff
Date: Wed, 5 Oct 2022 09:32:44 -0700
Message-ID: <CAJuCfpFr3JfwkWbDqkU=NUJbCYuCWGySwNusMCdmS3z95WD2AQ@mail.gmail.com>
In-Reply-To: <CAJuCfpEeNzDQ-CvMN3fP5LejOzpnfgUgvkzpPj1CLF-8NqNoww@mail.gmail.com>

On Sun, Oct 2, 2022 at 11:11 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Sep 16, 2022 at 10:45 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Sep 14, 2022 at 11:20 PM Pavan Kondeti
> > <quic_pkondeti@quicinc.com> wrote:
> > >
> > > On Tue, Sep 13, 2022 at 07:38:17PM +0530, Pavan Kondeti wrote:
> > > > Hi
> > > >
> > > > Because psi_avgs_work()->collect_percpu_times()->get_recent_times()
> > > > runs from a kworker thread, the PSI_NONIDLE condition is always
> > > > observed, as there is a RUNNING task. So we always end up re-arming
> > > > the work.
> > > >
> > > > If the work is re-armed from psi_avgs_work() itself, the backing-off
> > > > logic in psi_task_change() (will be moved to psi_task_switch soon) can't
> > > > help. The work is already scheduled, so we don't do anything there.
> >
> > Hi Pavan,
> > Thanks for reporting the issue. IIRC [1] was meant to fix exactly this
> > issue. At the time it was written I tested it and it seemed to work.
> > Maybe I missed something or some other change introduced afterwards
> > affected the shutoff logic. I'll take a closer look next week when I'm
> > back at my computer and will consult with Johannes.
>
> Sorry for the delay. I had some time to look into this and test psi
> shutoff on my device and I think you are right. The patch I mentioned
> prevents new psi_avgs_work from being scheduled when the only non-idle
> task is psi_avgs_work itself, however the regular 2sec averaging work
> will still go on. I think we could record the fact that the only
> active task is psi_avgs_work in record_times() using a new
> psi_group_cpu.state_mask flag and then prevent psi_avgs_work() from
> rescheduling itself if that flag is set for all non-idle cpus. I'll
> test this approach and will post a patch for review if that works.

Hi Pavan,
Testing PSI shutoff on Android proved more difficult than I expected:
there are lots of tasks to silence and I keep encountering new ones.
The approach I was thinking about is something like this:

---
 include/linux/psi_types.h |  3 +++
 kernel/sched/psi.c        | 12 +++++++++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c7fe7c089718..8d936f22cb5b 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -68,6 +68,9 @@ enum psi_states {
         NR_PSI_STATES = 7,
 };

+/* state_mask flag to keep re-arming averaging work */
+#define PSI_STATE_WAKE_CLOCK        (1 << NR_PSI_STATES)
+
 enum psi_aggregators {
         PSI_AVGS = 0,
         PSI_POLL,
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ecb4b4ff4ce0..dd62ad28bacd 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -278,6 +278,7 @@ static void get_recent_times(struct psi_group *group, int cpu,
                 if (delta)
                         *pchanged_states |= (1 << s);
         }
+        *pchanged_states |= (state_mask & PSI_STATE_WAKE_CLOCK);
 }

 static void calc_avgs(unsigned long avg[3], int missed_periods,
@@ -413,7 +414,7 @@ static void psi_avgs_work(struct work_struct *work)
         struct delayed_work *dwork;
         struct psi_group *group;
         u32 changed_states;
-        bool nonidle;
+        bool wake_clock;
         u64 now;

         dwork = to_delayed_work(work);
@@ -424,7 +425,7 @@ static void psi_avgs_work(struct work_struct *work)
         now = sched_clock();

         collect_percpu_times(group, PSI_AVGS, &changed_states);
-        nonidle = changed_states & (1 << PSI_NONIDLE);
+        wake_clock = changed_states & PSI_STATE_WAKE_CLOCK;
         /*
          * If there is task activity, periodically fold the per-cpu
          * times and feed samples into the running averages. If things
@@ -435,7 +436,7 @@ static void psi_avgs_work(struct work_struct *work)
         if (now >= group->avg_next_update)
                 group->avg_next_update = update_averages(group, now);

-        if (nonidle) {
+        if (wake_clock) {
                 schedule_delayed_work(dwork, nsecs_to_jiffies(
                                 group->avg_next_update - now) + 1);
         }
@@ -742,6 +743,11 @@ static void psi_group_change(struct psi_group *group, int cpu,
         if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
                 state_mask |= (1 << PSI_MEM_FULL);

+        if (wake_clock || test_state(groupc->tasks, PSI_NONIDLE)) {
+                /* psi_avgs_work was not the only task on the CPU */
+                state_mask |= PSI_STATE_WAKE_CLOCK;
+        }
+
         groupc->state_mask = state_mask;

         write_seqcount_end(&groupc->seq);
-- 

This should detect the activity caused by psi_avgs_work() itself and
ignore it when deciding whether to reschedule the averaging work. In the
formula you posted:

non_idle_time = (work_start_now - wakeup_now) + (sleep_prev - work_end_prev)

the first term is calculated only if the PSI state is still active
(https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L271).
psi_group_change() will reset that state if psi_avgs_work() was the
only task on that CPU, so it won't affect non_idle_time. The code
above takes care of the second term. Could you please check whether
this approach helps? As I mentioned, I'm having trouble getting all the
tasks silent on Android for a clear test.

How often does the deferrable-timer issue you mentioned happen? If it
happens only occasionally and prevents PSI shutoff for a couple of
update cycles, then I don't think that's a huge problem. Once PSI
shutoff happens it should stay shut. Is that the case?
Thanks,
Suren.


> Thanks,
> Suren.
>
> > Thanks,
> > Suren.
> >
> > [1] 1b69ac6b40eb "psi: fix aggregation idle shut-off"
> >
> > > >
> > > > Probably I am missing something here. Can you please clarify how we
> > > > shut off re-arming the psi avg work?
> > > >
> > >
> > > I have collected traces on an idle system (running android12-5.10 with minimal
> > > user space). This is an older kernel; however, the issue remains on the latest
> > > kernel as per code inspection.
> > >
> > > I have eliminated noise created by other work items, for example vmstat_work.
> > > This is a deferrable work but gets executed since it is queued on the same
> > > CPU on which the PSI work timer is queued. So I have increased
> > > sysctl_stat_interval to 60 * HZ to suppress this work.
> > >
> > > As we can see from the traces, CPU#7 comes out of idle only to execute the PSI
> > > work every 2 seconds. The work is always re-armed from psi_avgs_work()
> > > as it finds the PSI_NONIDLE condition. The non-idle time is essentially
> > >
> > > non_idle_time = (work_start_now - wakeup_now) + (sleep_prev - work_end_prev)
> > >
> > > The first term accounts for the non-idle time from when the task is woken up
> > > (queued) until the work item executes. It is about 4 usec
> > > (54.119420 - 54.119416).
> > >
> > > The second term accounts for the previous update: about 2 usec (52.135424 -
> > > 52.135422).
> > >
> > > The PSI work needs to run only when there is some activity after the last update,
> > > i.e. since the last time the work ran. Since we use a non-deferrable timer, the
> > > other deferrable timers get woken up and they might queue work or wake up other
> > > threads, creating activity which in turn causes the PSI work to be scheduled.
> > >
> > > The PSI work can't just be made deferrable, because it is system-level
> > > work: if the CPU on which it is queued is idle for a long duration but the
> > > other CPUs are active, we miss PSI updates. What we probably need is a global
> > > deferrable timer [1], i.e. a timer that is not bound to any CPU but
> > > runs when any of the CPUs comes out of idle. As long as one CPU is busy, we keep
> > > running the PSI work, but if the whole system is idle, we never wake up.
> > >
> > >           <idle>-0     [007]    52.135402: cpu_idle:             state=4294967295 cpu_id=7
> > >           <idle>-0     [007]    52.135415: workqueue_activate_work: work struct 0xffffffc011bd5010
> > >           <idle>-0     [007]    52.135417: sched_wakeup:         comm=kworker/7:3 pid=196 prio=120 target_cpu=007
> > >           <idle>-0     [007]    52.135421: sched_switch:         prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/7:3 next_pid=196 next_prio=120
> > >      kworker/7:3-196   [007]    52.135421: workqueue_execute_start: work struct 0xffffffc011bd5010: function psi_avgs_work
> > >      kworker/7:3-196   [007]    52.135422: timer_start:          timer=0xffffffc011bd5040 function=delayed_work_timer_fn expires=4294905814 [timeout=494] cpu=7 idx=123 flags=D|P|I
> > >      kworker/7:3-196   [007]    52.135422: workqueue_execute_end: work struct 0xffffffc011bd5010: function psi_avgs_work
> > >      kworker/7:3-196   [007]    52.135424: sched_switch:         prev_comm=kworker/7:3 prev_pid=196 prev_prio=120 prev_state=I ==> next_comm=swapper/7 next_pid=0 next_prio=120
> > >           <idle>-0     [007]    52.135428: cpu_idle:             state=0 cpu_id=7
> > >
> > >           <system is idle and gets woken up after 2 seconds due to PSI work>
> > >
> > >           <idle>-0     [007]    54.119402: cpu_idle:             state=4294967295 cpu_id=7
> > >           <idle>-0     [007]    54.119414: workqueue_activate_work: work struct 0xffffffc011bd5010
> > >           <idle>-0     [007]    54.119416: sched_wakeup:         comm=kworker/7:3 pid=196 prio=120 target_cpu=007
> > >           <idle>-0     [007]    54.119420: sched_switch:         prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/7:3 next_pid=196 next_prio=120
> > >      kworker/7:3-196   [007]    54.119420: workqueue_execute_start: work struct 0xffffffc011bd5010: function psi_avgs_work
> > >      kworker/7:3-196   [007]    54.119421: timer_start:          timer=0xffffffc011bd5040 function=delayed_work_timer_fn expires=4294906315 [timeout=499] cpu=7 idx=122 flags=D|P|I
> > >      kworker/7:3-196   [007]    54.119422: workqueue_execute_end: work struct 0xffffffc011bd5010: function psi_avgs_work
> > >
> > > [1]
> > > https://lore.kernel.org/lkml/1430188744-24737-1-git-send-email-joonwoop@codeaurora.org/
> > >
> > > Thanks,
> > > Pavan
