* [RESEND PATCH v4 1/1] psi: stop relying on timer_pending for poll_work rescheduling
@ 2022-10-10 22:57 Suren Baghdasaryan
  2022-10-20 22:25 ` Suren Baghdasaryan
       [not found] ` <20221021013613.1428-1-hdanton@sina.com>
  0 siblings, 2 replies; 6+ messages in thread
From: Suren Baghdasaryan @ 2022-10-10 22:57 UTC
  To: peterz
  Cc: hannes, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, matthias.bgg, minchan,
	yt.chang, wenju.xu, jonathan.jmchen, show-hong.chen,
	linux-kernel, linux-arm-kernel, linux-mediatek, kernel-team,
	surenb

The psi polling mechanism tries to minimize the number of wakeups needed
to run psi_poll_work and currently relies on timer_pending() to detect
when this work is already scheduled. This leaves a window in which
psi_group_change can schedule an immediate psi_poll_work after
poll_timer_fn has been called but before psi_poll_work has rescheduled
itself. The entire window is depicted below:

poll_timer_fn
  wake_up_interruptible(&group->poll_wait);

psi_poll_worker
  wait_event_interruptible(group->poll_wait, ...)
  psi_poll_work
    psi_schedule_poll_work
      if (timer_pending(&group->poll_timer)) return;
      ...
      mod_timer(&group->poll_timer, jiffies + delay);

Prior to 461daba06bdc we relied on the poll_scheduled atomic, which was
reset and set back inside psi_poll_work, so this race window was much
smaller.
The larger window causes an increased number of wakeups, and our partners
report a visible power regression of ~10mA after applying 461daba06bdc.
Bring back the poll_scheduled atomic and make this race window even
narrower by resetting poll_scheduled only when we reach polling expiration
time. This does not completely eliminate the possibility of extra wakeups
caused by a race with psi_group_change; however, it limits them to at
most one extra wakeup per tracking window (0.5s in the worst case).
This patch also ensures correct ordering between clearing the
poll_scheduled flag and obtaining changed_states by using a memory
barrier. Correct ordering between updating changed_states and setting
poll_scheduled is ensured by the atomic_xchg operation.
By tracing the number of immediate rescheduling attempts performed by
psi_group_change and the number of these attempts being blocked due to
psi monitor being already active, we can assess the effects of this change:

Before the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           684365   1385156    1261240
Immediate reschedules blocked:             682846   1381654    1258682
Immediate reschedules (delta):             1519     3502       2558
Immediate reschedules (% of attempted):    0.22%    0.25%      0.20%

After the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           882244   770298    426218
Immediate reschedules blocked:             881996   769796    426074
Immediate reschedules (delta):             248      502       144
Immediate reschedules (% of attempted):    0.03%    0.07%     0.03%

The share of non-blocked immediate reschedules dropped from 0.20-0.25%
to 0.03-0.07% of attempts. The drop is attributed to the smaller race
window and to the fact that we allow this race only when psi monitors
reach polling window expiration time.
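
The derived rows in the tables above follow directly from the traced
counters: delta is attempted minus blocked, and the percentage is delta
over attempted. A small sketch of that arithmetic, with unblocked_pct()
being a hypothetical helper name used only for illustration:

```c
/*
 * Share of attempted immediate reschedules that were NOT blocked,
 * expressed in percent. Hypothetical helper, not part of the patch.
 */
static double unblocked_pct(unsigned long attempted, unsigned long blocked)
{
	/* Guard against division by zero and inconsistent counters. */
	if (attempted == 0 || blocked > attempted)
		return 0.0;

	return 100.0 * (double)(attempted - blocked) / (double)attempted;
}
```

For example, unblocked_pct(684365, 682846) reproduces the first
"before" column and unblocked_pct(426218, 426074) the last "after"
column, up to rounding.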

Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Reported-by: Kathleen Chang <yt.chang@mediatek.com>
Reported-by: Wenju Xu <wenju.xu@mediatek.com>
Reported-by: Jonathan Chen <jonathan.jmchen@mediatek.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: SH Chen <show-hong.chen@mediatek.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
This patch somehow slipped through the cracks after being acked by Johannes in
[1] and I didn't notice it until now because we cherry-picked it into Android
kernel trees due to the urgency at that time. On the bright side, this change
has been tested for about a year in the field by millions of devices.
Resending v4 of this patch previously posted at [2], rebased on the latest
Linus' TOT.

[1] https://lore.kernel.org/lkml/YOdwxh3487PeMHRX@cmpxchg.org/
[2] https://lore.kernel.org/lkml/20210708203648.2399667-1-surenb@google.com/
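
For readers following the ordering argument in the psi_poll_work
comment, here is a minimal userspace sketch of the poll_scheduled
handshake. It is not the kernel code: struct poll_model,
model_task_change() and model_poll_expired() are hypothetical names,
C11 seq_cst atomics stand in for atomic_xchg() and smp_mb() (they are
an approximation, not the kernel memory model), and a single bitmask
stands in for the per-CPU state tracking.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two pieces of psi_group state involved. */
struct poll_model {
	atomic_int poll_scheduled;	/* 1 while a poll is considered scheduled */
	atomic_uint states;		/* simplified "changed states" bitmask */
};

/*
 * Task-change side: publish the state change, then try to claim the
 * scheduling slot. Returns true if the caller has to arm the timer.
 */
static bool model_task_change(struct poll_model *m, unsigned int state_bit)
{
	atomic_fetch_or(&m->states, state_bit);			/* STORE states */
	return atomic_exchange(&m->poll_scheduled, 1) == 0;	/* fully ordered */
}

/*
 * Poll-worker side at window expiry: clear the flag, issue a full fence,
 * and only then read the states.
 */
static unsigned int model_poll_expired(struct poll_model *m)
{
	atomic_store(&m->poll_scheduled, 0);
	atomic_thread_fence(memory_order_seq_cst);	/* models smp_mb() */
	return atomic_load(&m->states);			/* LOAD states */
}
```

In this model, whichever way the two sides interleave, either the
worker's load observes the new state bit or the exchange observes the
cleared flag and returns true, so a state change is not left without a
scheduled poll. Repeated calls to model_task_change() while the flag is
set return false, which corresponds to the "blocked" counter in the
tables above.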

 include/linux/psi_types.h |  1 +
 kernel/sched/psi.c        | 60 +++++++++++++++++++++++++++++++++------
 2 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c7fe7c089718..3f78c9bf7bb1 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -170,6 +170,7 @@ struct psi_group {
 	struct timer_list poll_timer;
 	wait_queue_head_t poll_wait;
 	atomic_t poll_wakeup;
+	atomic_t poll_scheduled;
 
 	/* Protects data used by the monitor */
 	struct mutex trigger_lock;
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 7f6030091aee..2f548beeae50 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -188,6 +188,7 @@ static void group_init(struct psi_group *group)
 	INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
 	mutex_init(&group->avgs_lock);
 	/* Init trigger-related members */
+	atomic_set(&group->poll_scheduled, 0);
 	mutex_init(&group->trigger_lock);
 	INIT_LIST_HEAD(&group->triggers);
 	group->poll_min_period = U32_MAX;
@@ -561,18 +562,17 @@ static u64 update_triggers(struct psi_group *group, u64 now)
 	return now + group->poll_min_period;
 }
 
-/* Schedule polling if it's not already scheduled. */
-static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay)
+/* Schedule polling if it's not already scheduled or forced. */
+static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay,
+				   bool force)
 {
 	struct task_struct *task;
 
 	/*
-	 * Do not reschedule if already scheduled.
-	 * Possible race with a timer scheduled after this check but before
-	 * mod_timer below can be tolerated because group->polling_next_update
-	 * will keep updates on schedule.
+	 * atomic_xchg should be called even when !force to provide a
+	 * full memory barrier (see the comment inside psi_poll_work).
 	 */
-	if (timer_pending(&group->poll_timer))
+	if (atomic_xchg(&group->poll_scheduled, 1) && !force)
 		return;
 
 	rcu_read_lock();
@@ -584,12 +584,15 @@ static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay)
 	 */
 	if (likely(task))
 		mod_timer(&group->poll_timer, jiffies + delay);
+	else
+		atomic_set(&group->poll_scheduled, 0);
 
 	rcu_read_unlock();
 }
 
 static void psi_poll_work(struct psi_group *group)
 {
+	bool force_reschedule = false;
 	u32 changed_states;
 	u64 now;
 
@@ -597,6 +600,43 @@ static void psi_poll_work(struct psi_group *group)
 
 	now = sched_clock();
 
+	if (now > group->polling_until) {
+		/*
+		 * We are either about to start or might stop polling if no
+		 * state change was recorded. Resetting poll_scheduled leaves
+		 * a small window for psi_group_change to sneak in and schedule
+		 * an immediate poll_work before we get to rescheduling. One
+		 * potential extra wakeup at the end of the polling window
+		 * should be negligible and polling_next_update still keeps
+		 * updates correctly on schedule.
+		 */
+		atomic_set(&group->poll_scheduled, 0);
+		/*
+		 * A task change can race with the poll worker that is supposed to
+		 * report on it. To avoid missing events, ensure ordering between
+		 * poll_scheduled and the task state accesses, such that if the poll
+		 * worker misses the state update, the task change is guaranteed to
+		 * reschedule the poll worker:
+		 *
+		 * poll worker:
+		 *   atomic_set(poll_scheduled, 0)
+		 *   smp_mb()
+		 *   LOAD states
+		 *
+		 * task change:
+		 *   STORE states
+		 *   if atomic_xchg(poll_scheduled, 1) == 0:
+		 *     schedule poll worker
+		 *
+		 * The atomic_xchg() implies a full barrier.
+		 */
+		smp_mb();
+	} else {
+		/* Polling window is not over, keep rescheduling */
+		force_reschedule = true;
+	}
+
+
 	collect_percpu_times(group, PSI_POLL, &changed_states);
 
 	if (changed_states & group->poll_states) {
@@ -622,7 +662,8 @@ static void psi_poll_work(struct psi_group *group)
 		group->polling_next_update = update_triggers(group, now);
 
 	psi_schedule_poll_work(group,
-		nsecs_to_jiffies(group->polling_next_update - now) + 1);
+		nsecs_to_jiffies(group->polling_next_update - now) + 1,
+		force_reschedule);
 
 out:
 	mutex_unlock(&group->trigger_lock);
@@ -747,7 +788,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	write_seqcount_end(&groupc->seq);
 
 	if (state_mask & group->poll_states)
-		psi_schedule_poll_work(group, 1);
+		psi_schedule_poll_work(group, 1, false);
 
 	if (wake_clock && !delayed_work_pending(&group->avgs_work))
 		schedule_delayed_work(&group->avgs_work, PSI_FREQ);
@@ -1223,6 +1264,7 @@ void psi_trigger_destroy(struct psi_trigger *t)
 		 * can no longer be found through group->poll_task.
 		 */
 		kthread_stop(task_to_destroy);
+		atomic_set(&group->poll_scheduled, 0);
 	}
 	kfree(t);
 }
-- 
2.38.0.rc1.362.ged0d419d3c-goog



* Re: [RESEND PATCH v4 1/1] psi: stop relying on timer_pending for poll_work rescheduling
  2022-10-10 22:57 [RESEND PATCH v4 1/1] psi: stop relying on timer_pending for poll_work rescheduling Suren Baghdasaryan
@ 2022-10-20 22:25 ` Suren Baghdasaryan
  2022-10-24  9:56   ` Peter Zijlstra
       [not found] ` <20221021013613.1428-1-hdanton@sina.com>
  1 sibling, 1 reply; 6+ messages in thread
From: Suren Baghdasaryan @ 2022-10-20 22:25 UTC
  To: peterz
  Cc: hannes, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, matthias.bgg, minchan,
	yt.chang, wenju.xu, jonathan.jmchen, show-hong.chen,
	linux-kernel, linux-arm-kernel, linux-mediatek, kernel-team

On Mon, Oct 10, 2022 at 3:57 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Psi polling mechanism is trying to minimize the number of wakeups to
> run psi_poll_work and is currently relying on timer_pending() to detect
> when this work is already scheduled. This provides a window of opportunity
> for psi_group_change to schedule an immediate psi_poll_work after
> poll_timer_fn got called but before psi_poll_work could reschedule itself.
> Below is the depiction of this entire window:
>
> poll_timer_fn
>   wake_up_interruptible(&group->poll_wait);
>
> psi_poll_worker
>   wait_event_interruptible(group->poll_wait, ...)
>   psi_poll_work
>     psi_schedule_poll_work
>       if (timer_pending(&group->poll_timer)) return;
>       ...
>       mod_timer(&group->poll_timer, jiffies + delay);
>
> Prior to 461daba06bdc we used to rely on poll_scheduled atomic which was
> reset and set back inside psi_poll_work and therefore this race window
> was much smaller.
> The larger window causes increased number of wakeups and our partners
> report visible power regression of ~10mA after applying 461daba06bdc.
> Bring back the poll_scheduled atomic and make this race window even
> narrower by resetting poll_scheduled only when we reach polling expiration
> time. This does not completely eliminate the possibility of extra wakeups
> caused by a race with psi_group_change however it will limit it to the
> worst case scenario of one extra wakeup per every tracking window (0.5s
> in the worst case).
> This patch also ensures correct ordering between clearing poll_scheduled
> flag and obtaining changed_states using memory barrier. Correct ordering
> between updating changed_states and setting poll_scheduled is ensured by
> atomic_xchg operation.
> By tracing the number of immediate rescheduling attempts performed by
> psi_group_change and the number of these attempts being blocked due to
> psi monitor being already active, we can assess the effects of this change:
>
> Before the patch:
>                                            Run#1    Run#2      Run#3
> Immediate reschedules attempted:           684365   1385156    1261240
> Immediate reschedules blocked:             682846   1381654    1258682
> Immediate reschedules (delta):             1519     3502       2558
> Immediate reschedules (% of attempted):    0.22%    0.25%      0.20%
>
> After the patch:
>                                            Run#1    Run#2      Run#3
> Immediate reschedules attempted:           882244   770298    426218
> Immediate reschedules blocked:             881996   769796    426074
> Immediate reschedules (delta):             248      502       144
> Immediate reschedules (% of attempted):    0.03%    0.07%     0.03%
>
> The number of non-blocked immediate reschedules dropped from 0.22-0.25%
> to 0.03-0.07%. The drop is attributed to the decrease in the race window
> size and the fact that we allow this race only when psi monitors reach
> polling window expiration time.
>
> Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
> Reported-by: Kathleen Chang <yt.chang@mediatek.com>
> Reported-by: Wenju Xu <wenju.xu@mediatek.com>
> Reported-by: Jonathan Chen <jonathan.jmchen@mediatek.com>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Tested-by: SH Chen <show-hong.chen@mediatek.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> This patch somehow slipped through the cracks after being acked by Johannes in
> [1] and I didn't notice it until now because we cherry-picked it into Android
> kernel trees due to the urgency at that time. On the bright side, this change
> has been tested for about a year in the field by millions of devices.
> Resending v4 of this patch previously posted at [2], rebased on the latest
> Linus' TOT.

Hi Peter,
We missed this Ack'ed patch last year and as I described above I
didn't notice that up until now. With rc1 released, hopefully it's a
good time to ping you to ask for inclusion of this patch in your tree.
If the timing is not good, please let me know when to remind you and
I'll send another email. Just want to make sure it does not slip
again.

Just FYI, we have two other Ack'ed PSI patches for you to consider:

https://lore.kernel.org/all/20221014110551.22695-1-zhouchengming@bytedance.com/
https://lore.kernel.org/all/20220919072356.GA29069@haolee.io/

Thanks,
Suren.





* Re: [RESEND PATCH v4 1/1] psi: stop relying on timer_pending for poll_work rescheduling
       [not found] ` <20221021013613.1428-1-hdanton@sina.com>
@ 2022-10-21 19:54   ` Suren Baghdasaryan
       [not found]     ` <20221022035200.1911-1-hdanton@sina.com>
  0 siblings, 1 reply; 6+ messages in thread
From: Suren Baghdasaryan @ 2022-10-21 19:54 UTC
  To: Hillf Danton; +Cc: peterz, hannes, linux-kernel

On Thu, Oct 20, 2022 at 7:11 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On 10 Oct 2022 15:57:44 -0700 Suren Baghdasaryan <surenb@google.com>
> > Psi polling mechanism is trying to minimize the number of wakeups to
> > run psi_poll_work and is currently relying on timer_pending() to detect
> > when this work is already scheduled. This provides a window of opportunity
> > for psi_group_change to schedule an immediate psi_poll_work after
> > poll_timer_fn got called but before psi_poll_work could reschedule itself.
> > Below is the depiction of this entire window:
> >
> > poll_timer_fn
> >   wake_up_interruptible(&group->poll_wait);
> >
> > psi_poll_worker
> >   wait_event_interruptible(group->poll_wait, ...)
> >   psi_poll_work
> >     psi_schedule_poll_work
> >       if (timer_pending(&group->poll_timer)) return;
> >       ...
> >       mod_timer(&group->poll_timer, jiffies + delay);
>
> [...]
>
> >
> > -/* Schedule polling if it's not already scheduled. */
> > -static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay)
> > +/* Schedule polling if it's not already scheduled or forced. */
> > +static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay,
> > +                                bool force)
> >  {
> >       struct task_struct *task;
> >
> >       /*
> > -      * Do not reschedule if already scheduled.
> > -      * Possible race with a timer scheduled after this check but before
> > -      * mod_timer below can be tolerated because group->polling_next_update
> > -      * will keep updates on schedule.
> > +      * atomic_xchg should be called even when !force to provide a
> > +      * full memory barrier (see the comment inside psi_poll_work).
> >        */
> > -     if (timer_pending(&group->poll_timer))
> > +     if (atomic_xchg(&group->poll_scheduled, 1) && !force)
> >               return;
>
> If poll_scheduled works, turning poll_timer, which only wakes up poll
> worker, to a delayed work also works because schedule_delayed_work()
> takes care of pending work, with the bonus of cutting poll worker.

Thanks for the suggestion, Hillf.
psi_poll_worker runs at a low FIFO priority to prevent normal tasks
from preempting PSI signal generation (see sched_set_fifo_low() call
inside psi_poll_worker()), so schedule_delayed_work() would not be
usable as is I think, since it uses normal priority system_wq. I would
probably need to use queue_delayed_work() with a dedicated workqueue
that uses a worker with worker->task set to the same FIFO priority.
However I'm not sure it's worth creating a workqueue for only one task
that might be scheduled in it...
Thanks,
Suren.

>
> Hillf


* Re: [RESEND PATCH v4 1/1] psi: stop relying on timer_pending for poll_work rescheduling
       [not found]     ` <20221022035200.1911-1-hdanton@sina.com>
@ 2022-10-22  4:03       ` Suren Baghdasaryan
  0 siblings, 0 replies; 6+ messages in thread
From: Suren Baghdasaryan @ 2022-10-22  4:03 UTC
  To: Hillf Danton; +Cc: peterz, hannes, linux-kernel

On Fri, Oct 21, 2022 at 8:52 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On 21 Oct 2022 12:54:16 -0700 Suren Baghdasaryan <surenb@google.com>
> > psi_poll_worker runs at a low FIFO priority to prevent normal tasks
> > from preempting PSI signal generation (see sched_set_fifo_low() call
> > inside psi_poll_worker()), so schedule_delayed_work() would not be
> > usable as is I think, since it uses normal priority system_wq.
>
> I missed FIFO and sorry for my noise.

No issues at all. I appreciate your input!

>
> Hillf


* Re: [RESEND PATCH v4 1/1] psi: stop relying on timer_pending for poll_work rescheduling
  2022-10-20 22:25 ` Suren Baghdasaryan
@ 2022-10-24  9:56   ` Peter Zijlstra
  2022-10-24 20:45     ` Suren Baghdasaryan
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2022-10-24  9:56 UTC
  To: Suren Baghdasaryan
  Cc: hannes, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, matthias.bgg, minchan,
	yt.chang, wenju.xu, jonathan.jmchen, show-hong.chen,
	linux-kernel, linux-arm-kernel, linux-mediatek, kernel-team

On Thu, Oct 20, 2022 at 03:25:47PM -0700, Suren Baghdasaryan wrote:
> On Mon, Oct 10, 2022 at 3:57 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > Psi polling mechanism is trying to minimize the number of wakeups to
> > run psi_poll_work and is currently relying on timer_pending() to detect
> > when this work is already scheduled. This provides a window of opportunity
> > for psi_group_change to schedule an immediate psi_poll_work after
> > poll_timer_fn got called but before psi_poll_work could reschedule itself.
> > Below is the depiction of this entire window:
> >
> > poll_timer_fn
> >   wake_up_interruptible(&group->poll_wait);
> >
> > psi_poll_worker
> >   wait_event_interruptible(group->poll_wait, ...)
> >   psi_poll_work
> >     psi_schedule_poll_work
> >       if (timer_pending(&group->poll_timer)) return;
> >       ...
> >       mod_timer(&group->poll_timer, jiffies + delay);
> >
> > Prior to 461daba06bdc we used to rely on poll_scheduled atomic which was
> > reset and set back inside psi_poll_work and therefore this race window
> > was much smaller.
> > The larger window causes increased number of wakeups and our partners
> > report visible power regression of ~10mA after applying 461daba06bdc.
> > Bring back the poll_scheduled atomic and make this race window even
> > narrower by resetting poll_scheduled only when we reach polling expiration
> > time. This does not completely eliminate the possibility of extra wakeups
> > caused by a race with psi_group_change however it will limit it to the
> > worst case scenario of one extra wakeup per every tracking window (0.5s
> > in the worst case).
> > This patch also ensures correct ordering between clearing poll_scheduled
> > flag and obtaining changed_states using memory barrier. Correct ordering
> > between updating changed_states and setting poll_scheduled is ensured by
> > atomic_xchg operation.
> > By tracing the number of immediate rescheduling attempts performed by
> > psi_group_change and the number of these attempts being blocked due to
> > psi monitor being already active, we can assess the effects of this change:
> >
> > Before the patch:
> >                                            Run#1    Run#2      Run#3
> > Immediate reschedules attempted:           684365   1385156    1261240
> > Immediate reschedules blocked:             682846   1381654    1258682
> > Immediate reschedules (delta):             1519     3502       2558
> > Immediate reschedules (% of attempted):    0.22%    0.25%      0.20%
> >
> > After the patch:
> >                                            Run#1    Run#2      Run#3
> > Immediate reschedules attempted:           882244   770298    426218
> > Immediate reschedules blocked:             881996   769796    426074
> > Immediate reschedules (delta):             248      502       144
> > Immediate reschedules (% of attempted):    0.03%    0.07%     0.03%
> >
> > The number of non-blocked immediate reschedules dropped from 0.22-0.25%
> > to 0.03-0.07%. The drop is attributed to the decrease in the race window
> > size and the fact that we allow this race only when psi monitors reach
> > polling window expiration time.
> >
> > Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
> > Reported-by: Kathleen Chang <yt.chang@mediatek.com>
> > Reported-by: Wenju Xu <wenju.xu@mediatek.com>
> > Reported-by: Jonathan Chen <jonathan.jmchen@mediatek.com>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Tested-by: SH Chen <show-hong.chen@mediatek.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> > This patch somehow slipped through the cracks after being acked by Johannes in
> > [1] and I didn't notice it until now because we cherry-picked it into Android
> > kernel trees due to the urgency at that time. On the bright side, this change
> > has been tested for about a year in the field by millions of devices.
> > Resending v4 of this patch previously posted at [2], rebased on the latest
> > Linus' TOT.
> 
> Hi Peter,
> We missed this Ack'ed patch last year and as I described above I
> didn't notice that up until now. With rc1 released, hopefully it's a
> good time to ping you to ask for inclusion of this patch in your tree.
> If the timing is not good, please let me know when to remind you and
> I'll send another email. Just want to make sure it does not slip
> again.
> 
> Just FYI, we have two other Ack'ed PSI patches for you to consider:
> 
> https://lore.kernel.org/all/20221014110551.22695-1-zhouchengming@bytedance.com/
> https://lore.kernel.org/all/20220919072356.GA29069@haolee.io/

Thanks for the poke; I've picked up all three and will place them in
sched/core.


* Re: [RESEND PATCH v4 1/1] psi: stop relying on timer_pending for poll_work rescheduling
  2022-10-24  9:56   ` Peter Zijlstra
@ 2022-10-24 20:45     ` Suren Baghdasaryan
  0 siblings, 0 replies; 6+ messages in thread
From: Suren Baghdasaryan @ 2022-10-24 20:45 UTC
  To: Peter Zijlstra
  Cc: hannes, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, matthias.bgg, minchan,
	yt.chang, wenju.xu, jonathan.jmchen, show-hong.chen,
	linux-kernel, linux-arm-kernel, linux-mediatek, kernel-team

On Mon, Oct 24, 2022 at 2:56 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 20, 2022 at 03:25:47PM -0700, Suren Baghdasaryan wrote:
> > On Mon, Oct 10, 2022 at 3:57 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > Psi polling mechanism is trying to minimize the number of wakeups to
> > > run psi_poll_work and is currently relying on timer_pending() to detect
> > > when this work is already scheduled. This provides a window of opportunity
> > > for psi_group_change to schedule an immediate psi_poll_work after
> > > poll_timer_fn got called but before psi_poll_work could reschedule itself.
> > > Below is the depiction of this entire window:
> > >
> > > poll_timer_fn
> > >   wake_up_interruptible(&group->poll_wait);
> > >
> > > psi_poll_worker
> > >   wait_event_interruptible(group->poll_wait, ...)
> > >   psi_poll_work
> > >     psi_schedule_poll_work
> > >       if (timer_pending(&group->poll_timer)) return;
> > >       ...
> > >       mod_timer(&group->poll_timer, jiffies + delay);
> > >
> > > Prior to 461daba06bdc we used to rely on poll_scheduled atomic which was
> > > reset and set back inside psi_poll_work and therefore this race window
> > > was much smaller.
> > > The larger window causes increased number of wakeups and our partners
> > > report visible power regression of ~10mA after applying 461daba06bdc.
> > > Bring back the poll_scheduled atomic and make this race window even
> > > narrower by resetting poll_scheduled only when we reach polling expiration
> > > time. This does not completely eliminate the possibility of extra wakeups
> > > caused by a race with psi_group_change however it will limit it to the
> > > worst case scenario of one extra wakeup per every tracking window (0.5s
> > > in the worst case).
> > > This patch also ensures correct ordering between clearing poll_scheduled
> > > flag and obtaining changed_states using memory barrier. Correct ordering
> > > between updating changed_states and setting poll_scheduled is ensured by
> > > atomic_xchg operation.
> > > By tracing the number of immediate rescheduling attempts performed by
> > > psi_group_change and the number of these attempts being blocked due to
> > > psi monitor being already active, we can assess the effects of this change:
> > >
> > > Before the patch:
> > >                                            Run#1    Run#2      Run#3
> > > Immediate reschedules attempted:           684365   1385156    1261240
> > > Immediate reschedules blocked:             682846   1381654    1258682
> > > Immediate reschedules (delta):             1519     3502       2558
> > > Immediate reschedules (% of attempted):    0.22%    0.25%      0.20%
> > >
> > > After the patch:
> > >                                            Run#1    Run#2      Run#3
> > > Immediate reschedules attempted:           882244   770298    426218
> > > Immediate reschedules blocked:             881996   769796    426074
> > > Immediate reschedules (delta):             248      502       144
> > > Immediate reschedules (% of attempted):    0.03%    0.07%     0.03%
> > >
> > > > The number of non-blocked immediate reschedules dropped from 0.20-0.25%
> > > to 0.03-0.07%. The drop is attributed to the decrease in the race window
> > > size and the fact that we allow this race only when psi monitors reach
> > > polling window expiration time.
> > >
> > > Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
> > > Reported-by: Kathleen Chang <yt.chang@mediatek.com>
> > > Reported-by: Wenju Xu <wenju.xu@mediatek.com>
> > > Reported-by: Jonathan Chen <jonathan.jmchen@mediatek.com>
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Tested-by: SH Chen <show-hong.chen@mediatek.com>
> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > > This patch somehow slipped through the cracks after being acked by Johannes in
> > > [1] and I didn't notice it until now because we cherry-picked it into Android
> > > kernel trees due to the urgency at that time. On the bright side, this change
> > > has been tested for about a year in the field by millions of devices.
> > > Resending v4 of this patch previously posted at [2], rebased on the latest
> > > Linus' TOT.
> >
> > Hi Peter,
> > We missed this Ack'ed patch last year and as I described above I
> > didn't notice that up until now. With rc1 released, hopefully it's a
> > good time to ping you to ask for inclusion of this patch in your tree.
> > If the timing is not good, please let me know when to remind you and
> > I'll send another email. Just want to make sure it does not slip
> > again.
> >
> > Just FYI, we have two other Ack'ed PSI patches for you to consider:
> >
> > https://lore.kernel.org/all/20221014110551.22695-1-zhouchengming@bytedance.com/
> > https://lore.kernel.org/all/20220919072356.GA29069@haolee.io/
>
> Thanks for the poke; I've picked up all three and will place them in
> sched/core.

Thanks!
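
For anyone reading the archive without the diff at hand, the poll_scheduled
handshake described in the quoted commit message can be modeled in user-space
C11 roughly as below. This is only a sketch under assumptions, not the kernel
code: struct group_model, timer_arms and run_scenario() are hypothetical
names, atomic_exchange() stands in for atomic_xchg(), the seq_cst fence stands
in for the memory barrier, and bumping a counter stands in for mod_timer().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the psi_group fields discussed above. */
struct group_model {
	atomic_int poll_scheduled;    /* nonzero while a poll is already queued */
	unsigned long polling_until;  /* end of the current polling window */
	unsigned int timer_arms;      /* how often the timer would have been armed */
};

/*
 * Models the scheduling side: the exchange runs unconditionally so it always
 * acts as a full barrier. Only the caller that flips the flag from 0 to 1,
 * or a caller passing force, goes on to arm the timer.
 */
static bool schedule_poll_work(struct group_model *g, bool force)
{
	if (atomic_exchange(&g->poll_scheduled, 1) && !force)
		return false;   /* already scheduled, nothing to do */
	g->timer_arms++;        /* stands in for mod_timer() */
	return true;
}

/*
 * Models the worker: the flag is cleared only once the polling window has
 * expired. Inside the window the worker reschedules itself with force set,
 * so the flag never drops to 0 and outside callers stay blocked.
 */
static void poll_work(struct group_model *g, unsigned long now)
{
	if (now > g->polling_until) {
		atomic_store(&g->poll_scheduled, 0);
		/* Order the clear before any later sampling of changed states. */
		atomic_thread_fence(memory_order_seq_cst);
		return;
	}
	schedule_poll_work(g, true);
}

/* Walks through one polling window; returns how often the timer was armed. */
unsigned int run_scenario(void)
{
	struct group_model g = { .polling_until = 100, .timer_arms = 0 };
	bool armed;

	atomic_init(&g.poll_scheduled, 0);

	armed = schedule_poll_work(&g, false);  /* first caller arms the timer */
	assert(armed);
	armed = schedule_poll_work(&g, false);  /* second caller is blocked */
	assert(!armed);

	poll_work(&g, 50);                      /* window open: forced re-arm */
	armed = schedule_poll_work(&g, false);  /* flag still set: blocked */
	assert(!armed);

	poll_work(&g, 150);                     /* window expired: flag cleared */
	armed = schedule_poll_work(&g, false);  /* next state change may arm again */
	assert(armed);

	return g.timer_arms;
}
```

In this model run_scenario() returns 3: one initial arm, one forced re-arm
inside the window, and one arm after the window expired. The point of the
design is that callers outside the worker can only get through when the flag
is 0, which in this scheme happens at most once per polling window.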



end of thread, other threads:[~2022-10-24 22:33 UTC | newest]

Thread overview: 6+ messages
2022-10-10 22:57 [RESEND PATCH v4 1/1] psi: stop relying on timer_pending for poll_work rescheduling Suren Baghdasaryan
2022-10-20 22:25 ` Suren Baghdasaryan
2022-10-24  9:56   ` Peter Zijlstra
2022-10-24 20:45     ` Suren Baghdasaryan
     [not found] ` <20221021013613.1428-1-hdanton@sina.com>
2022-10-21 19:54   ` Suren Baghdasaryan
     [not found]     ` <20221022035200.1911-1-hdanton@sina.com>
2022-10-22  4:03       ` Suren Baghdasaryan
