From: YT Chang <yt.chang@mediatek.com>
To: Suren Baghdasaryan <surenb@google.com>, <peterz@infradead.org>
Cc: <hannes@cmpxchg.org>, <mingo@redhat.com>, <juri.lelli@redhat.com>,
	<vincent.guittot@linaro.org>, <dietmar.eggemann@arm.com>,
	<rostedt@goodmis.org>, <bsegall@google.com>, <mgorman@suse.de>,
	<bristot@redhat.com>, <matthias.bgg@gmail.com>,
	<minchan@google.com>, <timmurray@google.com>,
	<wenju.xu@mediatek.com>, <jonathan.jmchen@mediatek.com>,
	<linux-kernel@vger.kernel.org>,
	<linux-arm-kernel@lists.infradead.org>,
	<linux-mediatek@lists.infradead.org>, <kernel-team@android.com>,
	<show-hong.chen@mediatek.com>
Subject: Re: [PATCH 1/1] psi: stop relying on timer_pending for poll_work rescheduling
Date: Mon, 21 Jun 2021 20:30:57 +0800	[thread overview]
Message-ID: <bc8128a17df82dcd0440a9d5a1a7e33135c2a2d3.camel@mediatek.com> (raw)
In-Reply-To: <20210617212654.1529125-1-surenb@google.com>

On Thu, 2021-06-17 at 14:26 -0700, Suren Baghdasaryan wrote:
> The PSI polling mechanism tries to minimize the number of wakeups to
> run psi_poll_work and currently relies on timer_pending() to detect
> when this work is already scheduled. This leaves a window of
> opportunity for psi_group_change to schedule an immediate
> psi_poll_work after poll_timer_fn got called but before psi_poll_work
> could reschedule itself. Below is a depiction of this entire window:
> 
> poll_timer_fn
>   wake_up_interruptible(&group->poll_wait);
> 
> psi_poll_worker
>   wait_event_interruptible(group->poll_wait, ...)
>   psi_poll_work
>     psi_schedule_poll_work
>       if (timer_pending(&group->poll_timer)) return;
>       ...
>       mod_timer(&group->poll_timer, jiffies + delay);
> 
> Prior to 461daba06bdc we used to rely on the poll_scheduled atomic,
> which was reset and set back inside psi_poll_work, so this race
> window was much smaller.
> The larger window causes an increased number of wakeups, and our
> partners report a visible power regression of ~10mA after applying
> 461daba06bdc.
> Bring back the poll_scheduled atomic and make this race window even
> narrower by resetting poll_scheduled only when we reach polling
> expiration time. This does not completely eliminate the possibility
> of extra wakeups caused by a race with psi_group_change, but it
> limits them to at most one extra wakeup per tracking window (0.5s in
> the worst case).
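
To make the restored guard concrete, here is a minimal stand-alone
sketch of the same cmpxchg pattern using C11 atomics. This is an
illustration only, not the kernel code; the real version is
psi_schedule_poll_work() in the diff further down:

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_int poll_scheduled;   /* 0 = idle, 1 = poll work armed */

    static void schedule_poll_work(unsigned long delay, bool force)
    {
            int expected = 0;

            (void)delay;   /* the kernel version feeds this to mod_timer() */

            /*
             * The exchange must run even when force is set, so that
             * poll_scheduled is marked busy before the timer is armed.
             */
            if (!atomic_compare_exchange_strong(&poll_scheduled,
                                                &expected, 1) && !force)
                    return;          /* already armed, nothing to do */

            /*
             * Arm the timer with the given delay here; if arming fails
             * (e.g. no worker task), reset the flag:
             * atomic_store(&poll_scheduled, 0);
             */
    }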
> By tracing the number of immediate rescheduling attempts performed by
> psi_group_change and the number of these attempts blocked because the
> psi monitor is already active, we can assess the effects of this
> change:
> 
> Before the patch:
>                                            Run#1    Run#2      Run#3
> Immediate reschedules attempted:           684365   1385156    1261240
> Immediate reschedules blocked:             682846   1381654    1258682
> Immediate reschedules (delta):             1519     3502       2558
> Immediate reschedules (% of attempted):    0.22%    0.25%      0.20%
> 
> After the patch:
>                                            Run#1    Run#2      Run#3
> Immediate reschedules attempted:           882244   770298    426218
> Immediate reschedules blocked:             881996   769796    426074
> Immediate reschedules (delta):             248      502       144
> Immediate reschedules (% of attempted):    0.03%    0.07%     0.03%
> 
> The number of non-blocked immediate reschedules dropped from
> 0.22-0.25% to 0.03-0.07%. The drop is attributed to the decrease in
> the race window size and the fact that we allow this race only when
> psi monitors reach polling window expiration time.
> 
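
The counters behind the Before/After numbers above are not part of the
patch itself; presumably something like the hypothetical
instrumentation below was used at the psi_group_change() call site
(the names and placement are a guess, not taken from the patch):

    /* Hypothetical tracing counters, not part of the patch. */
    static atomic_long_t immediate_attempts;
    static atomic_long_t immediate_blocked;

    /* In psi_group_change(), where the immediate reschedule is issued: */
    if (state_mask & group->poll_states) {
            atomic_long_inc(&immediate_attempts);
            /* Approximate: a racy peek at whether the call will bail. */
            if (atomic_read(&group->poll_scheduled))
                    atomic_long_inc(&immediate_blocked);
            psi_schedule_poll_work(group, 1, false);
    }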


Power regression data points for an Android phone idling at the home
screen:
Original        : baseline
Original + patch: -21.5% (-11.5mA, which implies an idle-current
baseline of roughly 53.5mA)
Tested-by: SH Chen <show-hong.chen@mediatek.com>

> Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
> Reported-by: Kathleen Chang <yt.chang@mediatek.com>
> Reported-by: Wenju Xu <wenju.xu@mediatek.com>
> Reported-by: Jonathan Chen <jonathan.jmchen@mediatek.com>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/psi_types.h |  1 +
>  kernel/sched/psi.c        | 41 ++++++++++++++++++++++++++++-----------
>  2 files changed, 31 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> index 0a23300d49af..ef8bd89d065e 100644
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -158,6 +158,7 @@ struct psi_group {
>  	struct timer_list poll_timer;
>  	wait_queue_head_t poll_wait;
>  	atomic_t poll_wakeup;
> +	atomic_t poll_scheduled;
>  
>  	/* Protects data used by the monitor */
>  	struct mutex trigger_lock;
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index cc25a3cff41f..fed7c9c2b276 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -193,6 +193,7 @@ static void group_init(struct psi_group *group)
>  	INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
>  	mutex_init(&group->avgs_lock);
>  	/* Init trigger-related members */
> +	atomic_set(&group->poll_scheduled, 0);
>  	mutex_init(&group->trigger_lock);
>  	INIT_LIST_HEAD(&group->triggers);
>  	memset(group->nr_triggers, 0, sizeof(group->nr_triggers));
> @@ -551,18 +552,14 @@ static u64 update_triggers(struct psi_group *group, u64 now)
>  	return now + group->poll_min_period;
>  }
>  
> -/* Schedule polling if it's not already scheduled. */
> -static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay)
> +/* Schedule polling if it's not already scheduled or forced. */
> +static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay,
> +				   bool force)
>  {
>  	struct task_struct *task;
>  
> -	/*
> -	 * Do not reschedule if already scheduled.
> -	 * Possible race with a timer scheduled after this check but before
> -	 * mod_timer below can be tolerated because group->polling_next_update
> -	 * will keep updates on schedule.
> -	 */
> -	if (timer_pending(&group->poll_timer))
> +	/* cmpxchg should be called even when !force to set poll_scheduled */
> +	if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) != 0 && !force)
>  		return;
>  
>  	rcu_read_lock();
> @@ -574,12 +571,15 @@ static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay)
>  	 */
>  	if (likely(task))
>  		mod_timer(&group->poll_timer, jiffies + delay);
> +	else
> +		atomic_set(&group->poll_scheduled, 0);
>  
>  	rcu_read_unlock();
>  }
>  
>  static void psi_poll_work(struct psi_group *group)
>  {
> +	bool force_reschedule = false;
>  	u32 changed_states;
>  	u64 now;
>  
> @@ -587,6 +587,23 @@ static void psi_poll_work(struct psi_group *group)
>  
>  	now = sched_clock();
>  
> +	if (now > group->polling_until) {
> +		/*
> +		 * We are either about to start or might stop polling if no
> +		 * state change was recorded. Resetting poll_scheduled leaves
> +		 * a small window for psi_group_change to sneak in and schedule
> +		 * an immediate poll_work before we get to rescheduling. One
> +		 * potential extra wakeup at the end of the polling window
> +		 * should be negligible and polling_next_update still keeps
> +		 * updates correctly on schedule.
> +		 */
> +		atomic_set(&group->poll_scheduled, 0);
> +	} else {
> +		/* Polling window is not over, keep rescheduling */
> +		force_reschedule = true;
> +	}
> +
> +
>  	collect_percpu_times(group, PSI_POLL, &changed_states);
>  
>  	if (changed_states & group->poll_states) {
> @@ -612,7 +629,8 @@ static void psi_poll_work(struct psi_group *group)
>  		group->polling_next_update = update_triggers(group, now);
>  
>  	psi_schedule_poll_work(group,
> -		nsecs_to_jiffies(group->polling_next_update - now) + 1);
> +		nsecs_to_jiffies(group->polling_next_update - now) + 1,
> +		force_reschedule);
>  
>  out:
>  	mutex_unlock(&group->trigger_lock);
> @@ -736,7 +754,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  	write_seqcount_end(&groupc->seq);
>  
>  	if (state_mask & group->poll_states)
> -		psi_schedule_poll_work(group, 1);
> +		psi_schedule_poll_work(group, 1, false);
>  
>  	if (wake_clock && !delayed_work_pending(&group->avgs_work))
>  		schedule_delayed_work(&group->avgs_work, PSI_FREQ);
> @@ -1235,6 +1253,7 @@ static void psi_trigger_destroy(struct kref *ref)
>  		 */
>  		del_timer_sync(&group->poll_timer);
>  		kthread_stop(task_to_destroy);
> +		atomic_set(&group->poll_scheduled, 0);
>  	}
>  	kfree(t);
>  }

Thread overview: 18+ messages

2021-06-17 21:26 [PATCH 1/1] psi: stop relying on timer_pending for poll_work rescheduling Suren Baghdasaryan
2021-06-21 12:30 ` YT Chang [this message]
2021-06-21 17:49   ` Suren Baghdasaryan
2021-06-22 13:56 ` Peter Zijlstra
2021-06-22 15:07   ` Johannes Weiner
2021-06-22 15:48     ` Suren Baghdasaryan