stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	patches@lists.linux.dev, Kathleen Chang <yt.chang@mediatek.com>,
	Wenju Xu <wenju.xu@mediatek.com>,
	Jonathan Chen <jonathan.jmchen@mediatek.com>,
	Suren Baghdasaryan <surenb@google.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Chengming Zhou <zhouchengming@bytedance.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	SH Chen <show-hong.chen@mediatek.com>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH 6.1 11/46] sched/psi: Stop relying on timer_pending() for poll_work rescheduling
Date: Thu, 23 Feb 2023 14:06:18 +0100	[thread overview]
Message-ID: <20230223130432.066371926@linuxfoundation.org> (raw)
In-Reply-To: <20230223130431.553657459@linuxfoundation.org>

From: Suren Baghdasaryan <surenb@google.com>

[ Upstream commit 710ffe671e014d5ccbcff225130a178b088ef090 ]

Psi polling mechanism is trying to minimize the number of wakeups to
run psi_poll_work and is currently relying on timer_pending() to detect
when this work is already scheduled. This provides a window of opportunity
for psi_group_change to schedule an immediate psi_poll_work after
poll_timer_fn got called but before psi_poll_work could reschedule itself.
Below is the depiction of this entire window:

poll_timer_fn
  wake_up_interruptible(&group->poll_wait);

psi_poll_worker
  wait_event_interruptible(group->poll_wait, ...)
  psi_poll_work
    psi_schedule_poll_work
      if (timer_pending(&group->poll_timer)) return;
      ...
      mod_timer(&group->poll_timer, jiffies + delay);

Prior to 461daba06bdc we used to rely on poll_scheduled atomic which was
reset and set back inside psi_poll_work and therefore this race window
was much smaller.
The larger window causes increased number of wakeups and our partners
report visible power regression of ~10mA after applying 461daba06bdc.
Bring back the poll_scheduled atomic and make this race window even
narrower by resetting poll_scheduled only when we reach polling expiration
time. This does not completely eliminate the possibility of extra wakeups
caused by a race with psi_group_change however it will limit it to the
worst case scenario of one extra wakeup per every tracking window (0.5s
in the worst case).
This patch also ensures correct ordering between clearing poll_scheduled
flag and obtaining changed_states using memory barrier. Correct ordering
between updating changed_states and setting poll_scheduled is ensured by
atomic_xchg operation.
By tracing the number of immediate rescheduling attempts performed by
psi_group_change and the number of these attempts being blocked due to
psi monitor being already active, we can assess the effects of this change:

Before the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           684365   1385156    1261240
Immediate reschedules blocked:             682846   1381654    1258682
Immediate reschedules (delta):             1519     3502       2558
Immediate reschedules (% of attempted):    0.22%    0.25%      0.20%

After the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           882244   770298    426218
Immediate reschedules blocked:             881996   769796    426074
Immediate reschedules (delta):             248      502       144
Immediate reschedules (% of attempted):    0.03%    0.07%     0.03%

The number of non-blocked immediate reschedules dropped from 0.22-0.25%
to 0.03-0.07%. The drop is attributed to the decrease in the race window
size and the fact that we allow this race only when psi monitors reach
polling window expiration time.

Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Reported-by: Kathleen Chang <yt.chang@mediatek.com>
Reported-by: Wenju Xu <wenju.xu@mediatek.com>
Reported-by: Jonathan Chen <jonathan.jmchen@mediatek.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: SH Chen <show-hong.chen@mediatek.com>
Link: https://lore.kernel.org/r/20221028194541.813985-1-surenb@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 include/linux/psi_types.h |  1 +
 kernel/sched/psi.c        | 62 ++++++++++++++++++++++++++++++++-------
 2 files changed, 53 insertions(+), 10 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 6e43727350689..14a1ebb74e11f 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -177,6 +177,7 @@ struct psi_group {
 	struct timer_list poll_timer;
 	wait_queue_head_t poll_wait;
 	atomic_t poll_wakeup;
+	atomic_t poll_scheduled;
 
 	/* Protects data used by the monitor */
 	struct mutex trigger_lock;
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 48fedeee15c5b..e83c321461cf4 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -189,6 +189,7 @@ static void group_init(struct psi_group *group)
 	INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
 	mutex_init(&group->avgs_lock);
 	/* Init trigger-related members */
+	atomic_set(&group->poll_scheduled, 0);
 	mutex_init(&group->trigger_lock);
 	INIT_LIST_HEAD(&group->triggers);
 	group->poll_min_period = U32_MAX;
@@ -565,18 +566,17 @@ static u64 update_triggers(struct psi_group *group, u64 now)
 	return now + group->poll_min_period;
 }
 
-/* Schedule polling if it's not already scheduled. */
-static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay)
+/* Schedule polling if it's not already scheduled or forced. */
+static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay,
+				   bool force)
 {
 	struct task_struct *task;
 
 	/*
-	 * Do not reschedule if already scheduled.
-	 * Possible race with a timer scheduled after this check but before
-	 * mod_timer below can be tolerated because group->polling_next_update
-	 * will keep updates on schedule.
+	 * atomic_xchg should be called even when !force to provide a
+	 * full memory barrier (see the comment inside psi_poll_work).
 	 */
-	if (timer_pending(&group->poll_timer))
+	if (atomic_xchg(&group->poll_scheduled, 1) && !force)
 		return;
 
 	rcu_read_lock();
@@ -588,12 +588,15 @@ static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay)
 	 */
 	if (likely(task))
 		mod_timer(&group->poll_timer, jiffies + delay);
+	else
+		atomic_set(&group->poll_scheduled, 0);
 
 	rcu_read_unlock();
 }
 
 static void psi_poll_work(struct psi_group *group)
 {
+	bool force_reschedule = false;
 	u32 changed_states;
 	u64 now;
 
@@ -601,6 +604,43 @@ static void psi_poll_work(struct psi_group *group)
 
 	now = sched_clock();
 
+	if (now > group->polling_until) {
+		/*
+		 * We are either about to start or might stop polling if no
+		 * state change was recorded. Resetting poll_scheduled leaves
+		 * a small window for psi_group_change to sneak in and schedule
+		 * an immediate poll_work before we get to rescheduling. One
+		 * potential extra wakeup at the end of the polling window
+		 * should be negligible and polling_next_update still keeps
+		 * updates correctly on schedule.
+		 */
+		atomic_set(&group->poll_scheduled, 0);
+		/*
+		 * A task change can race with the poll worker that is supposed to
+		 * report on it. To avoid missing events, ensure ordering between
+		 * poll_scheduled and the task state accesses, such that if the poll
+		 * worker misses the state update, the task change is guaranteed to
+		 * reschedule the poll worker:
+		 *
+		 * poll worker:
+		 *   atomic_set(poll_scheduled, 0)
+		 *   smp_mb()
+		 *   LOAD states
+		 *
+		 * task change:
+		 *   STORE states
+		 *   if atomic_xchg(poll_scheduled, 1) == 0:
+		 *     schedule poll worker
+		 *
+		 * The atomic_xchg() implies a full barrier.
+		 */
+		smp_mb();
+	} else {
+		/* Polling window is not over, keep rescheduling */
+		force_reschedule = true;
+	}
+
+
 	collect_percpu_times(group, PSI_POLL, &changed_states);
 
 	if (changed_states & group->poll_states) {
@@ -626,7 +666,8 @@ static void psi_poll_work(struct psi_group *group)
 		group->polling_next_update = update_triggers(group, now);
 
 	psi_schedule_poll_work(group,
-		nsecs_to_jiffies(group->polling_next_update - now) + 1);
+		nsecs_to_jiffies(group->polling_next_update - now) + 1,
+		force_reschedule);
 
 out:
 	mutex_unlock(&group->trigger_lock);
@@ -787,7 +828,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	write_seqcount_end(&groupc->seq);
 
 	if (state_mask & group->poll_states)
-		psi_schedule_poll_work(group, 1);
+		psi_schedule_poll_work(group, 1, false);
 
 	if (wake_clock && !delayed_work_pending(&group->avgs_work))
 		schedule_delayed_work(&group->avgs_work, PSI_FREQ);
@@ -941,7 +982,7 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
 		write_seqcount_end(&groupc->seq);
 
 		if (group->poll_states & (1 << PSI_IRQ_FULL))
-			psi_schedule_poll_work(group, 1);
+			psi_schedule_poll_work(group, 1, false);
 	} while ((group = group->parent));
 }
 #endif
@@ -1328,6 +1369,7 @@ void psi_trigger_destroy(struct psi_trigger *t)
 		 * can no longer be found through group->poll_task.
 		 */
 		kthread_stop(task_to_destroy);
+		atomic_set(&group->poll_scheduled, 0);
 	}
 	kfree(t);
 }
-- 
2.39.0




  parent reply	other threads:[~2023-02-23 13:10 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-23 13:06 [PATCH 6.1 00/46] 6.1.14-rc1 review Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 01/46] drm/etnaviv: dont truncate physical page address Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 02/46] wifi: ath11k: fix warning in dma_free_coherent() of memory chunks while recovery Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 03/46] wifi: rtl8xxxu: gen2: Turn on the rate control Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 04/46] drm/edid: Fix minimum bpc supported with DSC1.2 for HDMI sink Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 05/46] clk: mxl: Switch from direct readl/writel based IO to regmap based IO Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 06/46] clk: mxl: Remove redundant spinlocks Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 07/46] clk: mxl: Add option to override gate clks Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 08/46] clk: mxl: Fix a clk entry by adding relevant flags Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 09/46] powerpc: dts: t208x: Mark MAC1 and MAC2 as 10G Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 10/46] clk: mxl: syscon_node_to_regmap() returns error pointers Greg Kroah-Hartman
2023-02-23 13:06 ` Greg Kroah-Hartman [this message]
2023-02-23 13:06 ` [PATCH 6.1 12/46] random: always mix cycle counter in add_latent_entropy() Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 13/46] scsi: libsas: Add smp_ata_check_ready_type() Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 14/46] scsi: hisi_sas: Fix SATA devices missing issue during I_T nexus reset Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 15/46] spi: mediatek: Enable irq when pdata is ready Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 16/46] docs: perf: Fix PMU instance name of hisi-pcie-pmu Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 17/46] KVM: x86: Fail emulation during EMULTYPE_SKIP on any exception Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 18/46] KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isnt valid Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 19/46] KVM: VMX: Execute IBPB on emulated VM-exit when guest has IBRS Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 20/46] can: kvaser_usb: hydra: help gcc-13 to figure out cmd_len Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 21/46] powerpc: dts: t208x: Disable 10G on MAC1 and MAC2 Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 22/46] spi: mediatek: Enable irq before the spi registration Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 23/46] drm/i915: Remove __maybe_unused from mtl_info Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 24/46] KVM: x86: fix deadlock for KVM_XEN_EVTCHN_RESET Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 25/46] selftests: kvm: move declaration at the beginning of main() Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 26/46] powerpc/64s/radix: Fix RWX mapping with relocated kernel Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 27/46] nfp: ethtool: support reporting link modes Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 28/46] nfp: ethtool: fix the bug of setting unsupported port speed Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 29/46] uaccess: Add speculation barrier to copy_from_user() Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 30/46] x86/alternatives: Introduce int3_emulate_jcc() Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 31/46] x86/alternatives: Teach text_poke_bp() to patch Jcc.d32 instructions Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 32/46] x86/static_call: Add support for Jcc tail-calls Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 33/46] Bluetooth: btusb: Add more device IDs for WCN6855 Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 34/46] riscv: remove special treatment for the link order of head.o Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 35/46] arm64: " Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 36/46] arch: fix broken BuildID for arm64 and riscv Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 37/46] powerpc/vmlinux.lds: Define RUNTIME_DISCARD_EXIT Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 38/46] powerpc/vmlinux.lds: Dont discard .rela* for relocatable builds Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 39/46] s390: define RUNTIME_DISCARD_EXIT to fix link error with GNU ld < 2.36 Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 40/46] sh: define RUNTIME_DISCARD_EXIT Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 41/46] wifi: mwifiex: Add missing compatible string for SD8787 Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 42/46] audit: update the mailing list in MAINTAINERS Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 43/46] platform/x86/amd/pmf: Add depends on CONFIG_POWER_SUPPLY Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 44/46] platform/x86: nvidia-wmi-ec-backlight: Add force module parameter Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 45/46] ext4: Fix function prototype mismatch for ext4_feat_ktype Greg Kroah-Hartman
2023-02-23 13:06 ` [PATCH 6.1 46/46] randstruct: disable Clang 15 support Greg Kroah-Hartman
2023-02-23 14:33 ` [PATCH 6.1 00/46] 6.1.14-rc1 review Conor Dooley
2023-02-23 14:37   ` Greg Kroah-Hartman
2023-02-23 14:41     ` Conor Dooley
2023-02-23 14:38   ` Naresh Kamboju
2023-02-23 16:31     ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230223130432.066371926@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=jonathan.jmchen@mediatek.com \
    --cc=patches@lists.linux.dev \
    --cc=peterz@infradead.org \
    --cc=sashal@kernel.org \
    --cc=show-hong.chen@mediatek.com \
    --cc=stable@vger.kernel.org \
    --cc=surenb@google.com \
    --cc=wenju.xu@mediatek.com \
    --cc=yt.chang@mediatek.com \
    --cc=zhouchengming@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).