linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Oleksandr Natalenko <oleksandr@natalenko.name>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: [PATCH 5.12 010/110] psi: Fix psi state corruption when schedule() races with cgroup move
Date: Mon, 28 Jun 2021 10:16:48 -0400	[thread overview]
Message-ID: <20210628141828.31757-11-sashal@kernel.org> (raw)
In-Reply-To: <20210628141828.31757-1-sashal@kernel.org>

From: Johannes Weiner <hannes@cmpxchg.org>

commit d583d360a620e6229422b3455d0be082b8255f5e upstream.

4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
introduced a race condition that corrupts internal psi state. This
manifests as kernel warnings, sometimes followed by bogusly high IO
pressure:

  psi: task underflow! cpu=1 t=2 tasks=[0 0 0 0] clear=c set=0
  (schedule() decreasing RUNNING and ONCPU, both of which are 0)

  psi: incosistent task state! task=2412744:systemd cpu=17 psi_flags=e clear=3 set=0
  (cgroup_move_task() clearing MEMSTALL and IOWAIT, but task is MEMSTALL | RUNNING | ONCPU)

What the offending commit does is batch the two psi callbacks in
schedule() to reduce the number of cgroup tree updates. When prev is
deactivated and removed from the runqueue, nothing is done in psi at
first; when the task switch completes, TSK_RUNNING and TSK_IOWAIT are
updated along with TSK_ONCPU.

However, the deactivation and the task switch inside schedule() aren't
atomic: pick_next_task() may drop the rq lock for load balancing. When
this happens, cgroup_move_task() can run after the task has been
physically dequeued, but the psi updates are still pending. Since it
looks at the task's scheduler state, it doesn't move everything to the
new cgroup that the task switch that follows is about to clear from
it. cgroup_move_task() will leak the TSK_RUNNING count in the old
cgroup, and psi_sched_switch() will underflow it in the new cgroup.

A similar thing can happen for iowait. TSK_IOWAIT is usually set when
a p->in_iowait task is dequeued, but again this update is deferred to
the switch. cgroup_move_task() can see an unqueued p->in_iowait task
and move a non-existent TSK_IOWAIT. This results in the inconsistent
task state warning, as well as a counter underflow that will result in
permanent IO ghost pressure being reported.

Fix this bug by making cgroup_move_task() use task->psi_flags instead
of looking at the potentially mismatching scheduler state.

[ We used the scheduler state historically in order to not rely on
  task->psi_flags for anything but debugging. But that ship has sailed
  anyway, and this is simpler and more robust.

  We previously already batched TSK_ONCPU clearing with the
  TSK_RUNNING update inside the deactivation call from schedule(). But
  that ordering was safe and didn't result in TSK_ONCPU corruption:
  unlike most places in the scheduler, cgroup_move_task() only checked
  task_current() and handled TSK_ONCPU if the task was still queued. ]

Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210503174917.38579-1-hannes@cmpxchg.org
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 kernel/sched/psi.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 651218ded981..ef37acd28e4a 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -965,7 +965,7 @@ void psi_cgroup_free(struct cgroup *cgroup)
  */
 void cgroup_move_task(struct task_struct *task, struct css_set *to)
 {
-	unsigned int task_flags = 0;
+	unsigned int task_flags;
 	struct rq_flags rf;
 	struct rq *rq;
 
@@ -980,15 +980,31 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 
 	rq = task_rq_lock(task, &rf);
 
-	if (task_on_rq_queued(task)) {
-		task_flags = TSK_RUNNING;
-		if (task_current(rq, task))
-			task_flags |= TSK_ONCPU;
-	} else if (task->in_iowait)
-		task_flags = TSK_IOWAIT;
-
-	if (task->in_memstall)
-		task_flags |= TSK_MEMSTALL;
+	/*
+	 * We may race with schedule() dropping the rq lock between
+	 * deactivating prev and switching to next. Because the psi
+	 * updates from the deactivation are deferred to the switch
+	 * callback to save cgroup tree updates, the task's scheduling
+	 * state here is not coherent with its psi state:
+	 *
+	 * schedule()                   cgroup_move_task()
+	 *   rq_lock()
+	 *   deactivate_task()
+	 *     p->on_rq = 0
+	 *     psi_dequeue() // defers TSK_RUNNING & TSK_IOWAIT updates
+	 *   pick_next_task()
+	 *     rq_unlock()
+	 *                                rq_lock()
+	 *                                psi_task_change() // old cgroup
+	 *                                task->cgroups = to
+	 *                                psi_task_change() // new cgroup
+	 *                                rq_unlock()
+	 *     rq_lock()
+	 *   psi_sched_switch() // does deferred updates in new cgroup
+	 *
+	 * Don't rely on the scheduling state. Use psi_flags instead.
+	 */
+	task_flags = task->psi_flags;
 
 	if (task_flags)
 		psi_task_change(task, task_flags, 0);
-- 
2.30.2


  parent reply	other threads:[~2021-06-28 14:19 UTC|newest]

Thread overview: 123+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-28 14:16 [PATCH 5.12 000/110] 5.12.14-rc1 review Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 001/110] module: limit enabling module.sig_enforce Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 002/110] Revert "drm/amdgpu/gfx9: fix the doorbell missing when in CGPG issue." Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 003/110] Revert "drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover full doorbell." Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 004/110] drm: add a locked version of drm_is_current_master Sasha Levin
2021-06-28 15:01   ` Emil Velikov
2021-06-28 15:07     ` Greg Kroah-Hartman
2021-06-28 14:16 ` [PATCH 5.12 005/110] drm/nouveau: wait for moving fence after pinning v2 Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 006/110] drm/radeon: wait for moving fence after pinning Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 007/110] drm/amdgpu: " Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 008/110] ARM: 9081/1: fix gcc-10 thumb2-kernel regression Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 009/110] mmc: meson-gx: use memcpy_to/fromio for dram-access-quirk Sasha Levin
2021-06-28 14:16 ` Sasha Levin [this message]
2021-06-28 14:16 ` [PATCH 5.12 011/110] spi: spi-nxp-fspi: move the register operation after the clock enable Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 012/110] Revert "PCI: PM: Do not read power state in pci_enable_device_flags()" Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 013/110] drm/vc4: hdmi: Move the HSM clock enable to runtime_pm Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 014/110] drm/vc4: hdmi: Make sure the controller is powered in detect Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 015/110] x86/entry: Fix noinstr fail in __do_fast_syscall_32() Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 016/110] x86/xen: Fix noinstr fail in xen_pv_evtchn_do_upcall() Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 017/110] x86/xen: Fix noinstr fail in exc_xen_unknown_trap() Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 018/110] locking/lockdep: Improve noinstr vs errors Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 019/110] drm/kmb: Fix error return code in kmb_hw_init() Sasha Levin
2021-06-28 16:29   ` Chrisanthus, Anitha
2021-06-28 21:17     ` Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 020/110] perf/x86/lbr: Remove cpuc->lbr_xsave allocation from atomic context Sasha Levin
2021-06-28 14:16 ` [PATCH 5.12 021/110] perf/x86/intel/lbr: Zero the xstate buffer on allocation Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 022/110] dmaengine: zynqmp_dma: Fix PM reference leak in zynqmp_dma_alloc_chan_resourc() Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 023/110] dmaengine: stm32-mdma: fix PM reference leak in stm32_mdma_alloc_chan_resourc() Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 024/110] dmaengine: xilinx: dpdma: Add missing dependencies to Kconfig Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 025/110] dmaengine: xilinx: dpdma: Limit descriptor IDs to 16 bits Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 026/110] mac80211: remove warning in ieee80211_get_sband() Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 027/110] mac80211_hwsim: drop pending frames on stop Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 028/110] cfg80211: call cfg80211_leave_ocb when switching away from OCB Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 029/110] dmaengine: idxd: Fix missing error code in idxd_cdev_open() Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 030/110] dmaengine: rcar-dmac: Fix PM reference leak in rcar_dmac_probe() Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 031/110] dmaengine: mediatek: free the proper desc in desc_free handler Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 032/110] dmaengine: mediatek: do not issue a new desc if one is still current Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 033/110] dmaengine: mediatek: use GFP_NOWAIT instead of GFP_ATOMIC in prep_dma Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 034/110] net: ipv4: Remove unneed BUG() function Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 035/110] mac80211: drop multicast fragments Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 036/110] net: ethtool: clear heap allocations for ethtool function Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 037/110] inet: annotate data race in inet_send_prepare() and inet_dgram_connect() Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 038/110] ping: Check return value of function 'ping_queue_rcv_skb' Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 039/110] net: annotate data race in sock_error() Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 040/110] inet: annotate date races around sk->sk_txhash Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 041/110] net/packet: annotate data race in packet_sendmsg() Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 042/110] net: phy: dp83867: perform soft reset and retain established link Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 043/110] riscv32: Use medany C model for modules Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 044/110] net: caif: fix memory leak in ldisc_open Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 045/110] bpf, selftests: Adjust few selftest outcomes wrt unreachable code Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 046/110] qmi_wwan: Do not call netif_rx from rx_fixup Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 047/110] net/packet: annotate accesses to po->bind Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 048/110] net/packet: annotate accesses to po->ifindex Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 049/110] r8152: Avoid memcpy() over-reading of ETH_SS_STATS Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 050/110] sh_eth: " Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 051/110] r8169: " Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 052/110] KVM: selftests: Fix kvm_check_cap() assertion Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 053/110] net: qed: Fix memcpy() overflow of qed_dcbx_params() Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 054/110] mac80211: reset profile_periodicity/ema_ap Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 055/110] mac80211: handle various extensible elements correctly Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 056/110] recordmcount: Correct st_shndx handling Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 057/110] PCI: Add AMD RS690 quirk to enable 64-bit DMA Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 058/110] net: ll_temac: Add memory-barriers for TX BD access Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 059/110] net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 060/110] riscv: dts: fu740: fix cache-controller interrupts Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 061/110] perf/x86: Track pmu in per-CPU cpu_hw_events Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 062/110] pinctrl: microchip-sgpio: Put fwnode in error case during ->probe() Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 063/110] pinctrl: stm32: fix the reported number of GPIO lines per bank Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 064/110] i2c: i801: Ensure that SMBHSTSTS_INUSE_STS is cleared when leaving i801_access Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 065/110] gpiolib: cdev: zero padding during conversion to gpioline_info_changed Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 066/110] scsi: sd: Call sd_revalidate_disk() for ioctl(BLKRRPART) Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 067/110] software node: Handle software node injection to an existing device properly Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 068/110] nilfs2: fix memory leak in nilfs_sysfs_delete_device_group Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 069/110] s390/topology: clear thread/group maps for offline cpus Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 070/110] s390/stack: fix possible register corruption with stack switch helper Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 071/110] s390: fix system call restart with multiple signals Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 072/110] s390: clear pt_regs::flags on irq entry Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 073/110] KVM: do not allow mapping valid but non-reference-counted pages Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 074/110] i2c: robotfuzz-osif: fix control-request directions Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 075/110] ceph: must hold snap_rwsem when filling inode for async create Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 076/110] xen/events: reset active flag for lateeoi events later Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 077/110] kthread_worker: split code for canceling the delayed work timer Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 078/110] kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 079/110] x86/fpu: Preserve supervisor states in sanitize_restored_user_xstate() Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 080/110] x86/fpu: Make init_fpstate correct with optimized XSAVE Sasha Levin
2021-06-28 14:17 ` [PATCH 5.12 081/110] mm/memory-failure: use a mutex to avoid memory_failure() races Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 082/110] mm, thp: use head page in __migration_entry_wait() Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 083/110] mm/thp: fix __split_huge_pmd_locked() on shmem migration entry Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 084/110] mm/thp: make is_huge_zero_pmd() safe and quicker Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 085/110] mm/thp: try_to_unmap() use TTU_SYNC for safe splitting Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 086/110] mm/thp: fix vma_address() if virtual address below file offset Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 087/110] mm/thp: fix page_address_in_vma() on file THP tails Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 088/110] mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 089/110] mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 090/110] mm: page_vma_mapped_walk(): use page for pvmw->page Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 091/110] mm: page_vma_mapped_walk(): settle PageHuge on entry Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 092/110] mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 093/110] mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 094/110] mm: page_vma_mapped_walk(): crossing page table boundary Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 095/110] mm: page_vma_mapped_walk(): add a level of indentation Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 096/110] mm: page_vma_mapped_walk(): use goto instead of while (1) Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 097/110] mm: page_vma_mapped_walk(): get vma_address_end() earlier Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 098/110] mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 099/110] mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk() Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 100/110] mm, futex: fix shared futex pgoff on shmem huge page Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 101/110] KVM: SVM: Call SEV Guest Decommission if ASID binding fails Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 102/110] swiotlb: manipulate orig_addr when tlb_addr has offset Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 103/110] netfs: fix test for whether we can skip read when writing beyond EOF Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 104/110] mm/hwpoison: do not lock page again when me_huge_page() successfully recovers Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 105/110] Revert "drm: add a locked version of drm_is_current_master" Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 106/110] certs: Add EFI_CERT_X509_GUID support for dbx entries Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 107/110] certs: Move load_system_certificate_list to a common function Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 108/110] certs: Add ability to preload revocation certs Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 109/110] integrity: Load mokx variables into the blacklist keyring Sasha Levin
2021-06-28 14:18 ` [PATCH 5.12 110/110] Linux 5.12.14-rc1 Sasha Levin
2021-06-28 18:21 ` [PATCH 5.12 000/110] 5.12.14-rc1 review Fox Chen
2021-06-30 12:43   ` Sasha Levin
2021-06-28 22:53 ` Justin Forbes
2021-06-30 12:43   ` Sasha Levin
2021-06-29  6:10 ` Naresh Kamboju
2021-06-30 12:45   ` Sasha Levin
2021-06-29 18:21 ` Guenter Roeck
2021-06-30 12:45   ` Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210628141828.31757-11-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=oleksandr@natalenko.name \
    --cc=peterz@infradead.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).