All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Neeraj Upadhyay <neeraju@codeaurora.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	David Chen <david.chen@nutanix.com>
Subject: [PATCH 4.14 02/27] rcu: Fix missed wakeup of exp_wq waiters
Date: Fri, 24 Sep 2021 14:43:56 +0200	[thread overview]
Message-ID: <20210924124329.259394431@linuxfoundation.org> (raw)
In-Reply-To: <20210924124329.173674820@linuxfoundation.org>

From: Neeraj Upadhyay <neeraju@codeaurora.org>

commit fd6bc19d7676a060a171d1cf3dcbf6fd797eb05f upstream.

Tasks waiting within exp_funnel_lock() for an expedited grace period to
elapse can be starved due to the following sequence of events:

1.	Tasks A and B both attempt to start an expedited grace
	period at about the same time.	This grace period will have
	completed when the lower four bits of the rcu_state structure's
	->expedited_sequence field are 0b'0100', for example, when the
	initial value of this counter is zero.	Task A wins, and thus
	does the actual work of starting the grace period, including
	acquiring the rcu_state structure's .exp_mutex and sets the
	counter to 0b'0001'.

2.	Because task B lost the race to start the grace period, it
	waits on ->expedited_sequence to reach 0b'0100' inside of
	exp_funnel_lock(). This task therefore blocks on the rcu_node
	structure's ->exp_wq[1] field, keeping in mind that the
	end-of-grace-period value of ->expedited_sequence (0b'0100')
	is shifted down two bits before indexing the ->exp_wq[] field.

3.	Task C attempts to start another expedited grace period,
	but blocks on ->exp_mutex, which is still held by Task A.

4.	The aforementioned expedited grace period completes, so that
	->expedited_sequence now has the value 0b'0100'.  A kworker task
	therefore acquires the rcu_state structure's ->exp_wake_mutex
	and starts awakening any tasks waiting for this grace period.

5.	One of the first tasks awakened happens to be Task A.  Task A
	therefore releases the rcu_state structure's ->exp_mutex,
	which allows Task C to start the next expedited grace period,
	which causes the lower four bits of the rcu_state structure's
	->expedited_sequence field to become 0b'0101'.

6.	Task C's expedited grace period completes, so that the lower four
	bits of the rcu_state structure's ->expedited_sequence field now
	become 0b'1000'.

7.	The kworker task from step 4 above continues its wakeups.
	Unfortunately, the wake_up_all() refetches the rcu_state
	structure's .expedited_sequence field:

	wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);

	This results in the wakeup being applied to the rcu_node
	structure's ->exp_wq[2] field, which is unfortunate given that
	Task B is instead waiting on ->exp_wq[1].

On a busy system, no harm is done (or at least no permanent harm is done).
Some later expedited grace period will redo the wakeup.  But on a quiet
system, such as many embedded systems, it might be a good long time before
there was another expedited grace period.  On such embedded systems,
this situation could therefore result in a system hang.

This issue manifested as DPM device timeout during suspend (which
usually qualifies as a quiet time) due to a SCSI device being stuck in
_synchronize_rcu_expedited(), with the following stack trace:

	schedule()
	synchronize_rcu_expedited()
	synchronize_rcu()
	scsi_device_quiesce()
	scsi_bus_suspend()
	dpm_run_callback()
	__device_suspend()

This commit therefore prevents such delays, timeouts, and hangs by
making rcu_exp_wait_wake() use its "s" argument consistently instead of
refetching from rcu_state.expedited_sequence.

Fixes: 3b5f668e715b ("rcu: Overlap wakeups with next expedited grace period")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: David Chen <david.chen@nutanix.com>
Acked-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 kernel/rcu/tree_exp.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -534,7 +534,7 @@ static void rcu_exp_wait_wake(struct rcu
 			spin_unlock(&rnp->exp_lock);
 		}
 		smp_mb(); /* All above changes before wakeup. */
-		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
+		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
 	}
 	trace_rcu_exp_grace_period(rsp->name, s, TPS("endwake"));
 	mutex_unlock(&rsp->exp_wake_mutex);



  parent reply	other threads:[~2021-09-24 12:48 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-24 12:43 [PATCH 4.14 00/27] 4.14.248-rc1 review Greg Kroah-Hartman
2021-09-24 12:43 ` [PATCH 4.14 01/27] s390/bpf: Fix optimizing out zero-extensions Greg Kroah-Hartman
2021-09-24 12:43 ` Greg Kroah-Hartman [this message]
2021-09-24 12:43 ` [PATCH 4.14 03/27] apparmor: remove duplicate macro list_entry_is_head() Greg Kroah-Hartman
2021-09-24 12:43 ` [PATCH 4.14 04/27] crypto: talitos - fix max key size for sha384 and sha512 Greg Kroah-Hartman
2021-09-24 12:43 ` [PATCH 4.14 05/27] sctp: validate chunk size in __rcv_asconf_lookup Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 06/27] sctp: add param size validation for SCTP_PARAM_SET_PRIMARY Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 07/27] dmaengine: acpi: Avoid comparison GSI with Linux vIRQ Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 08/27] thermal/drivers/exynos: Fix an error code in exynos_tmu_probe() Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 09/27] 9p/trans_virtio: Remove sysfs file on probe failure Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 10/27] prctl: allow to setup brk for et_dyn executables Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 11/27] profiling: fix shift-out-of-bounds bugs Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 12/27] pwm: lpc32xx: Dont modify HW state in .probe() after the PWM chip was registered Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 13/27] pwm: mxs: " Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 14/27] Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 15/27] parisc: Move pci_dev_is_behind_card_dino to where it is used Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 16/27] dmaengine: ioat: depends on !UML Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 17/27] dmaengine: xilinx_dma: Set DMA mask for coherent APIs Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 18/27] ceph: lockdep annotations for try_nonblocking_invalidate Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 19/27] nilfs2: fix memory leak in nilfs_sysfs_create_device_group Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 20/27] nilfs2: fix NULL pointer in nilfs_##name##_attr_release Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 21/27] nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 22/27] nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 23/27] nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 24/27] nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 25/27] pwm: rockchip: Dont modify HW state in .remove() callback Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 26/27] blk-throttle: fix UAF by deleteing timer in blk_throtl_exit() Greg Kroah-Hartman
2021-09-24 12:44 ` [PATCH 4.14 27/27] drm/nouveau/nvkm: Replace -ENOSYS with -ENODEV Greg Kroah-Hartman
2021-09-24 13:55 ` [PATCH 4.14 00/27] 4.14.248-rc1 review Daniel Díaz
2021-09-24 17:51 ` Jon Hunter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210924124329.259394431@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=david.chen@nutanix.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=neeraju@codeaurora.org \
    --cc=paulmck@kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.