stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Waiman Long <longman@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Borislav Petkov <bp@alien8.de>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Will Deacon <will.deacon@arm.com>,
	huang ying <huang.ying.caritas@gmail.com>,
	Ingo Molnar <mingo@kernel.org>, Sasha Levin <sashal@kernel.org>
Subject: [PATCH 4.9 02/44] locking/rwsem: Prevent decrement of reader count before increment
Date: Mon, 20 May 2019 14:13:51 +0200	[thread overview]
Message-ID: <20190520115231.178993431@linuxfoundation.org> (raw)
In-Reply-To: <20190520115230.720347034@linuxfoundation.org>

[ Upstream commit a9e9bcb45b1525ba7aea26ed9441e8632aeeda58 ]

During my rwsem testing, it was found that after a down_read(), the
reader count may occasionally become 0 or even negative. Consequently,
a writer may steal the lock at that time and execute with the reader
in parallel thus breaking the mutual exclusion guarantee of the write
lock. In other words, both readers and writer can become rwsem owners
simultaneously.

The current reader wakeup code does it in one pass to clear waiter->task
and put them into wake_q before fully incrementing the reader count.
Once waiter->task is cleared, the corresponding reader may see it,
finish the critical section and do unlock to decrement the count before
the count is incremented. This is not a problem if there is only one
reader to wake up as the count has been pre-incremented by 1.  It is
a problem if there are more than one readers to be woken up and writer
can steal the lock.

The wakeup was actually done in 2 passes before the following v4.9 commit:

  70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only once")

To fix this problem, the wakeup is now done in two passes
again. In the first pass, we collect the readers and count them.
The reader count is then fully incremented. In the second pass, the
waiter->task is then cleared and they are put into wake_q to be woken
up later.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Fixes: 70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only once")
Link: http://lkml.kernel.org/r/20190428212557.13482-2-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/locking/rwsem-xadd.c | 44 +++++++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 14 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index be06c45cbe4f9..0cdbb636e3163 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -127,6 +127,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 {
 	struct rwsem_waiter *waiter, *tmp;
 	long oldcount, woken = 0, adjustment = 0;
+	struct list_head wlist;
 
 	/*
 	 * Take a peek at the queue head waiter such that we can determine
@@ -185,18 +186,42 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	 * of the queue. We know that woken will be at least 1 as we accounted
 	 * for above. Note we increment the 'active part' of the count by the
 	 * number of readers before waking any processes up.
+	 *
+	 * We have to do wakeup in 2 passes to prevent the possibility that
+	 * the reader count may be decremented before it is incremented. It
+	 * is because the to-be-woken waiter may not have slept yet. So it
+	 * may see waiter->task got cleared, finish its critical section and
+	 * do an unlock before the reader count increment.
+	 *
+	 * 1) Collect the read-waiters in a separate list, count them and
+	 *    fully increment the reader count in rwsem.
+	 * 2) For each waiters in the new list, clear waiter->task and
+	 *    put them into wake_q to be woken up later.
 	 */
-	list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
-		struct task_struct *tsk;
-
+	list_for_each_entry(waiter, &sem->wait_list, list) {
 		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
 			break;
 
 		woken++;
-		tsk = waiter->task;
+	}
+	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
+
+	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
+	if (list_empty(&sem->wait_list)) {
+		/* hit end of list above */
+		adjustment -= RWSEM_WAITING_BIAS;
+	}
+
+	if (adjustment)
+		atomic_long_add(adjustment, &sem->count);
+
+	/* 2nd pass */
+	list_for_each_entry_safe(waiter, tmp, &wlist, list) {
+		struct task_struct *tsk;
 
+		tsk = waiter->task;
 		get_task_struct(tsk);
-		list_del(&waiter->list);
+
 		/*
 		 * Ensure calling get_task_struct() before setting the reader
 		 * waiter to nil such that rwsem_down_read_failed() cannot
@@ -212,15 +237,6 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		/* wake_q_add() already take the task ref */
 		put_task_struct(tsk);
 	}
-
-	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
-	if (list_empty(&sem->wait_list)) {
-		/* hit end of list above */
-		adjustment -= RWSEM_WAITING_BIAS;
-	}
-
-	if (adjustment)
-		atomic_long_add(adjustment, &sem->count);
 }
 
 /*
-- 
2.20.1




  parent reply	other threads:[~2019-05-20 12:16 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-20 12:13 [PATCH 4.9 00/44] 4.9.178-stable review Greg Kroah-Hartman
2019-05-20 12:13 ` [PATCH 4.9 01/44] net: core: another layer of lists, around PF_MEMALLOC skb handling Greg Kroah-Hartman
2019-05-20 12:13 ` Greg Kroah-Hartman [this message]
2019-05-20 12:13 ` [PATCH 4.9 03/44] PCI: hv: Fix a memory leak in hv_eject_device_work() Greg Kroah-Hartman
2019-05-20 12:13 ` [PATCH 4.9 04/44] x86/speculation/mds: Revert CPU buffer clear on double fault exit Greg Kroah-Hartman
2019-05-20 12:13 ` [PATCH 4.9 05/44] x86/speculation/mds: Improve CPU buffer clear documentation Greg Kroah-Hartman
2019-05-20 12:13 ` [PATCH 4.9 06/44] objtool: Fix function fallthrough detection Greg Kroah-Hartman
2019-05-20 12:13 ` [PATCH 4.9 07/44] ARM: exynos: Fix a leaked reference by adding missing of_node_put Greg Kroah-Hartman
2019-05-20 12:13 ` [PATCH 4.9 08/44] power: supply: axp288_charger: Fix unchecked return value Greg Kroah-Hartman
2019-05-20 12:13 ` [PATCH 4.9 09/44] arm64: compat: Reduce address limit Greg Kroah-Hartman
2019-05-20 12:13 ` [PATCH 4.9 10/44] arm64: Clear OSDLR_EL1 on CPU boot Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 11/44] sched/x86: Save [ER]FLAGS on context switch Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 12/44] crypto: chacha20poly1305 - set cra_name correctly Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 13/44] crypto: vmx - fix copy-paste error in CTR mode Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 14/44] crypto: crct10dif-generic - fix use via crypto_shash_digest() Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 15/44] crypto: x86/crct10dif-pcl " Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 16/44] ALSA: usb-audio: Fix a memory leak bug Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 17/44] ALSA: hda/hdmi - Read the pin sense from register when repolling Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 18/44] ALSA: hda/hdmi - Consider eld_valid when reporting jack event Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 19/44] ALSA: hda/realtek - EAPD turn on later Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 20/44] ASoC: max98090: Fix restore of DAPM Muxes Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 21/44] ASoC: RT5677-SPI: Disable 16Bit SPI Transfers Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 22/44] mm/mincore.c: make mincore() more conservative Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 23/44] ocfs2: fix ocfs2 read inode data panic in ocfs2_iget Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 24/44] mfd: da9063: Fix OTP control register names to match datasheets for DA9063/63L Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 25/44] mfd: max77620: Fix swapped FPS_PERIOD_MAX_US values Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 26/44] tty/vt: fix write/write race in ioctl(KDSKBSENT) handler Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 27/44] jbd2: check superblock mapped prior to committing Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 28/44] ext4: actually request zeroing of inode table after grow Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 29/44] ext4: fix ext4_show_options for file systems w/o journal Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 30/44] Btrfs: do not start a transaction at iterate_extent_inodes() Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 31/44] bcache: fix a race between cache register and cacheset unregister Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 32/44] bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim() Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 33/44] ipmi:ssif: compare block number correctly for multi-part return messages Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 34/44] crypto: gcm - Fix error return code in crypto_gcm_create_common() Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 35/44] crypto: gcm - fix incompatibility between "gcm" and "gcm_base" Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 36/44] crypto: salsa20 - dont access already-freed walk.iv Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 37/44] crypto: arm/aes-neonbs " Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 38/44] fib_rules: fix error in backport of e9919a24d302 ("fib_rules: return 0...") Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 39/44] writeback: synchronize sync(2) against cgroup writeback membership switches Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 40/44] fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 41/44] ext4: zero out the unused memory region in the extent tree block Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 42/44] ext4: fix data corruption caused by overlapping unaligned and aligned IO Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 43/44] ALSA: hda/realtek - Fix for Lenovo B50-70 inverted internal microphone bug Greg Kroah-Hartman
2019-05-20 12:14 ` [PATCH 4.9 44/44] KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes Greg Kroah-Hartman
2019-05-20 19:28 ` [PATCH 4.9 00/44] 4.9.178-stable review kernelci.org bot
2019-05-21  8:51 ` Jon Hunter
2019-05-21 10:34 ` Naresh Kamboju
2019-05-21 16:46   ` Greg Kroah-Hartman
2019-05-21 21:39 ` shuah

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190520115231.178993431@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=bp@alien8.de \
    --cc=dave@stgolabs.net \
    --cc=huang.ying.caritas@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=tim.c.chen@linux.intel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).