linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ben Hutchings <ben@decadent.org.uk>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: akpm@linux-foundation.org, alan@lxorguk.ukuu.org.uk,
	Thomas Gleixner <tglx@linutronix.de>,
	Darren Hart <dvhart@linux.intel.com>
Subject: [ 03/89] futex: Handle futex_pi OWNER_DIED take over correctly
Date: Mon, 03 Dec 2012 14:31:49 +0000	[thread overview]
Message-ID: <20121203143147.095347645@decadent.org.uk> (raw)
In-Reply-To: <20121203143146.549859007@decadent.org.uk>

3.2-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Thomas Gleixner <tglx@linutronix.de>

commit 59fa6245192159ab5e1e17b8e31f15afa9cff4bf upstream.

Siddhesh analyzed a failure in the take over of pi futexes in case the
owner died and provided a workaround.
See: http://sourceware.org/bugzilla/show_bug.cgi?id=14076

The detailed problem analysis shows:

Futex F is initialized with PTHREAD_PRIO_INHERIT and
PTHREAD_MUTEX_ROBUST_NP attributes.

T1 lock_futex_pi(F);

T2 lock_futex_pi(F);
   --> T2 blocks on the futex and creates pi_state which is associated
       to T1.

T1 exits
   --> exit_robust_list() runs
       --> Futex F userspace value TID field is set to 0 and
           FUTEX_OWNER_DIED bit is set.

T3 lock_futex_pi(F);
   --> Succeeds due to the check for F's userspace TID field == 0
   --> Claims ownership of the futex and sets its own TID into the
       userspace TID field of futex F
   --> returns to user space

T1 --> exit_pi_state_list()
       --> Transfers pi_state to waiter T2 and wakes T2 via
       	   rt_mutex_unlock(&pi_state->mutex)

T2 --> acquires pi_state->mutex and gains real ownership of the
       pi_state
   --> Claims ownership of the futex and sets its own TID into the
       userspace TID field of futex F
   --> returns to user space

T3 --> observes inconsistent state

This problem is independent of UP/SMP, preemptible/non preemptible
kernels, or process shared vs. private. The only difference is that
certain configurations are more likely to expose it.

So as Siddhesh correctly analyzed the following check in
futex_lock_pi_atomic() is the culprit:

	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {

We check the userspace value for a TID value of 0 and take over the
futex unconditionally if that's true.

AFAICT this check is there as it is correct for a different corner
case of futexes: the WAITERS bit became stale.

Now the proposed change

-	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
+       if (unlikely(ownerdied ||
+                       !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {

solves the problem, but it's not obvious why and it wreckages the
"stale WAITERS bit" case.

What happens is, that due to the WAITERS bit being set (T2 is blocked
on that futex) it enforces T3 to go through lookup_pi_state(), which
in the above case returns an existing pi_state and therefor forces T3
to legitimately fight with T2 over the ownership of the pi_state (via
pi_state->mutex). Probelm solved!

Though that does not work for the "WAITERS bit is stale" problem
because if lookup_pi_state() does not find existing pi_state it
returns -ERSCH (due to TID == 0) which causes futex_lock_pi() to
return -ESRCH to user space because the OWNER_DIED bit is not set.

Now there is a different solution to that problem. Do not look at the
user space value at all and enforce a lookup of possibly available
pi_state. If pi_state can be found, then the new incoming locker T3
blocks on that pi_state and legitimately races with T2 to acquire the
rt_mutex and the pi_state and therefor the proper ownership of the
user space futex.

lookup_pi_state() has the correct order of checks. It first tries to
find a pi_state associated with the user space futex and only if that
fails it checks for futex TID value = 0. If no pi_state is available
nothing can create new state at that point because this happens with
the hash bucket lock held.

So the above scenario changes to:

T1 lock_futex_pi(F);

T2 lock_futex_pi(F);
   --> T2 blocks on the futex and creates pi_state which is associated
       to T1.

T1 exits
   --> exit_robust_list() runs
       --> Futex F userspace value TID field is set to 0 and
           FUTEX_OWNER_DIED bit is set.

T3 lock_futex_pi(F);
   --> Finds pi_state and blocks on pi_state->rt_mutex

T1 --> exit_pi_state_list()
       --> Transfers pi_state to waiter T2 and wakes it via
       	   rt_mutex_unlock(&pi_state->mutex)

T2 --> acquires pi_state->mutex and gains ownership of the pi_state
   --> Claims ownership of the futex and sets its own TID into the
       userspace TID field of futex F
   --> returns to user space

This covers all gazillion points on which T3 might come in between
T1's exit_robust_list() clearing the TID field and T2 fixing it up. It
also solves the "WAITERS bit stale" problem by forcing the take over.

Another benefit of changing the code this way is that it makes it less
dependent on untrusted user space values and therefor minimizes the
possible wreckage which might be inflicted.

As usual after staring for too long at the futex code my brain hurts
so much that I really want to ditch that whole optimization of
avoiding the syscall for the non contended case for PI futexes and rip
out the maze of corner case handling code. Unfortunately we can't as
user space relies on that existing behaviour, but at least thinking
about it helps me to preserve my mental sanity. Maybe we should
nevertheless :)

Reported-and-tested-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1210232138540.2756@ionos
Acked-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 kernel/futex.c |   41 ++++++++++++++++++++++-------------------
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 3717e7b..20ef219 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -716,7 +716,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 				struct futex_pi_state **ps,
 				struct task_struct *task, int set_waiters)
 {
-	int lock_taken, ret, ownerdied = 0;
+	int lock_taken, ret, force_take = 0;
 	u32 uval, newval, curval, vpid = task_pid_vnr(task);
 
 retry:
@@ -755,17 +755,15 @@ retry:
 	newval = curval | FUTEX_WAITERS;
 
 	/*
-	 * There are two cases, where a futex might have no owner (the
-	 * owner TID is 0): OWNER_DIED. We take over the futex in this
-	 * case. We also do an unconditional take over, when the owner
-	 * of the futex died.
-	 *
-	 * This is safe as we are protected by the hash bucket lock !
+	 * Should we force take the futex? See below.
 	 */
-	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
-		/* Keep the OWNER_DIED bit */
+	if (unlikely(force_take)) {
+		/*
+		 * Keep the OWNER_DIED and the WAITERS bit and set the
+		 * new TID value.
+		 */
 		newval = (curval & ~FUTEX_TID_MASK) | vpid;
-		ownerdied = 0;
+		force_take = 0;
 		lock_taken = 1;
 	}
 
@@ -775,7 +773,7 @@ retry:
 		goto retry;
 
 	/*
-	 * We took the lock due to owner died take over.
+	 * We took the lock due to forced take over.
 	 */
 	if (unlikely(lock_taken))
 		return 1;
@@ -790,20 +788,25 @@ retry:
 		switch (ret) {
 		case -ESRCH:
 			/*
-			 * No owner found for this futex. Check if the
-			 * OWNER_DIED bit is set to figure out whether
-			 * this is a robust futex or not.
+			 * We failed to find an owner for this
+			 * futex. So we have no pi_state to block
+			 * on. This can happen in two cases:
+			 *
+			 * 1) The owner died
+			 * 2) A stale FUTEX_WAITERS bit
+			 *
+			 * Re-read the futex value.
 			 */
 			if (get_futex_value_locked(&curval, uaddr))
 				return -EFAULT;
 
 			/*
-			 * We simply start over in case of a robust
-			 * futex. The code above will take the futex
-			 * and return happy.
+			 * If the owner died or we have a stale
+			 * WAITERS bit the owner TID in the user space
+			 * futex is 0.
 			 */
-			if (curval & FUTEX_OWNER_DIED) {
-				ownerdied = 1;
+			if (!(curval & FUTEX_TID_MASK)) {
+				force_take = 1;
 				goto retry;
 			}
 		default:



  parent reply	other threads:[~2012-12-03 14:58 UTC|newest]

Thread overview: 100+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-12-03 14:31 [ 00/89] 3.2.35-stable review Ben Hutchings
2012-12-03 14:31 ` [ 01/89] UBIFS: introduce categorized lprops counter Ben Hutchings
2012-12-03 14:31 ` [ 02/89] UBIFS: fix mounting problems after power cuts Ben Hutchings
2012-12-03 14:31 ` Ben Hutchings [this message]
2012-12-03 14:31 ` [ 04/89] mac80211: sync acccess to tx_filtered/ps_tx_buf queues Ben Hutchings
2012-12-03 14:31 ` [ 05/89] ASoC: wm8978: pll incorrectly configured when codec is master Ben Hutchings
2012-12-03 14:31 ` [ 06/89] device_cgroup: fix RCU usage Ben Hutchings
2012-12-06 19:36   ` Herton Ronaldo Krzesinski
2012-12-07  1:41     ` Ben Hutchings
2012-12-03 14:31 ` [ 07/89] ASoC: dapm: Use card_list during DAPM shutdown Ben Hutchings
2012-12-03 14:31 ` [ 08/89] s390/signal: set correct address space control Ben Hutchings
2012-12-03 14:31 ` [ 09/89] wireless: allow 40 MHz on world roaming channels 12/13 Ben Hutchings
2012-12-03 14:31 ` [ 10/89] drm/i915/sdvo: clean up connectors on intel_sdvo_init() failures Ben Hutchings
2012-12-03 14:31 ` [ 11/89] s390/gup: add missing TASK_SIZE check to get_user_pages_fast() Ben Hutchings
2012-12-03 14:31 ` [ 12/89] USB: option: add Novatel E362 and Dell Wireless 5800 USB IDs Ben Hutchings
2012-12-03 14:31 ` [ 13/89] USB: option: add Alcatel X220/X500D " Ben Hutchings
2012-12-03 14:32 ` [ 14/89] drm/radeon: fix logic error in atombios_encoders.c Ben Hutchings
2012-12-03 14:32 ` [ 15/89] ttm: Clear the ttm page allocated from high memory zone correctly Ben Hutchings
2012-12-03 14:32 ` [ 16/89] memcg: oom: fix totalpages calculation for memory.swappiness==0 Ben Hutchings
2012-12-03 14:32 ` [ 17/89] tmpfs: change final i_blocks BUG to WARNING Ben Hutchings
2012-12-03 14:32 ` [ 18/89] x86: Exclude E820_RESERVED regions and memory holes above 4 GB from direct mapping Ben Hutchings
2012-12-03 14:32 ` [ 19/89] x86, mm: Find_early_table_space based on ranges that are actually being mapped Ben Hutchings
2012-12-03 14:32 ` [ 20/89] x86, mm: Undo incorrect revert in arch/x86/mm/init.c Ben Hutchings
2012-12-03 14:32 ` [ 21/89] netfilter: Mark SYN/ACK packets as invalid from original direction Ben Hutchings
2012-12-03 14:32 ` [ 22/89] netfilter: Validate the sequence number of dataless ACK packets as well Ben Hutchings
2012-12-03 14:32 ` [ 23/89] netfilter: nf_nat: dont check for port change on ICMP tuples Ben Hutchings
2012-12-03 14:32 ` [ 24/89] ipv4: avoid undefined behavior in do_ip_setsockopt() Ben Hutchings
2012-12-03 14:32 ` [ 25/89] ipv6: setsockopt(IPIPPROTO_IPV6, IPV6_MINHOPCOUNT) forgot to set return value Ben Hutchings
2012-12-03 14:32 ` [ 26/89] net: correct check in dev_addr_del() Ben Hutchings
2012-12-03 14:32 ` [ 27/89] net-rps: Fix brokeness causing OOO packets Ben Hutchings
2012-12-03 14:32 ` [ 28/89] usb: use usb_serial_put in usb_serial_probe errors Ben Hutchings
2012-12-03 14:32 ` [ 29/89] PCI : Calculate right add_size Ben Hutchings
2012-12-03 14:32 ` [ 30/89] Input: i8042 - also perform controller reset when suspending Ben Hutchings
2012-12-03 14:32 ` [ 31/89] ixgbe: add support for new 82599 device id Ben Hutchings
2012-12-03 14:32 ` [ 32/89] ixgbe: add support for X540-AT1 Ben Hutchings
2012-12-03 14:32 ` [ 33/89] drm/i915: Check VBIOS value for determining LVDS dual channel mode, too Ben Hutchings
2012-12-03 14:32 ` [ 34/89] get_dvb_firmware: fix download site for tda10046 firmware Ben Hutchings
2012-12-03 14:32 ` [ 35/89] m68k: fix sigset_t accessor functions Ben Hutchings
2012-12-03 14:32 ` [ 36/89] HID: add quirk for Freescale i.MX28 ROM recovery Ben Hutchings
2012-12-03 14:32 ` [ 37/89] brcm80211: smac: only print block-ack timeout message at trace level Ben Hutchings
2012-12-03 14:32 ` [ 38/89] bas_gigaset: fix pre_reset handling Ben Hutchings
2012-12-03 14:32 ` [ 39/89] GFS2: Test bufdata with buffer locked and gfs2_log_lock held Ben Hutchings
2012-12-03 14:32 ` [ 40/89] ptp: update adjfreq callback description Ben Hutchings
2012-12-03 14:32 ` [ 41/89] watchdog: iTCO_wdt: add Intel Lynx Point DeviceIDs Ben Hutchings
2012-12-03 14:32 ` [ 42/89] acer-wmi: support for P key on TM8372 Ben Hutchings
2012-12-03 14:32 ` [ 43/89] xhci: Remove warnings about MSI and MSI-X capabilities Ben Hutchings
2012-12-03 14:32 ` [ 44/89] xhci: Remove scary warnings about transfer issues Ben Hutchings
2012-12-03 14:32 ` [ 45/89] x86, mce, therm_throt: Dont report power limit and package level thermal throttle events in mcelog Ben Hutchings
2012-12-03 14:32 ` [ 46/89] Input: bcm5974 - set BUTTONPAD property Ben Hutchings
2012-12-03 14:32 ` [ 47/89] watchdog: using u64 in get_sample_period() Ben Hutchings
2012-12-03 14:32 ` [ 48/89] x86, amd: Disable way access filter on Piledriver CPUs Ben Hutchings
2012-12-03 14:32 ` [ 49/89] mtd: ofpart: Fix incorrect NULL check in parse_ofoldpart_partitions() Ben Hutchings
2012-12-03 14:32 ` [ 50/89] mtd: slram: invalid checking of absolute end address Ben Hutchings
2012-12-03 14:32 ` [ 51/89] jffs2: Fix lock acquisition order bug in jffs2_write_begin Ben Hutchings
2012-12-03 14:32 ` [ 52/89] [SCSI] isci: copy fis 0x34 response into proper buffer Ben Hutchings
2012-12-03 14:32 ` [ 53/89] mac80211: deinitialize ibss-internals after emptiness check Ben Hutchings
2012-12-03 14:32 ` [ 54/89] [PARISC] fix virtual aliasing issue in get_shared_area() Ben Hutchings
2012-12-03 14:32 ` [ 55/89] rtlwifi: rtl8192cu: Add new USB ID Ben Hutchings
2012-12-03 14:32 ` [ 56/89] mwifiex: fix system hang issue in cmd timeout error case Ben Hutchings
2012-12-03 14:32 ` [ 57/89] mwifiex: report error to MMC core if we cannot suspend Ben Hutchings
2012-12-03 14:32 ` [ 58/89] xfs: drop buffer io reference when a bad bio is built Ben Hutchings
2012-12-03 14:32 ` [ 59/89] ALSA: ua101, usx2y: fix broken MIDI output Ben Hutchings
2012-12-03 14:32 ` [ 60/89] sparc64: not any error from do_sigaltstack() should fail rt_sigreturn() Ben Hutchings
2012-12-03 14:32 ` [ 61/89] reiserfs: Fix lock ordering during remount Ben Hutchings
2012-12-03 14:32 ` [ 62/89] reiserfs: Protect reiserfs_quota_on() with write lock Ben Hutchings
2012-12-03 14:32 ` [ 63/89] reiserfs: Protect reiserfs_quota_write() " Ben Hutchings
2012-12-03 14:32 ` [ 64/89] reiserfs: Move quota calls out of " Ben Hutchings
2012-12-03 14:32 ` [ 65/89] md: Reassigned the parameters if read_seqretry returned true in func md_is_badblock Ben Hutchings
2012-12-03 14:32 ` [ 66/89] md: Avoid write invalid address if read_seqretry returned true Ben Hutchings
2012-12-03 14:32 ` [ 67/89] drm/radeon: properly track the crtc not_enabled case evergreen_mc_stop() Ben Hutchings
2012-12-03 15:24   ` Josh Boyer
2012-12-03 15:35     ` Deucher, Alexander
2012-12-03 23:26       ` Josh Boyer
2012-12-03 23:40         ` Deucher, Alexander
2012-12-04 14:35           ` Josh Boyer
2012-12-06 18:14             ` Greg KH
2012-12-09 23:25       ` Ben Hutchings
2012-12-03 14:32 ` [ 68/89] radeon: add AGPMode 1 quirk for RV250 Ben Hutchings
2012-12-03 14:32 ` [ 69/89] x86-32: Fix invalid stack address while in softirq Ben Hutchings
2012-12-03 14:32 ` [ 70/89] x86-32: Export kernel_stack_pointer() for modules Ben Hutchings
2012-12-03 14:32 ` [ 71/89] x86, microcode, AMD: Add support for family 16h processors Ben Hutchings
2012-12-03 14:32 ` [ 72/89] ALSA: hda - Add new codec ALC283 ALC290 support Ben Hutchings
2012-12-03 14:32 ` [ 73/89] ALSA: hda - Add support for Realtek ALC292 Ben Hutchings
2012-12-03 14:33 ` [ 74/89] selinux: fix sel_netnode_insert() suspicious rcu dereference Ben Hutchings
2012-12-03 14:33 ` [ 75/89] Dove: Attempt to fix PMU/RTC interrupts Ben Hutchings
2012-12-03 14:33 ` [ 76/89] Dove: Fix irq_to_pmu() Ben Hutchings
2012-12-03 14:33 ` [ 77/89] ARM: Kirkwood: Update PCI-E fixup Ben Hutchings
2012-12-03 14:33 ` [ 78/89] [PARISC] fix user-triggerable panic on parisc Ben Hutchings
2012-12-03 14:33 ` [ 79/89] dm: fix deadlock with request based dm and queue request_fn recursion Ben Hutchings
2012-12-03 14:33 ` [ 80/89] block: Dont access request after it might be freed Ben Hutchings
2012-12-03 14:33 ` [ 81/89] jbd: Fix lock ordering bug in journal_unmap_buffer() Ben Hutchings
2012-12-03 14:33 ` [ 82/89] can: bcm: initialize ifindex for timeouts without previous frame reception Ben Hutchings
2012-12-03 14:33 ` [ 83/89] futex: avoid wake_futex() for a PI futex_q Ben Hutchings
2012-12-03 14:33 ` [ 84/89] mm/vmemmap: fix wrong use of virt_to_page Ben Hutchings
2012-12-03 14:33 ` [ 85/89] mm: vmscan: fix endless loop in kswapd balancing Ben Hutchings
2012-12-03 14:33 ` [ 86/89] mm: soft offline: split thp at the beginning of soft_offline_page() Ben Hutchings
2012-12-03 14:33 ` [ 87/89] workqueue: exit rescuer_thread() as TASK_RUNNING Ben Hutchings
2012-12-03 14:33 ` [ 88/89] intel_idle: initial IVB support Ben Hutchings
2012-12-03 14:33 ` [ 89/89] intel_idle: enable IVB Xeon support Ben Hutchings
2012-12-03 15:09 ` [ 00/89] 3.2.35-stable review Ben Hutchings

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121203143147.095347645@decadent.org.uk \
    --to=ben@decadent.org.uk \
    --cc=akpm@linux-foundation.org \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=dvhart@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).