linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ben Hutchings <ben@decadent.org.uk>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: akpm@linux-foundation.org,
	"Mel Gorman" <mgorman@techsingularity.net>,
	"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
	"Hugh Dickins" <hughd@google.com>,
	dave@stgolabs.net,
	"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
	"Chris Mason" <clm@fb.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Darren Hart" <dvhart@linux.intel.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Mel Gorman" <mgorman@suse.de>, "Chenbo Feng" <fengc@google.com>,
	"Ingo Molnar" <mingo@kernel.org>,
	"Davidlohr Bueso" <dbueso@suse.de>
Subject: [PATCH 3.16 12/63] futex: Remove requirement for lock_page() in get_futex_key()
Date: Sat, 22 Sep 2018 01:15:42 +0100	[thread overview]
Message-ID: <lsq.1537575342.285933720@decadent.org.uk> (raw)
In-Reply-To: <lsq.1537575341.194909669@decadent.org.uk>

3.16.58-rc1 review patch.  If anyone has any objections, please let me know.

------------------

From: Mel Gorman <mgorman@suse.de>

commit 65d8fc777f6dcfee12785c057a6b57f679641c90 upstream.

When dealing with key handling for shared futexes, we can drastically reduce
the usage/need of the page lock. 1) For anonymous pages, the associated futex
object is the mm_struct which does not require the page lock. 2) For inode
based, keys, we can check under RCU read lock if the page mapping is still
valid and take reference to the inode. This just leaves one rare race that
requires the page lock in the slow path when examining the swapcache.

Additionally realtime users currently have a problem with the page lock being
contended for unbounded periods of time during futex operations.

Task A
     get_futex_key()
     lock_page()
    ---> preempted

Now any other task trying to lock that page will have to wait until
task A gets scheduled back in, which is an unbound time.

With this patch, we pretty much have a lockless futex_get_key().

Experiments show that this patch can boost/speedup the hashing of shared
futexes with the perf futex benchmarks (which is good for measuring such
change) by up to 45% when there are high (> 100) thread counts on a 60 core
Westmere. Lower counts are pretty much in the noise range or less than 10%,
but mid range can be seen at over 30% overall throughput (hash ops/sec).
This makes anon-mem shared futexes much closer to its private counterpart.

Signed-off-by: Mel Gorman <mgorman@suse.de>
[ Ported on top of thp refcount rework, changelog, comments, fixes. ]
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Mason <clm@fb.com>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: dave@stgolabs.net
Link: http://lkml.kernel.org/r/1455045314-8305-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.16: s/READ_ONCE/ACCESS_ONCE/]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 kernel/futex.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 91 insertions(+), 7 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -394,6 +394,7 @@ get_futex_key(u32 __user *uaddr, int fsh
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
 	struct page *page, *page_head;
+	struct address_space *mapping;
 	int err, ro = 0;
 
 	/*
@@ -472,7 +473,19 @@ again:
 	}
 #endif
 
-	lock_page(page_head);
+	/*
+	 * The treatment of mapping from this point on is critical. The page
+	 * lock protects many things but in this context the page lock
+	 * stabilizes mapping, prevents inode freeing in the shared
+	 * file-backed region case and guards against movement to swap cache.
+	 *
+	 * Strictly speaking the page lock is not needed in all cases being
+	 * considered here and page lock forces unnecessarily serialization
+	 * From this point on, mapping will be re-verified if necessary and
+	 * page lock will be acquired only if it is unavoidable
+	 */
+
+	mapping = ACCESS_ONCE(page_head->mapping);
 
 	/*
 	 * If page_head->mapping is NULL, then it cannot be a PageAnon
@@ -489,18 +502,31 @@ again:
 	 * shmem_writepage move it from filecache to swapcache beneath us:
 	 * an unlikely race, but we do need to retry for page_head->mapping.
 	 */
-	if (!page_head->mapping) {
-		int shmem_swizzled = PageSwapCache(page_head);
+	if (unlikely(!mapping)) {
+		int shmem_swizzled;
+
+		/*
+		 * Page lock is required to identify which special case above
+		 * applies. If this is really a shmem page then the page lock
+		 * will prevent unexpected transitions.
+		 */
+		lock_page(page);
+		shmem_swizzled = PageSwapCache(page) || page->mapping;
 		unlock_page(page_head);
 		put_page(page_head);
+
 		if (shmem_swizzled)
 			goto again;
+
 		return -EFAULT;
 	}
 
 	/*
 	 * Private mappings are handled in a simple way.
 	 *
+	 * If the futex key is stored on an anonymous page, then the associated
+	 * object is the mm which is implicitly pinned by the calling process.
+	 *
 	 * NOTE: When userspace waits on a MAP_SHARED mapping, even if
 	 * it's a read-only handle, it's expected that futexes attach to
 	 * the object not the particular process.
@@ -518,16 +544,74 @@ again:
 		key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
+
+		get_futex_key_refs(key); /* implies smp_mb(); (B) */
+
 	} else {
+		struct inode *inode;
+
+		/*
+		 * The associated futex object in this case is the inode and
+		 * the page->mapping must be traversed. Ordinarily this should
+		 * be stabilised under page lock but it's not strictly
+		 * necessary in this case as we just want to pin the inode, not
+		 * update the radix tree or anything like that.
+		 *
+		 * The RCU read lock is taken as the inode is finally freed
+		 * under RCU. If the mapping still matches expectations then the
+		 * mapping->host can be safely accessed as being a valid inode.
+		 */
+		rcu_read_lock();
+
+		if (ACCESS_ONCE(page_head->mapping) != mapping) {
+			rcu_read_unlock();
+			put_page(page_head);
+
+			goto again;
+		}
+
+		inode = ACCESS_ONCE(mapping->host);
+		if (!inode) {
+			rcu_read_unlock();
+			put_page(page_head);
+
+			goto again;
+		}
+
+		/*
+		 * Take a reference unless it is about to be freed. Previously
+		 * this reference was taken by ihold under the page lock
+		 * pinning the inode in place so i_lock was unnecessary. The
+		 * only way for this check to fail is if the inode was
+		 * truncated in parallel so warn for now if this happens.
+		 *
+		 * We are not calling into get_futex_key_refs() in file-backed
+		 * cases, therefore a successful atomic_inc return below will
+		 * guarantee that get_futex_key() will still imply smp_mb(); (B).
+		 */
+		if (WARN_ON_ONCE(!atomic_inc_not_zero(&inode->i_count))) {
+			rcu_read_unlock();
+			put_page(page_head);
+
+			goto again;
+		}
+
+		/* Should be impossible but lets be paranoid for now */
+		if (WARN_ON_ONCE(inode->i_mapping != mapping)) {
+			err = -EFAULT;
+			rcu_read_unlock();
+			iput(inode);
+
+			goto out;
+		}
+
 		key->both.offset |= FUT_OFF_INODE; /* inode-based key */
-		key->shared.inode = page_head->mapping->host;
+		key->shared.inode = inode;
 		key->shared.pgoff = basepage_index(page);
+		rcu_read_unlock();
 	}
 
-	get_futex_key_refs(key); /* implies MB (B) */
-
 out:
-	unlock_page(page_head);
 	put_page(page_head);
 	return err;
 }


  parent reply	other threads:[~2018-09-22  0:25 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-22  0:15 [PATCH 3.16 00/63] 3.16.58-rc1 review Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 51/63] xfs: catch inode allocation state mismatch corruption Ben Hutchings
2018-09-22  5:25   ` Dave Chinner
2018-09-22 20:57     ` Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 07/63] usbip: usbip_host: refine probe and disconnect debug msgs to be useful Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 41/63] USB: yurex: fix out-of-bounds uaccess in read handler Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 54/63] seccomp: create internal mode-setting function Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 45/63] x86/paravirt: Fix spectre-v2 mitigations for paravirt guests Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 36/63] jbd2: don't mark block as modified if the handle is out of credits Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 18/63] sr: pass down correctly sized SCSI sense buffer Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 06/63] usbip: usbip_host: fix to hold parent lock for device_attach() calls Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 29/63] ext4: make sure bitmaps and the inode table don't overlap with bg descriptors Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 63/63] mm: get rid of vmacache_flush_all() entirely Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 26/63] ext4: verify the depth of extent tree in ext4_find_extent() Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 38/63] Fix up non-directory creation in SGID directories Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 56/63] seccomp: split mode setting routines Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 13/63] futex: Remove unnecessary warning from get_futex_key Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 21/63] Bluetooth: hidp: buffer overflow in hidp_process_report Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 32/63] ext4: always verify the magic number in xattr blocks Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 23/63] xfs: set format back to extents if xfs_bmap_extents_to_btree Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 08/63] usbip: usbip_host: delete device from busid_table after rebind Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 60/63] x86/cpu/AMD: Fix erratum 1076 (CPB bit) Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 57/63] seccomp: add "seccomp" syscall Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 20/63] scsi: sg: allocate with __GFP_ZERO in sg_build_indirect() Ben Hutchings
2018-09-22  0:19   ` syzbot
2018-09-22  0:15 ` [PATCH 3.16 04/63] net: Set sk_prot_creator when cloning sockets to the right proto Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 33/63] ext4: never move the system.data xattr out of the inode body Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 55/63] seccomp: extract check/assign mode helpers Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 16/63] KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_system Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 35/63] ext4: add more inode number paranoia checks Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 42/63] ALSA: rawmidi: Change resized buffers atomically Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 27/63] ext4: always check block group bounds in ext4_init_block_bitmap() Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 25/63] ext4: fix check to prevent initializing reserved inodes Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 14/63] KVM: x86: Emulator ignores LDTR/TR extended base on LLDT/LTR Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 15/63] KVM: x86: introduce linear_{read,write}_system Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 61/63] x86/cpu/intel: Add Knights Mill to Intel family Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 40/63] infiniband: fix a possible use-after-free bug Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 09/63] usbip: usbip_host: run rebind from exit when module is removed Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 11/63] usbip: usbip_host: fix bad unlock balance during stub_probe() Ben Hutchings
2018-09-22  0:15 ` Ben Hutchings [this message]
2018-09-22  0:15 ` [PATCH 3.16 30/63] ext4: fix false negatives *and* false positives in ext4_check_descriptors() Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 37/63] ext4: avoid running out of journal credits when appending to an inline file Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 47/63] uas: replace WARN_ON_ONCE() with lockdep_assert_held() Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 44/63] x86/speculation: Protect against userspace-userspace spectreRSB Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 49/63] btrfs: relocation: Only remove reloc rb_trees if reloc control has been initialized Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 24/63] ext4: only look at the bg_flags field if it is valid Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 53/63] xfs: don't call xfs_da_shrink_inode with NULL bp Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 17/63] kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor access Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 43/63] x86/speculation: Clean up various Spectre related details Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 52/63] xfs: validate cached inodes are free when allocated Ben Hutchings
2018-09-22  5:26   ` Dave Chinner
2018-09-22 20:57     ` Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 48/63] video: uvesafb: Fix integer overflow in allocation Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 19/63] jfs: Fix inconsistency between memory allocation and ea_buf->max_size Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 22/63] scsi: libsas: defer ata device eh commands to libata Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 05/63] usbip: fix error handling in stub_probe() Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 10/63] usbip: usbip_host: fix NULL-ptr deref and use-after-free errors Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 46/63] cdrom: Fix info leak/OOB read in cdrom_ioctl_drive_status Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 28/63] ext4: don't allow r/w mounts if metadata blocks overlap the superblock Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 03/63] Revert "vti4: Don't override MTU passed on link creation via IFLA_MTU" Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 34/63] ext4: clear i_data in ext4_inode_info when removing inline data Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 01/63] x86/fpu: Fix the 'nofxsr' boot parameter to also clear X86_FEATURE_FXSR_OPT Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 02/63] x86/fpu: Default eagerfpu if FPU and FXSR are enabled Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 58/63] x86/process: Optimize TIF checks in __switch_to_xtra() Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 31/63] ext4: add corruption check in ext4_xattr_set_entry() Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 39/63] x86/entry/64: Remove %ebx handling from error_entry/exit Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 59/63] x86/process: Correct and optimize TIF_BLOCKSTEP switch Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 50/63] hfsplus: fix NULL dereference in hfsplus_lookup() Ben Hutchings
2018-09-22  0:15 ` [PATCH 3.16 62/63] KVM: x86: introduce num_emulated_msrs Ben Hutchings
2018-09-22 12:28 ` [PATCH 3.16 00/63] 3.16.58-rc1 review Guenter Roeck
2018-09-22 21:03   ` Ben Hutchings

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=lsq.1537575342.285933720@decadent.org.uk \
    --to=ben@decadent.org.uk \
    --cc=akpm@linux-foundation.org \
    --cc=bigeasy@linutronix.de \
    --cc=clm@fb.com \
    --cc=dave@stgolabs.net \
    --cc=dbueso@suse.de \
    --cc=dvhart@linux.intel.com \
    --cc=fengc@google.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).