* [PATCH 0/3][V5] drop the mmap_sem when doing IO in the fault path
@ 2018-12-11 17:37 Josef Bacik
From: Josef Bacik @ 2018-12-11 17:37 UTC (permalink / raw)
  To: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-mm, riel, jack

Here's the latest version, slimmed down a bit from my last submission with more
details in the changelogs as requested.

v4->v5:
- dropped the cached_page infrastructure and the addition of the
  handle_mm_fault_cacheable helper, as they had no discernible effect on
  performance in my testing.
- reworked the page lock dropping logic into its own helper, with comments
  describing how to use it.
- added more details to the changelog for the fpin patch.
- added a patch to clean up the arguments of the mmap readahead functions, as
  per Jan's suggestion.

v3->v4:
- dropped the ->page_mkwrite portion of these patches, we don't actually see
  issues with mkwrite in production, and I kept running into corner cases where
  I missed something important.  I want to wait on that part until I have a real
  reason to do the work so I can have a solid test in place.
- completely reworked how we drop the mmap_sem in filemap_fault and cleaned it
  up a bit.  Once I started actually testing this with our horrifying reproducer
  I saw a bunch of places where we still ended up doing IO under the mmap_sem
  because I had missed a few corner cases.  Fixed this by reworking
  filemap_fault to only return RETRY once it has a completely uptodate page
  ready to be used.
- lots more testing, including production testing.

v2->v3:
- dropped the RFC, ready for a real review.
- fixed a kbuild error for !MMU configs.
- dropped the swapcache patches since Johannes is still working on those parts.

v1->v2:
- reworked so it only affects x86, since it's the only arch I can build and test.
- fixed the fact that do_page_mkwrite wasn't actually sending ALLOW_RETRY down
  to ->page_mkwrite.
- fixed error handling in do_page_mkwrite/callers to explicitly catch
  VM_FAULT_RETRY.
- fixed btrfs to set ->cached_page properly.

-- Original message --

Now that we have proper isolation in place with cgroups2 we have started going
through and fixing the various priority inversions.  Most of them are gone now, but
this one is sort of weird since it's not necessarily a priority inversion that
happens within the kernel, but rather because of something userspace does.

We have giant applications that we want to protect, and parts of these giant
applications do things like watch the system state to determine how healthy the
box is for load balancing and such.  This involves running 'ps' or other such
utilities.  These utilities will often walk /proc/<pid>/whatever, and these
files can sometimes need to down_read(&task->mm->mmap_sem).  Not usually a big
deal, but we noticed during stress testing that our protected application
sometimes has latency spikes trying to get the mmap_sem for tasks that are in
lower priority cgroups.

This is because any waiting down_write() on the rwsem essentially turns it into
a mutex: even if it is currently held for reading, no new readers are admitted,
to keep from starving the writer.  This is fine, except that a lower priority
task could be stuck doing IO because it has been throttled to the point that its
IO is taking much longer than normal.  And because a higher priority group
depends on that read section completing, it is now stuck behind lower priority
work.
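
To make the inversion concrete, here is a simplified sketch of the sequence we
keep hitting.  The three functions are purely illustrative stand-ins for the
three tasks involved; only the rwsem calls are the real API:

#include <linux/mm.h>
#include <linux/rwsem.h>

/* Low priority task: holds mmap_sem for read while its IO is throttled. */
static void low_prio_faulting_task(struct mm_struct *mm)
{
	down_read(&mm->mmap_sem);
	/* page fault IO here is throttled by io.latency and takes a long time */
	up_read(&mm->mmap_sem);
}

/* Any task doing e.g. mmap()/munmap() on the same mm queues as a writer. */
static void writer_task(struct mm_struct *mm)
{
	down_write(&mm->mmap_sem);	/* waits behind the throttled reader */
	up_write(&mm->mmap_sem);
}

/* High priority task reading /proc/<pid>/... needs the sem for read too. */
static void high_prio_proc_reader(struct mm_struct *mm)
{
	/*
	 * Because a writer is already queued, this new reader is not admitted
	 * and ends up waiting on the low priority task's throttled IO.
	 */
	down_read(&mm->mmap_sem);
	up_read(&mm->mmap_sem);
}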

In order to avoid this particular priority inversion we want to use the existing
retry mechanism so that we do not hold the mmap_sem at all if we are going to do
IO.  This sort of exists already for the read case, but it needed to be extended
to cover more than just grabbing the page lock.  With io.latency we throttle at
submit_bio() time, so the readahead paths can block and even page_cache_read can
block, so all of these paths need to have the mmap_sem dropped.
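
The retry protocol this relies on is worth spelling out.  Roughly: the fault
path drops the mmap_sem, does (or kicks off) the IO, and returns VM_FAULT_RETRY;
the arch fault handler then retakes the lock and repeats the fault, ideally
finding an uptodate page in the page cache on the second pass.  A simplified
sketch of the caller side, loosely modeled on the existing x86 fault handler
(the function name and the elided details are illustrative, not the actual arch
code):

#include <linux/mm.h>

static void fault_retry_sketch(struct mm_struct *mm, unsigned long address)
{
	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
	struct vm_area_struct *vma;
	vm_fault_t fault;

retry:
	down_read(&mm->mmap_sem);
	vma = find_vma(mm, address);
	fault = handle_mm_fault(vma, address, flags);
	if (fault & VM_FAULT_RETRY) {
		/*
		 * The fault path already dropped mmap_sem (to wait on the page
		 * lock, do readahead, or call ->readpage).  Go around again;
		 * the next pass should find the page in cache without blocking.
		 */
		flags &= ~FAULT_FLAG_ALLOW_RETRY;
		flags |= FAULT_FLAG_TRIED;
		goto retry;
	}
	up_read(&mm->mmap_sem);
	/* vma validation and error handling (VM_FAULT_ERROR etc.) elided */
}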

The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation.  We use the same retry method as the read path, and simply
cache the page and verify the page is still set up properly on the next pass through
->page_mkwrite().

I've tested these patches with xfstests and there are no regressions.  Let me
know what you think.  Thanks,

Josef


* [PATCH 1/3] filemap: kill page_cache_read usage in filemap_fault
From: Josef Bacik @ 2018-12-11 17:37 UTC (permalink / raw)
  To: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-mm, riel, jack

If we do not have a page at filemap_fault time we'll do this weird
forced page_cache_read thing to populate the page, and then drop it
again and loop around and find it.  This makes for two ways we can read
a page in filemap_fault, and it's not really needed.  Instead add a
FGP_FOR_MMAP flag so that pagecache_get_page() will return an unlocked
page that's in the pagecache.  Then use the normal page locking and
readpage logic already in filemap_fault.  This simplifies the
no-page-in-page-cache case significantly.
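
As a quick illustration of the intended calling convention (the real
filemap_fault below wraps this in its retry and error handling; the wrapper
function here exists only for the example):

#include <linux/pagemap.h>

static struct page *get_fault_page_sketch(struct address_space *mapping,
					  pgoff_t offset, gfp_t gfp_mask)
{
	/*
	 * FGP_CREAT|FGP_FOR_MMAP: find the page or add a new one to the page
	 * cache, but hand it back unlocked either way, so the caller can run
	 * its usual page locking and readpage logic on it.
	 */
	return pagecache_get_page(mapping, offset, FGP_CREAT | FGP_FOR_MMAP,
				  gfp_mask);
}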

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 include/linux/pagemap.h |  1 +
 mm/filemap.c            | 73 ++++++++++---------------------------------------
 2 files changed, 16 insertions(+), 58 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 226f96f0dee0..b13c2442281f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -252,6 +252,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
 #define FGP_WRITE		0x00000008
 #define FGP_NOFS		0x00000010
 #define FGP_NOWAIT		0x00000020
+#define FGP_FOR_MMAP		0x00000040
 
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 		int fgp_flags, gfp_t cache_gfp_mask);
diff --git a/mm/filemap.c b/mm/filemap.c
index 81adec8ee02c..03bce38d8f2b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1503,6 +1503,9 @@ EXPORT_SYMBOL(find_lock_entry);
  *   @gfp_mask and added to the page cache and the VM's LRU
  *   list. The page is returned locked and with an increased
  *   refcount. Otherwise, NULL is returned.
+ * - FGP_FOR_MMAP: Similar to FGP_CREAT, only we want to allow the caller to do
+ *   its own locking dance if the page is already in cache, or unlock the page
+ *   before returning if we had to add the page to pagecache.
  *
  * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
  * if the GFP flags specified for FGP_CREAT are atomic.
@@ -1555,7 +1558,7 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 		if (!page)
 			return NULL;
 
-		if (WARN_ON_ONCE(!(fgp_flags & FGP_LOCK)))
+		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
 			fgp_flags |= FGP_LOCK;
 
 		/* Init accessed so avoid atomic mark_page_accessed later */
@@ -1569,6 +1572,13 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 			if (err == -EEXIST)
 				goto repeat;
 		}
+
+		/*
+	 * add_to_page_cache_lru locks the page, and for mmap we expect
+	 * an unlocked page.
+		 */
+		if (fgp_flags & FGP_FOR_MMAP)
+			unlock_page(page);
 	}
 
 	return page;
@@ -2293,39 +2303,6 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 EXPORT_SYMBOL(generic_file_read_iter);
 
 #ifdef CONFIG_MMU
-/**
- * page_cache_read - adds requested page to the page cache if not already there
- * @file:	file to read
- * @offset:	page index
- * @gfp_mask:	memory allocation flags
- *
- * This adds the requested page to the page cache if it isn't already there,
- * and schedules an I/O to read in its contents from disk.
- */
-static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
-{
-	struct address_space *mapping = file->f_mapping;
-	struct page *page;
-	int ret;
-
-	do {
-		page = __page_cache_alloc(gfp_mask);
-		if (!page)
-			return -ENOMEM;
-
-		ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask);
-		if (ret == 0)
-			ret = mapping->a_ops->readpage(file, page);
-		else if (ret == -EEXIST)
-			ret = 0; /* losing race to add is OK */
-
-		put_page(page);
-
-	} while (ret == AOP_TRUNCATED_PAGE);
-
-	return ret;
-}
-
 #define MMAP_LOTSAMISS  (100)
 
 /*
@@ -2449,9 +2426,11 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
 retry_find:
-		page = find_get_page(mapping, offset);
+		page = pagecache_get_page(mapping, offset,
+					  FGP_CREAT|FGP_FOR_MMAP,
+					  vmf->gfp_mask);
 		if (!page)
-			goto no_cached_page;
+			return vmf_error(-ENOMEM);
 	}
 
 	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
@@ -2488,28 +2467,6 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	vmf->page = page;
 	return ret | VM_FAULT_LOCKED;
 
-no_cached_page:
-	/*
-	 * We're only likely to ever get here if MADV_RANDOM is in
-	 * effect.
-	 */
-	error = page_cache_read(file, offset, vmf->gfp_mask);
-
-	/*
-	 * The page we want has now been added to the page cache.
-	 * In the unlikely event that someone removed it in the
-	 * meantime, we'll just come back here and read it again.
-	 */
-	if (error >= 0)
-		goto retry_find;
-
-	/*
-	 * An error return from page_cache_read can result if the
-	 * system is low on memory, or a problem occurs while trying
-	 * to schedule I/O.
-	 */
-	return vmf_error(error);
-
 page_not_uptodate:
 	/*
 	 * Umm, take care of errors if the page isn't up-to-date.
-- 
2.14.3


* [PATCH 2/3] filemap: pass vm_fault to the mmap ra helpers
From: Josef Bacik @ 2018-12-11 17:38 UTC (permalink / raw)
  To: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-mm, riel, jack

All of the arguments to these functions come from the vmf, and the
following patches are going to add more arguments.  Cut down on the
number of arguments passed by simply passing in the vmf to these two
helpers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 mm/filemap.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 03bce38d8f2b..8fc45f24b201 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2309,20 +2309,20 @@ EXPORT_SYMBOL(generic_file_read_iter);
  * Synchronous readahead happens when we don't even find
  * a page in the page cache at all.
  */
-static void do_sync_mmap_readahead(struct vm_area_struct *vma,
-				   struct file_ra_state *ra,
-				   struct file *file,
-				   pgoff_t offset)
+static void do_sync_mmap_readahead(struct vm_fault *vmf)
 {
+	struct file *file = vmf->vma->vm_file;
+	struct file_ra_state *ra = &file->f_ra;
 	struct address_space *mapping = file->f_mapping;
+	pgoff_t offset = vmf->pgoff;
 
 	/* If we don't want any read-ahead, don't bother */
-	if (vma->vm_flags & VM_RAND_READ)
+	if (vmf->vma->vm_flags & VM_RAND_READ)
 		return;
 	if (!ra->ra_pages)
 		return;
 
-	if (vma->vm_flags & VM_SEQ_READ) {
+	if (vmf->vma->vm_flags & VM_SEQ_READ) {
 		page_cache_sync_readahead(mapping, ra, file, offset,
 					  ra->ra_pages);
 		return;
@@ -2352,16 +2352,16 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
  * Asynchronous readahead happens when we find the page and PG_readahead,
  * so we want to possibly extend the readahead further..
  */
-static void do_async_mmap_readahead(struct vm_area_struct *vma,
-				    struct file_ra_state *ra,
-				    struct file *file,
-				    struct page *page,
-				    pgoff_t offset)
+static void do_async_mmap_readahead(struct vm_fault *vmf,
+				    struct page *page)
 {
+	struct file *file = vmf->vma->vm_file;
+	struct file_ra_state *ra = &file->f_ra;
 	struct address_space *mapping = file->f_mapping;
+	pgoff_t offset = vmf->pgoff;
 
 	/* If we don't want any read-ahead, don't bother */
-	if (vma->vm_flags & VM_RAND_READ)
+	if (vmf->vma->vm_flags & VM_RAND_READ)
 		return;
 	if (ra->mmap_miss > 0)
 		ra->mmap_miss--;
@@ -2418,10 +2418,10 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		 * We found the page, so try async readahead before
 		 * waiting for the lock.
 		 */
-		do_async_mmap_readahead(vmf->vma, ra, file, page, offset);
+		do_async_mmap_readahead(vmf, page);
 	} else if (!page) {
 		/* No page in the page cache at all */
-		do_sync_mmap_readahead(vmf->vma, ra, file, offset);
+		do_sync_mmap_readahead(vmf);
 		count_vm_event(PGMAJFAULT);
 		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
-- 
2.14.3


* [PATCH 3/3] filemap: drop the mmap_sem for all blocking operations
From: Josef Bacik @ 2018-12-11 17:38 UTC (permalink / raw)
  To: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-mm, riel, jack

Currently we only drop the mmap_sem if there is contention on the page
lock.  The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.

The problem with this is the assumption that the readahead does
anything.  In the case that the box is under extreme memory or IO
pressure we may end up not reading anything at all for readahead, which
means we will end up reading in the page under the mmap_sem.

Even if the readahead does something, it could get throttled because of
IO pressure on the system when the process is in a lower priority cgroup.

Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions.  Consider some large company that does
a lot of web traffic.  This large company has load balancing logic in
its core web server, because some engineer thought this was a brilliant
plan.  This load balancing logic gets statistics from /proc about the
system, which trips over processes' mmap_sem for various reasons.  Now the
web server application is in a protected cgroup, but these other
processes may not be, and if they are being throttled while their
mmap_sem is held we'll stall, and cause this nice death spiral.

Instead, rework the filemap_fault path to drop the mmap_sem at any point where
we may do IO or block for an extended period of time.  This includes
while issuing readahead, locking the page, or needing to call ->readpage
because readahead did not occur.  Then once we have a fully uptodate
page we can return with VM_FAULT_RETRY and come back again to find our
nicely in-cache page that was read in outside of the mmap_sem.

This patch also adds a new helper for locking the page with the mmap_sem
dropped.  This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an
error) before it was unlocked.  However a forthcoming patchset will
change this with the ability to abort read-ahead bio's if necessary,
making it more likely that we could contend for a page lock and still
have a not uptodate page.  This allows us to deal with this case by
grabbing the lock and issuing the IO without the mmap_sem held, and then
returning VM_FAULT_RETRY to come back around.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 mm/filemap.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 92 insertions(+), 12 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 8fc45f24b201..10084168eff1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2304,28 +2304,76 @@ EXPORT_SYMBOL(generic_file_read_iter);
 
 #ifdef CONFIG_MMU
 #define MMAP_LOTSAMISS  (100)
+static struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
+					     struct file *fpin)
+{
+	int flags = vmf->flags;
+	if (fpin)
+		return fpin;
+	if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
+	    FAULT_FLAG_ALLOW_RETRY) {
+		fpin = get_file(vmf->vma->vm_file);
+		up_read(&vmf->vma->vm_mm->mmap_sem);
+	}
+	return fpin;
+}
+
+/*
+ * Works similar to lock_page_or_retry, except it will pin the file and drop the
+ * mmap_sem if necessary and then lock the page, and return 1 in this case.
+ * This means the caller needs to deal with the fpin appropriately.  0 return is
+ * the same as in lock_page_or_retry.
+ */
+static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
+				     struct file **fpin)
+{
+	if (trylock_page(page))
+		return 1;
+
+	*fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
+	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
+		return 0;
+	if (vmf->flags & FAULT_FLAG_KILLABLE) {
+		if (__lock_page_killable(page)) {
+			/*
+			 * We didn't have the right flags to drop the mmap_sem,
+			 * but all fault_handlers only check for fatal signals
+			 * if we return VM_FAULT_RETRY, so we need to drop the
+			 * mmap_sem here and return 0 if we don't have a fpin.
+			 */
+			if (*fpin == NULL)
+				up_read(&vmf->vma->vm_mm->mmap_sem);
+			return 0;
+		}
+	} else
+		__lock_page(page);
+	return 1;
+}
+
 
 /*
  * Synchronous readahead happens when we don't even find
  * a page in the page cache at all.
  */
-static void do_sync_mmap_readahead(struct vm_fault *vmf)
+static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 {
 	struct file *file = vmf->vma->vm_file;
 	struct file_ra_state *ra = &file->f_ra;
 	struct address_space *mapping = file->f_mapping;
+	struct file *fpin = NULL;
 	pgoff_t offset = vmf->pgoff;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vmf->vma->vm_flags & VM_RAND_READ)
-		return;
+		return fpin;
 	if (!ra->ra_pages)
-		return;
+		return fpin;
 
 	if (vmf->vma->vm_flags & VM_SEQ_READ) {
+		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		page_cache_sync_readahead(mapping, ra, file, offset,
 					  ra->ra_pages);
-		return;
+		return fpin;
 	}
 
 	/* Avoid banging the cache line if not needed */
@@ -2337,37 +2385,43 @@ static void do_sync_mmap_readahead(struct vm_fault *vmf)
 	 * stop bothering with read-ahead. It will only hurt.
 	 */
 	if (ra->mmap_miss > MMAP_LOTSAMISS)
-		return;
+		return fpin;
 
 	/*
 	 * mmap read-around
 	 */
+	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 	ra->start = max_t(long, 0, offset - ra->ra_pages / 2);
 	ra->size = ra->ra_pages;
 	ra->async_size = ra->ra_pages / 4;
 	ra_submit(ra, mapping, file);
+	return fpin;
 }
 
 /*
  * Asynchronous readahead happens when we find the page and PG_readahead,
  * so we want to possibly extend the readahead further..
  */
-static void do_async_mmap_readahead(struct vm_fault *vmf,
-				    struct page *page)
+static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
+					    struct page *page)
 {
 	struct file *file = vmf->vma->vm_file;
 	struct file_ra_state *ra = &file->f_ra;
 	struct address_space *mapping = file->f_mapping;
+	struct file *fpin = NULL;
 	pgoff_t offset = vmf->pgoff;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vmf->vma->vm_flags & VM_RAND_READ)
-		return;
+		return fpin;
 	if (ra->mmap_miss > 0)
 		ra->mmap_miss--;
-	if (PageReadahead(page))
+	if (PageReadahead(page)) {
+		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		page_cache_async_readahead(mapping, ra, file,
 					   page, offset, ra->ra_pages);
+	}
+	return fpin;
 }
 
 /**
@@ -2397,6 +2451,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 {
 	int error;
 	struct file *file = vmf->vma->vm_file;
+	struct file *fpin = NULL;
 	struct address_space *mapping = file->f_mapping;
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
@@ -2418,10 +2473,10 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		 * We found the page, so try async readahead before
 		 * waiting for the lock.
 		 */
-		do_async_mmap_readahead(vmf, page);
+		fpin = do_async_mmap_readahead(vmf, page);
 	} else if (!page) {
 		/* No page in the page cache at all */
-		do_sync_mmap_readahead(vmf);
+		fpin = do_sync_mmap_readahead(vmf);
 		count_vm_event(PGMAJFAULT);
 		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
@@ -2433,7 +2488,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 			return vmf_error(-ENOMEM);
 	}
 
-	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
+	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin)) {
 		put_page(page);
 		return ret | VM_FAULT_RETRY;
 	}
@@ -2453,6 +2508,16 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	if (unlikely(!PageUptodate(page)))
 		goto page_not_uptodate;
 
+	/*
+	 * We've made it this far and we had to drop our mmap_sem, now is the
+	 * time to return to the upper layer and have it re-find the vma and
+	 * redo the fault.
+	 */
+	if (fpin) {
+		unlock_page(page);
+		goto out_retry;
+	}
+
 	/*
 	 * Found the page and have a reference on it.
 	 * We must recheck i_size under page lock.
@@ -2475,12 +2540,15 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 * and we need to check for errors.
 	 */
 	ClearPageError(page);
+	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 	error = mapping->a_ops->readpage(file, page);
 	if (!error) {
 		wait_on_page_locked(page);
 		if (!PageUptodate(page))
 			error = -EIO;
 	}
+	if (fpin)
+		goto out_retry;
 	put_page(page);
 
 	if (!error || error == AOP_TRUNCATED_PAGE)
@@ -2489,6 +2557,18 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	/* Things didn't work out. Return zero to tell the mm layer so. */
 	shrink_readahead_size_eio(file, ra);
 	return VM_FAULT_SIGBUS;
+
+out_retry:
+	/*
+	 * We dropped the mmap_sem, we need to return to the fault handler to
+	 * re-find the vma and come back and find our hopefully still populated
+	 * page.
+	 */
+	if (page)
+		put_page(page);
+	if (fpin)
+		fput(fpin);
+	return ret | VM_FAULT_RETRY;
 }
 EXPORT_SYMBOL(filemap_fault);
 
-- 
2.14.3


* Re: [PATCH 3/3] filemap: drop the mmap_sem for all blocking operations
From: Andrew Morton @ 2018-12-11 21:15 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, hannes, linux-kernel, tj, david, linux-fsdevel,
	linux-mm, riel, jack

On Tue, 11 Dec 2018 12:38:01 -0500 Josef Bacik <josef@toxicpanda.com> wrote:

> Currently we only drop the mmap_sem if there is contention on the page
> lock.  The idea is that we issue readahead and then go to lock the page
> while it is under IO and we want to not hold the mmap_sem during the IO.
> 
> The problem with this is the assumption that the readahead does
> anything.  In the case that the box is under extreme memory or IO
> pressure we may end up not reading anything at all for readahead, which
> means we will end up reading in the page under the mmap_sem.
> 
> Even if the readahead does something, it could get throttled because of
> io pressure on the system and the process is in a lower priority cgroup.
> 
> Holding the mmap_sem while doing IO is problematic because it can cause
> system-wide priority inversions.  Consider some large company that does
> a lot of web traffic.  This large company has load balancing logic in
> it's core web server, cause some engineer thought this was a brilliant
> plan.  This load balancing logic gets statistics from /proc about the
> system, which trip over processes mmap_sem for various reasons.  Now the
> web server application is in a protected cgroup, but these other
> processes may not be, and if they are being throttled while their
> mmap_sem is held we'll stall, and cause this nice death spiral.
> 
> Instead rework filemap fault path to drop the mmap sem at any point that
> we may do IO or block for an extended period of time.  This includes
> while issuing readahead, locking the page, or needing to call ->readpage
> because readahead did not occur.  Then once we have a fully uptodate
> page we can return with VM_FAULT_RETRY and come back again to find our
> nicely in-cache page that was gotten outside of the mmap_sem.
> 
> This patch also adds a new helper for locking the page with the mmap_sem
> dropped.  This doesn't make sense currently as generally speaking if the
> page is already locked it'll have been read in (unless there was an
> error) before it was unlocked.  However a forthcoming patchset will
> change this with the ability to abort read-ahead bio's if necessary,
> making it more likely that we could contend for a page lock and still
> have a not uptodate page.  This allows us to deal with this case by
> grabbing the lock and issuing the IO without the mmap_sem held, and then
> returning VM_FAULT_RETRY to come back around.
> 
> ...
>
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2304,28 +2304,76 @@ EXPORT_SYMBOL(generic_file_read_iter);
>  
>  #ifdef CONFIG_MMU
>  #define MMAP_LOTSAMISS  (100)
> +static struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
> +					     struct file *fpin)
> +{
> +	int flags = vmf->flags;
> +	if (fpin)
> +		return fpin;

I think a comment here wouldn't hurt: explain what's going on, why we're
handling the fault flag in this fashion.  That's kinda covered in the
lock_page_maybe_drop_mmap() description, but this code is fairly
tricky-looking.


> +	if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
> +	    FAULT_FLAG_ALLOW_RETRY) {
> +		fpin = get_file(vmf->vma->vm_file);
> +		up_read(&vmf->vma->vm_mm->mmap_sem);
> +	}
> +	return fpin;
> +}
> +
> +/*
> + * Works similar to lock_page_or_retry, except it will pin the file and drop the
> + * mmap_sem if necessary and then lock the page, and return 1 in this case.

This isn't true in the case where the trylock_page() succeeded.  Can we
expand on that case here?

> + * This means the caller needs to deal with the fpin appropriately.  0 return is
> + * the same as in lock_page_or_retry.
> + */
> +static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
> +				     struct file **fpin)
> +{
> +	if (trylock_page(page))
> +		return 1;
> +
> +	*fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
> +	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> +		return 0;

maybe_unlock_mmap_for_io() doesn't do anything if
FAULT_FLAG_RETRY_NOWAIT was set, so can we swap the above two
statements?

> +	if (vmf->flags & FAULT_FLAG_KILLABLE) {
> +		if (__lock_page_killable(page)) {
> +			/*
> +			 * We didn't have the right flags to drop the mmap_sem,
> +			 * but all fault_handlers only check for fatal signals
> +			 * if we return VM_FAULT_RETRY, so we need to drop the
> +			 * mmap_sem here and return 0 if we don't have a fpin.
> +			 */
> +			if (*fpin == NULL)
> +				up_read(&vmf->vma->vm_mm->mmap_sem);
> +			return 0;
> +		}
> +	} else
> +		__lock_page(page);
> +	return 1;
> +}
> +
>  
>  /*
>   * Synchronous readahead happens when we don't even find
>   * a page in the page cache at all.
>   */
> -static void do_sync_mmap_readahead(struct vm_fault *vmf)
> +static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)

Let's explain the newly-added return value in the comment?  Under what
circumstances is it NULL, etc.

>  {
>  	struct file *file = vmf->vma->vm_file;
>  	struct file_ra_state *ra = &file->f_ra;
>  	struct address_space *mapping = file->f_mapping;
> +	struct file *fpin = NULL;
>  	pgoff_t offset = vmf->pgoff;
>  
>  	/* If we don't want any read-ahead, don't bother */
>  	if (vmf->vma->vm_flags & VM_RAND_READ)
> -		return;
> +		return fpin;
>  	if (!ra->ra_pages)
> -		return;
> +		return fpin;
>  
>  	if (vmf->vma->vm_flags & VM_SEQ_READ) {
> +		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>  		page_cache_sync_readahead(mapping, ra, file, offset,
>  					  ra->ra_pages);
> -		return;
> +		return fpin;
>  	}
>  
>  	/* Avoid banging the cache line if not needed */
> @@ -2337,37 +2385,43 @@ static void do_sync_mmap_readahead(struct vm_fault *vmf)
>  	 * stop bothering with read-ahead. It will only hurt.
>  	 */
>  	if (ra->mmap_miss > MMAP_LOTSAMISS)
> -		return;
> +		return fpin;
>  
>  	/*
>  	 * mmap read-around
>  	 */
> +	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>  	ra->start = max_t(long, 0, offset - ra->ra_pages / 2);
>  	ra->size = ra->ra_pages;
>  	ra->async_size = ra->ra_pages / 4;
>  	ra_submit(ra, mapping, file);
> +	return fpin;
>  }
>  
>  /*
>   * Asynchronous readahead happens when we find the page and PG_readahead,
>   * so we want to possibly extend the readahead further..
>   */
> -static void do_async_mmap_readahead(struct vm_fault *vmf,
> -				    struct page *page)
> +static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
> +					    struct page *page)
>  {
>  	struct file *file = vmf->vma->vm_file;
>  	struct file_ra_state *ra = &file->f_ra;
>  	struct address_space *mapping = file->f_mapping;
> +	struct file *fpin = NULL;
>  	pgoff_t offset = vmf->pgoff;
>  
>  	/* If we don't want any read-ahead, don't bother */
>  	if (vmf->vma->vm_flags & VM_RAND_READ)
> -		return;
> +		return fpin;
>  	if (ra->mmap_miss > 0)
>  		ra->mmap_miss--;
> -	if (PageReadahead(page))
> +	if (PageReadahead(page)) {
> +		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>  		page_cache_async_readahead(mapping, ra, file,
>  					   page, offset, ra->ra_pages);
> +	}
> +	return fpin;
>  }
>  
>  /**
> @@ -2397,6 +2451,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  {
>  	int error;
>  	struct file *file = vmf->vma->vm_file;
> +	struct file *fpin = NULL;
>  	struct address_space *mapping = file->f_mapping;
>  	struct file_ra_state *ra = &file->f_ra;
>  	struct inode *inode = mapping->host;
> @@ -2418,10 +2473,10 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  		 * We found the page, so try async readahead before
>  		 * waiting for the lock.
>  		 */
> -		do_async_mmap_readahead(vmf, page);
> +		fpin = do_async_mmap_readahead(vmf, page);
>  	} else if (!page) {
>  		/* No page in the page cache at all */
> -		do_sync_mmap_readahead(vmf);
> +		fpin = do_sync_mmap_readahead(vmf);
>  		count_vm_event(PGMAJFAULT);
>  		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
>  		ret = VM_FAULT_MAJOR;
> @@ -2433,7 +2488,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  			return vmf_error(-ENOMEM);

hm, how does this work?  We might have taken a ref on the file and that
ref is recorded in fpin but an error here causes us to lose track of
that elevated refcount?

>  	}
>  
> -	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> +	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin)) {
>  		put_page(page);
>  		return ret | VM_FAULT_RETRY;
>  	}
> @@ -2453,6 +2508,16 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	if (unlikely(!PageUptodate(page)))
>  		goto page_not_uptodate;
>  
> +	/*
> +	 * We've made it this far and we had to drop our mmap_sem, now is the
> +	 * time to return to the upper layer and have it re-find the vma and
> +	 * redo the fault.
> +	 */
> +	if (fpin) {
> +		unlock_page(page);
> +		goto out_retry;
> +	}
> +
>  	/*
>  	 * Found the page and have a reference on it.
>  	 * We must recheck i_size under page lock.
> @@ -2475,12 +2540,15 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	 * and we need to check for errors.
>  	 */
>  	ClearPageError(page);
> +	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>  	error = mapping->a_ops->readpage(file, page);
>  	if (!error) {
>  		wait_on_page_locked(page);
>  		if (!PageUptodate(page))
>  			error = -EIO;
>  	}
> +	if (fpin)
> +		goto out_retry;
>  	put_page(page);
>  
>  	if (!error || error == AOP_TRUNCATED_PAGE)
> @@ -2489,6 +2557,18 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	/* Things didn't work out. Return zero to tell the mm layer so. */
>  	shrink_readahead_size_eio(file, ra);
>  	return VM_FAULT_SIGBUS;
> +
> +out_retry:
> +	/*
> +	 * We dropped the mmap_sem, we need to return to the fault handler to
> +	 * re-find the vma and come back and find our hopefully still populated
> +	 * page.
> +	 */
> +	if (page)
> +		put_page(page);
> +	if (fpin)
> +		fput(fpin);
> +	return ret | VM_FAULT_RETRY;
>  }
>  EXPORT_SYMBOL(filemap_fault);


* Re: [PATCH 2/3] filemap: pass vm_fault to the mmap ra helpers
From: Jan Kara @ 2018-12-12 10:10 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-mm, riel, jack

On Tue 11-12-18 12:38:00, Josef Bacik wrote:
> All of the arguments to these functions come from the vmf, and the
> following patches are going to add more arguments.  Cut down on the
> amount of arguments passed by simply passing in the vmf to these two
> helpers.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

The patch looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/filemap.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 03bce38d8f2b..8fc45f24b201 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2309,20 +2309,20 @@ EXPORT_SYMBOL(generic_file_read_iter);
>   * Synchronous readahead happens when we don't even find
>   * a page in the page cache at all.
>   */
> -static void do_sync_mmap_readahead(struct vm_area_struct *vma,
> -				   struct file_ra_state *ra,
> -				   struct file *file,
> -				   pgoff_t offset)
> +static void do_sync_mmap_readahead(struct vm_fault *vmf)
>  {
> +	struct file *file = vmf->vma->vm_file;
> +	struct file_ra_state *ra = &file->f_ra;
>  	struct address_space *mapping = file->f_mapping;
> +	pgoff_t offset = vmf->pgoff;
>  
>  	/* If we don't want any read-ahead, don't bother */
> -	if (vma->vm_flags & VM_RAND_READ)
> +	if (vmf->vma->vm_flags & VM_RAND_READ)
>  		return;
>  	if (!ra->ra_pages)
>  		return;
>  
> -	if (vma->vm_flags & VM_SEQ_READ) {
> +	if (vmf->vma->vm_flags & VM_SEQ_READ) {
>  		page_cache_sync_readahead(mapping, ra, file, offset,
>  					  ra->ra_pages);
>  		return;
> @@ -2352,16 +2352,16 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
>   * Asynchronous readahead happens when we find the page and PG_readahead,
>   * so we want to possibly extend the readahead further..
>   */
> -static void do_async_mmap_readahead(struct vm_area_struct *vma,
> -				    struct file_ra_state *ra,
> -				    struct file *file,
> -				    struct page *page,
> -				    pgoff_t offset)
> +static void do_async_mmap_readahead(struct vm_fault *vmf,
> +				    struct page *page)
>  {
> +	struct file *file = vmf->vma->vm_file;
> +	struct file_ra_state *ra = &file->f_ra;
>  	struct address_space *mapping = file->f_mapping;
> +	pgoff_t offset = vmf->pgoff;
>  
>  	/* If we don't want any read-ahead, don't bother */
> -	if (vma->vm_flags & VM_RAND_READ)
> +	if (vmf->vma->vm_flags & VM_RAND_READ)
>  		return;
>  	if (ra->mmap_miss > 0)
>  		ra->mmap_miss--;
> @@ -2418,10 +2418,10 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  		 * We found the page, so try async readahead before
>  		 * waiting for the lock.
>  		 */
> -		do_async_mmap_readahead(vmf->vma, ra, file, page, offset);
> +		do_async_mmap_readahead(vmf, page);
>  	} else if (!page) {
>  		/* No page in the page cache at all */
> -		do_sync_mmap_readahead(vmf->vma, ra, file, offset);
> +		do_sync_mmap_readahead(vmf);
>  		count_vm_event(PGMAJFAULT);
>  		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
>  		ret = VM_FAULT_MAJOR;
> -- 
> 2.14.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 3/3] filemap: drop the mmap_sem for all blocking operations
From: Jan Kara @ 2018-12-12 10:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Josef Bacik, kernel-team, hannes, linux-kernel, tj, david,
	linux-fsdevel, linux-mm, riel, jack

On Tue 11-12-18 13:15:19, Andrew Morton wrote:
> On Tue, 11 Dec 2018 12:38:01 -0500 Josef Bacik <josef@toxicpanda.com> wrote:
> 
> > Currently we only drop the mmap_sem if there is contention on the page
> > lock.  The idea is that we issue readahead and then go to lock the page
> > while it is under IO and we want to not hold the mmap_sem during the IO.
> > 
> > The problem with this is the assumption that the readahead does
> > anything.  In the case that the box is under extreme memory or IO
> > pressure we may end up not reading anything at all for readahead, which
> > means we will end up reading in the page under the mmap_sem.
> > 
> > Even if the readahead does something, it could get throttled because of
> > io pressure on the system and the process is in a lower priority cgroup.
> > 
> > Holding the mmap_sem while doing IO is problematic because it can cause
> > system-wide priority inversions.  Consider some large company that does
> > a lot of web traffic.  This large company has load balancing logic in
> > it's core web server, cause some engineer thought this was a brilliant
> > plan.  This load balancing logic gets statistics from /proc about the
> > system, which trip over processes mmap_sem for various reasons.  Now the
> > web server application is in a protected cgroup, but these other
> > processes may not be, and if they are being throttled while their
> > mmap_sem is held we'll stall, and cause this nice death spiral.
> > 
> > Instead rework filemap fault path to drop the mmap sem at any point that
> > we may do IO or block for an extended period of time.  This includes
> > while issuing readahead, locking the page, or needing to call ->readpage
> > because readahead did not occur.  Then once we have a fully uptodate
> > page we can return with VM_FAULT_RETRY and come back again to find our
> > nicely in-cache page that was gotten outside of the mmap_sem.
> > 
> > This patch also adds a new helper for locking the page with the mmap_sem
> > dropped.  This doesn't make sense currently as generally speaking if the
> > page is already locked it'll have been read in (unless there was an
> > error) before it was unlocked.  However a forthcoming patchset will
> > change this with the ability to abort read-ahead bio's if necessary,
> > making it more likely that we could contend for a page lock and still
> > have a not uptodate page.  This allows us to deal with this case by
> > grabbing the lock and issuing the IO without the mmap_sem held, and then
> > returning VM_FAULT_RETRY to come back around.
> > 
> > ...
...
> > @@ -2397,6 +2451,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> >  {
> >  	int error;
> >  	struct file *file = vmf->vma->vm_file;
> > +	struct file *fpin = NULL;
> >  	struct address_space *mapping = file->f_mapping;
> >  	struct file_ra_state *ra = &file->f_ra;
> >  	struct inode *inode = mapping->host;
> > @@ -2418,10 +2473,10 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> >  		 * We found the page, so try async readahead before
> >  		 * waiting for the lock.
> >  		 */
> > -		do_async_mmap_readahead(vmf, page);
> > +		fpin = do_async_mmap_readahead(vmf, page);
> >  	} else if (!page) {
> >  		/* No page in the page cache at all */
> > -		do_sync_mmap_readahead(vmf);
> > +		fpin = do_sync_mmap_readahead(vmf);
> >  		count_vm_event(PGMAJFAULT);
> >  		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
> >  		ret = VM_FAULT_MAJOR;
> > @@ -2433,7 +2488,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> >  			return vmf_error(-ENOMEM);
> 
> hm, how does this work.  We might have taken a ref on the file and that
> ref is recorded in fpin but an error here causes us to lose track of
> that elevated refcount?

Yeah, that looks like a bug to me as well.

> >  	}
> >  
> > -	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> > +	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin)) {
> >  		put_page(page);
> >  		return ret | VM_FAULT_RETRY;
> >  	}

And here we can have the same problem.  Generally, if we went through 'goto
retry_find', we may have the file ref already taken but some exit paths don't
drop that ref properly...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* [PATCH][v6] filemap: drop the mmap_sem for all blocking operations
From: Josef Bacik @ 2018-12-12 15:27 UTC (permalink / raw)
  To: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-mm, riel, jack

Currently we only drop the mmap_sem if there is contention on the page
lock.  The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.

The problem with this is the assumption that the readahead does
anything.  In the case that the box is under extreme memory or IO
pressure we may end up not reading anything at all for readahead, which
means we will end up reading in the page under the mmap_sem.

Even if the readahead does something, it could get throttled because of
IO pressure on the system when the process is in a lower priority cgroup.

Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions.  Consider some large company that does
a lot of web traffic.  This large company has load balancing logic in
it's core web server, cause some engineer thought this was a brilliant
plan.  This load balancing logic gets statistics from /proc about the
system, which trip over processes mmap_sem for various reasons.  Now the
web server application is in a protected cgroup, but these other
processes may not be, and if they are being throttled while their
mmap_sem is held we'll stall, and cause this nice death spiral.

Instead, rework the filemap_fault path to drop the mmap_sem at any point where
we may do IO or block for an extended period of time.  This includes
while issuing readahead, locking the page, or needing to call ->readpage
because readahead did not occur.  Then once we have a fully uptodate
page we can return with VM_FAULT_RETRY and come back again to find our
nicely in-cache page that was read in outside of the mmap_sem.

This patch also adds a new helper for locking the page with the mmap_sem
dropped.  This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an
error) before it was unlocked.  However a forthcoming patchset will
change this with the ability to abort read-ahead bio's if necessary,
making it more likely that we could contend for a page lock and still
have a not uptodate page.  This allows us to deal with this case by
grabbing the lock and issuing the IO without the mmap_sem held, and then
returning VM_FAULT_RETRY to come back around.
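
For reference, this is the contract the reworked fault path follows, condensed
into a sketch (the wrapper function name is made up for illustration; the
helpers are the ones this patch adds, and the full filemap_fault below has more
cases):

#include <linux/mm.h>
#include <linux/pagemap.h>

static vm_fault_t fpin_contract_sketch(struct vm_fault *vmf, struct page *page)
{
	struct file *fpin = NULL;
	vm_fault_t ret = 0;

	/* May drop mmap_sem and pin vmf->vma->vm_file into fpin. */
	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
		goto out_retry;

	if (fpin) {
		/* We blocked without the mmap_sem; redo the fault from the top. */
		unlock_page(page);
		goto out_retry;
	}
	/* ... the normal, still-locked fault path continues here ... */
	vmf->page = page;
	return ret | VM_FAULT_LOCKED;

out_retry:
	/* The page is hopefully cached now; drop our refs and retry the fault. */
	put_page(page);
	if (fpin)
		fput(fpin);
	return ret | VM_FAULT_RETRY;
}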

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
v5->v6:
- added more comments as per Andrew's suggestion.
- fixed the fpin leaks in the two error paths that were pointed out.

 mm/filemap.c | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 116 insertions(+), 19 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 8fc45f24b201..42e03decf20f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2304,28 +2304,91 @@ EXPORT_SYMBOL(generic_file_read_iter);
 
 #ifdef CONFIG_MMU
 #define MMAP_LOTSAMISS  (100)
+static struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
+					     struct file *fpin)
+{
+	int flags = vmf->flags;
+	if (fpin)
+		return fpin;
+
+	/*
+	 * FAULT_FLAG_RETRY_NOWAIT means we don't want to wait on page locks or
+	 * anything, so we only pin the file and drop the mmap_sem if only
+	 * FAULT_FLAG_ALLOW_RETRY is set.
+	 */
+	if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
+	    FAULT_FLAG_ALLOW_RETRY) {
+		fpin = get_file(vmf->vma->vm_file);
+		up_read(&vmf->vma->vm_mm->mmap_sem);
+	}
+	return fpin;
+}
 
 /*
- * Synchronous readahead happens when we don't even find
- * a page in the page cache at all.
+ * lock_page_maybe_drop_mmap - lock the page, possibly dropping the mmap_sem
+ * @vmf - the vm_fault for this fault.
+ * @page - the page to lock.
+ * @fpin - the pointer to the file we may pin (or is already pinned).
+ *
+ * This works similar to lock_page_or_retry in that it can drop the mmap_sem.
+ * It differs in that it actually returns the page locked if it returns 1 and 0
+ * if it couldn't lock the page.  If we did have to drop the mmap_sem then fpin
+ * will point to the pinned file and needs to be fput()'ed at a later point.
  */
-static void do_sync_mmap_readahead(struct vm_fault *vmf)
+static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
+				     struct file **fpin)
+{
+	if (trylock_page(page))
+		return 1;
+
+	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
+		return 0;
+
+	*fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
+	if (vmf->flags & FAULT_FLAG_KILLABLE) {
+		if (__lock_page_killable(page)) {
+			/*
+			 * We didn't have the right flags to drop the mmap_sem,
+			 * but all fault_handlers only check for fatal signals
+			 * if we return VM_FAULT_RETRY, so we need to drop the
+			 * mmap_sem here and return 0 if we don't have a fpin.
+			 */
+			if (*fpin == NULL)
+				up_read(&vmf->vma->vm_mm->mmap_sem);
+			return 0;
+		}
+	} else
+		__lock_page(page);
+	return 1;
+}
+
+
+/*
+ * Synchronous readahead happens when we don't even find a page in the page
+ * cache at all.  We don't want to perform IO under the mmap sem, so if we have
+ * to drop the mmap sem we return the file that was pinned in order for us to do
+ * that.  If we didn't pin a file then we return NULL.  The file that is
+ * returned needs to be fput()'ed when we're done with it.
+ */
+static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 {
 	struct file *file = vmf->vma->vm_file;
 	struct file_ra_state *ra = &file->f_ra;
 	struct address_space *mapping = file->f_mapping;
+	struct file *fpin = NULL;
 	pgoff_t offset = vmf->pgoff;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vmf->vma->vm_flags & VM_RAND_READ)
-		return;
+		return fpin;
 	if (!ra->ra_pages)
-		return;
+		return fpin;
 
 	if (vmf->vma->vm_flags & VM_SEQ_READ) {
+		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		page_cache_sync_readahead(mapping, ra, file, offset,
 					  ra->ra_pages);
-		return;
+		return fpin;
 	}
 
 	/* Avoid banging the cache line if not needed */
@@ -2337,37 +2400,44 @@ static void do_sync_mmap_readahead(struct vm_fault *vmf)
 	 * stop bothering with read-ahead. It will only hurt.
 	 */
 	if (ra->mmap_miss > MMAP_LOTSAMISS)
-		return;
+		return fpin;
 
 	/*
 	 * mmap read-around
 	 */
+	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 	ra->start = max_t(long, 0, offset - ra->ra_pages / 2);
 	ra->size = ra->ra_pages;
 	ra->async_size = ra->ra_pages / 4;
 	ra_submit(ra, mapping, file);
+	return fpin;
 }
 
 /*
  * Asynchronous readahead happens when we find the page and PG_readahead,
- * so we want to possibly extend the readahead further..
+ * so we want to possibly extend the readahead further.  We return the file that
+ * was pinned if we have to drop the mmap_sem in order to do IO.
  */
-static void do_async_mmap_readahead(struct vm_fault *vmf,
-				    struct page *page)
+static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
+					    struct page *page)
 {
 	struct file *file = vmf->vma->vm_file;
 	struct file_ra_state *ra = &file->f_ra;
 	struct address_space *mapping = file->f_mapping;
+	struct file *fpin = NULL;
 	pgoff_t offset = vmf->pgoff;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vmf->vma->vm_flags & VM_RAND_READ)
-		return;
+		return fpin;
 	if (ra->mmap_miss > 0)
 		ra->mmap_miss--;
-	if (PageReadahead(page))
+	if (PageReadahead(page)) {
+		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		page_cache_async_readahead(mapping, ra, file,
 					   page, offset, ra->ra_pages);
+	}
+	return fpin;
 }
 
 /**
@@ -2397,6 +2467,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 {
 	int error;
 	struct file *file = vmf->vma->vm_file;
+	struct file *fpin = NULL;
 	struct address_space *mapping = file->f_mapping;
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
@@ -2418,10 +2489,10 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		 * We found the page, so try async readahead before
 		 * waiting for the lock.
 		 */
-		do_async_mmap_readahead(vmf, page);
+		fpin = do_async_mmap_readahead(vmf, page);
 	} else if (!page) {
 		/* No page in the page cache at all */
-		do_sync_mmap_readahead(vmf);
+		fpin = do_sync_mmap_readahead(vmf);
 		count_vm_event(PGMAJFAULT);
 		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
@@ -2429,14 +2500,15 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		page = pagecache_get_page(mapping, offset,
 					  FGP_CREAT|FGP_FOR_MMAP,
 					  vmf->gfp_mask);
-		if (!page)
+		if (!page) {
+			if (fpin)
+				goto out_retry;
 			return vmf_error(-ENOMEM);
+		}
 	}
 
-	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
-		put_page(page);
-		return ret | VM_FAULT_RETRY;
-	}
+	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
+		goto out_retry;
 
 	/* Did it get truncated? */
 	if (unlikely(page->mapping != mapping)) {
@@ -2453,6 +2525,16 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	if (unlikely(!PageUptodate(page)))
 		goto page_not_uptodate;
 
+	/*
+	 * We've made it this far and we had to drop our mmap_sem, now is the
+	 * time to return to the upper layer and have it re-find the vma and
+	 * redo the fault.
+	 */
+	if (fpin) {
+		unlock_page(page);
+		goto out_retry;
+	}
+
 	/*
 	 * Found the page and have a reference on it.
 	 * We must recheck i_size under page lock.
@@ -2475,12 +2557,15 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 * and we need to check for errors.
 	 */
 	ClearPageError(page);
+	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 	error = mapping->a_ops->readpage(file, page);
 	if (!error) {
 		wait_on_page_locked(page);
 		if (!PageUptodate(page))
 			error = -EIO;
 	}
+	if (fpin)
+		goto out_retry;
 	put_page(page);
 
 	if (!error || error == AOP_TRUNCATED_PAGE)
@@ -2489,6 +2574,18 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	/* Things didn't work out. Return zero to tell the mm layer so. */
 	shrink_readahead_size_eio(file, ra);
 	return VM_FAULT_SIGBUS;
+
+out_retry:
+	/*
+	 * We dropped the mmap_sem, we need to return to the fault handler to
+	 * re-find the vma and come back and find our hopefully still populated
+	 * page.
+	 */
+	if (page)
+		put_page(page);
+	if (fpin)
+		fput(fpin);
+	return ret | VM_FAULT_RETRY;
 }
 EXPORT_SYMBOL(filemap_fault);
 
-- 
2.14.3


* Re: [PATCH][v6] filemap: drop the mmap_sem for all blocking operations
From: Andrew Morton @ 2018-12-12 23:55 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, hannes, linux-kernel, tj, david, linux-fsdevel,
	linux-mm, riel, jack

On Wed, 12 Dec 2018 10:27:57 -0500 Josef Bacik <josef@toxicpanda.com> wrote:

> v5->v6:
> - added more comments as per Andrew's suggestion.
> - fixed the fpin leaks in the two error paths that were pointed out.
> 

hm,

> --- a/mm/filemap.c~filemap-drop-the-mmap_sem-for-all-blocking-operations-v6
> +++ a/mm/filemap.c
> @@ -2461,7 +2476,8 @@ static struct file *do_sync_mmap_readahe
>  
>  /*
>   * Asynchronous readahead happens when we find the page and PG_readahead,
> - * so we want to possibly extend the readahead further..
> + * so we want to possibly extend the readahead further.  We return the file that
> + * was pinned if we have to drop the mmap_sem in order to do IO.
>   */
>  static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
>  					    struct page *page)
> @@ -2545,14 +2561,15 @@ retry_find:
>  		page = pagecache_get_page(mapping, offset,
>  					  FGP_CREAT|FGP_FOR_MMAP,
>  					  vmf->gfp_mask);
> -		if (!page)
> +		if (!page) {
> +			if (fpin)
> +				goto out_retry;

Is this right?  If pagecache_get_page() returns NULL we can now return
VM_FAULT_MAJOR|VM_FAULT_RETRY whereas we used to return ENOMEM.

>  			return vmf_error(-ENOMEM);
> +		}
>  	}
>  
> -	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin)) {
> -		put_page(page);
> -		return ret | VM_FAULT_RETRY;
> -	}
> +	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
> +		goto out_retry;
>  
>  	/* Did it get truncated? */
>  	if (unlikely(page->mapping != mapping)) {


* Re: [PATCH][v6] filemap: drop the mmap_sem for all blocking operations
From: Jan Kara @ 2018-12-13 16:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Josef Bacik, kernel-team, hannes, linux-kernel, tj, david,
	linux-fsdevel, linux-mm, riel, jack

On Wed 12-12-18 15:55:36, Andrew Morton wrote:
> On Wed, 12 Dec 2018 10:27:57 -0500 Josef Bacik <josef@toxicpanda.com> wrote:
> 
> > v5->v6:
> > - added more comments as per Andrew's suggestion.
> > - fixed the fpin leaks in the two error paths that were pointed out.
> > 
> 
> hm,
> 
> > --- a/mm/filemap.c~filemap-drop-the-mmap_sem-for-all-blocking-operations-v6
> > +++ a/mm/filemap.c
> > @@ -2461,7 +2476,8 @@ static struct file *do_sync_mmap_readahe
> >  
> >  /*
> >   * Asynchronous readahead happens when we find the page and PG_readahead,
> > - * so we want to possibly extend the readahead further..
> > + * so we want to possibly extend the readahead further.  We return the file that
> > + * was pinned if we have to drop the mmap_sem in order to do IO.
> >   */
> >  static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
> >  					    struct page *page)
> > @@ -2545,14 +2561,15 @@ retry_find:
> >  		page = pagecache_get_page(mapping, offset,
> >  					  FGP_CREAT|FGP_FOR_MMAP,
> >  					  vmf->gfp_mask);
> > -		if (!page)
> > +		if (!page) {
> > +			if (fpin)
> > +				goto out_retry;
> 
> Is this right?  If pagecache_get_page() returns NULL we can now return
> VM_FAULT_MAJOR|VM_FAULT_RETRY whereas we used to return ENOMEM.

Yes, but once we've dropped the mmap_sem, there's no way to safely return -ENOMEM.
So VM_FAULT_RETRY is really the only option to tell the caller that
mmap_sem is not held anymore...

So the patch looks good to me now. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>


								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


Thread overview: 10+ messages
2018-12-11 17:37 [PATCH 0/3][V5] drop the mmap_sem when doing IO in the fault path Josef Bacik
2018-12-11 17:37 ` [PATCH 1/3] filemap: kill page_cache_read usage in filemap_fault Josef Bacik
2018-12-11 17:38 ` [PATCH 2/3] filemap: pass vm_fault to the mmap ra helpers Josef Bacik
2018-12-12 10:10   ` Jan Kara
2018-12-11 17:38 ` [PATCH 3/3] filemap: drop the mmap_sem for all blocking operations Josef Bacik
2018-12-11 21:15   ` Andrew Morton
2018-12-12 10:36     ` Jan Kara
2018-12-12 15:27   ` [PATCH][v6] " Josef Bacik
2018-12-12 23:55     ` Andrew Morton
2018-12-13 16:01       ` Jan Kara
