linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/7][V3] drop the mmap_sem when doing IO in the fault path
@ 2018-10-18 20:23 Josef Bacik
  2018-10-18 20:23 ` [PATCH 1/7] mm: infrastructure for page fault page caching Josef Bacik
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Josef Bacik @ 2018-10-18 20:23 UTC (permalink / raw)
  To: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-btrfs, riel, linux-mm

I'll be getting some production testing running on these patches shortly to
verify they're ready for prime time, but in the meantime they've been through a
bunch of xfstests runs on xfs, btrfs, and ext4 using kvm-xfstests.

v2->v3:
- dropped the RFC, ready for a real review.
- fixed a kbuild error for !MMU configs.
- dropped the swapcache patches since Johannes is still working on those parts.

v1->v2:
- reworked so it only affects x86, since it's the only arch I can build and test.
- fixed the fact that do_page_mkwrite wasn't actually sending ALLOW_RETRY down
  to ->page_mkwrite.
- fixed error handling in do_page_mkwrite/callers to explicitly catch
  VM_FAULT_RETRY.
- fixed btrfs to set ->cached_page properly.

This time I've verified that the ->page_mkwrite retry path is actually getting
exercised (apparently I only verified the read side last time).  xfstests is
still running, but it passed the couple of mmap tests I ran directly.  I'm
still doing a bunch of testing, but I'd appreciate comments on the overall
strategy.

-- Original message --

Now that we have proper isolation in place with cgroup2 we have started going
through and fixing the various priority inversions.  Most of them are gone now,
but this one is sort of weird since it's not necessarily a priority inversion
that happens within the kernel, but rather one caused by something userspace
does.

We have giant applications that we want to protect, and parts of these
applications do things like watch the system state to determine how healthy the
box is for load balancing and such.  This involves running 'ps' or other such
utilities, which will often walk /proc/<pid>/whatever, and these files can
sometimes need to down_read() the task's mmap_sem.  Not usually a big deal, but
we noticed while stress testing that sometimes our protected application has
latency spikes trying to get the mmap_sem for tasks that are in lower priority
cgroups.

This is because a pending down_write() on an rwsem essentially turns it into a
mutex: even if the semaphore is currently held for reading, no new readers are
allowed in, to keep from starving the writer.  That is fine, except a lower
priority task could be stuck doing IO because it has been throttled to the
point that its IO is taking much longer than normal.  And because a higher
priority group depends on that IO completing, it is now stuck behind lower
priority work.

In order to avoid this particular priority inversion we want to use the
existing retry mechanism to avoid holding the mmap_sem at all if we are going
to do IO.  The read side sort of does this already, but it needed to be
extended to cover more than just grabbing the page lock.  With io.latency we
throttle at submit_bio() time, so the readahead paths can block and even
page_cache_read() can block, so all of these paths need to have the mmap_sem
dropped.
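
To make that concrete, the retry dance we're relying on looks roughly like the
following (a simplified sketch of the arch fault handler loop, not the exact
code in the patches):

	/*
	 * Sketch: the fault paths drop mmap_sem before submitting IO and
	 * return VM_FAULT_RETRY; the arch fault code retakes the lock and
	 * goes around again, hopefully finding the page in the cache.
	 */
	flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
retry:
	down_read(&mm->mmap_sem);
	vma = find_vma(mm, address);
	fault = handle_mm_fault(vma, address, flags);
	if (fault & VM_FAULT_RETRY) {
		/* mmap_sem has already been released for us */
		flags &= ~FAULT_FLAG_ALLOW_RETRY;
		flags |= FAULT_FLAG_TRIED;
		goto retry;
	}
	up_read(&mm->mmap_sem);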

The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation.  We use the same retry method as the read path: simply
cache the page and verify that it is still set up properly on the next pass
through ->page_mkwrite().
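
In pseudo-code the write side ends up looking something like the sketch below,
using the helpers and flags this series introduces.  It is illustrative only
(page_mkwrite_setup() and page_still_valid() are placeholders for the
filesystem specific work); the real implementation is the btrfs patch at the
end of the series:

	static vm_fault_t some_fs_page_mkwrite(struct vm_fault *vmf)
	{
		struct page *page = vmf->page;
		struct file *fpin;

		/* Second pass: we already did the expensive setup last time. */
		if (vmf->flags & FAULT_FLAG_USED_CACHED) {
			lock_page(page);
			if (page_still_valid(page))
				return VM_FAULT_LOCKED;
			unlock_page(page);
		}

		/*
		 * First pass: drop the mmap_sem, do the expensive work
		 * (reserve space, wait on writeback), leaving the page locked
		 * and dirty, then ask the fault handler to retry.
		 */
		fpin = maybe_unlock_mmap_for_io(vmf->vma, vmf->flags);
		page_mkwrite_setup(vmf);
		if (fpin) {
			unlock_page(page);
			fput(fpin);
			get_page(page);
			vmf->cached_page = page;
			return VM_FAULT_RETRY;
		}
		return VM_FAULT_LOCKED;
	}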

I've tested these patches with xfstests and there are no regressions.  Let me
know what you think.  Thanks,

Josef



* [PATCH 1/7] mm: infrastructure for page fault page caching
  2018-10-18 20:23 [PATCH 0/7][V3] drop the mmap_sem when doing IO in the fault path Josef Bacik
@ 2018-10-18 20:23 ` Josef Bacik
  2018-10-18 20:23 ` [PATCH 2/7] mm: drop mmap_sem for page cache read IO submission Josef Bacik
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Josef Bacik @ 2018-10-18 20:23 UTC (permalink / raw)
  To: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-btrfs, riel, linux-mm

We want to be able to cache the result of a previous loop through the page
fault handler in the case that we return VM_FAULT_RETRY, so introduce
handle_mm_fault_cacheable(), which takes a struct vm_fault directly, add a
->cached_page field to struct vm_fault, and add helpers to init/cleanup the
struct vm_fault.

I've converted x86; other architectures can follow suit if they so wish, it's
relatively straightforward.
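
For any other architecture the conversion boils down to the same pattern as
the x86 hunk below (a sketch of the shape only):

	struct vm_fault vmf = {};
	vm_fault_t fault;

	vm_fault_init(&vmf, vma, address, flags);
	fault = handle_mm_fault_cacheable(&vmf);
	/* handle VM_FAULT_RETRY and errors as before */
	vm_fault_cleanup(&vmf);
	up_read(&mm->mmap_sem);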

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 arch/x86/mm/fault.c |  6 +++-
 include/linux/mm.h  | 31 +++++++++++++++++++++
 mm/memory.c         | 79 ++++++++++++++++++++++++++++++++---------------------
 3 files changed, 84 insertions(+), 32 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 47bebfe6efa7..ef6e538c4931 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1211,6 +1211,7 @@ static noinline void
 __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		unsigned long address)
 {
+	struct vm_fault vmf = {};
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
 	struct mm_struct *mm;
@@ -1392,7 +1393,8 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	 * fault, so we read the pkey beforehand.
 	 */
 	pkey = vma_pkey(vma);
-	fault = handle_mm_fault(vma, address, flags);
+	vm_fault_init(&vmf, vma, address, flags);
+	fault = handle_mm_fault_cacheable(&vmf);
 	major |= fault & VM_FAULT_MAJOR;
 
 	/*
@@ -1408,6 +1410,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 			if (!fatal_signal_pending(tsk))
 				goto retry;
 		}
+		vm_fault_cleanup(&vmf);
 
 		/* User mode? Just return to handle the fatal exception */
 		if (flags & FAULT_FLAG_USER)
@@ -1418,6 +1421,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		return;
 	}
 
+	vm_fault_cleanup(&vmf);
 	up_read(&mm->mmap_sem);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		mm_fault_error(regs, error_code, address, &pkey, fault);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a61ebe8ad4ca..4a84ec976dfc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -360,6 +360,12 @@ struct vm_fault {
 					 * is set (which is also implied by
 					 * VM_FAULT_ERROR).
 					 */
+	struct page *cached_page;	/* ->fault handlers that return
+					 * VM_FAULT_RETRY can store their
+					 * previous page here to be reused the
+					 * next time we loop through the fault
+					 * handler for faster lookup.
+					 */
 	/* These three entries are valid only while holding ptl lock */
 	pte_t *pte;			/* Pointer to pte entry matching
 					 * the 'address'. NULL if the page
@@ -378,6 +384,16 @@ struct vm_fault {
 					 */
 };
 
+static inline void vm_fault_init(struct vm_fault *vmf,
+				 struct vm_area_struct *vma,
+				 unsigned long address,
+				 unsigned int flags)
+{
+	vmf->vma = vma;
+	vmf->address = address;
+	vmf->flags = flags;
+}
+
 /* page entry size for vm->huge_fault() */
 enum page_entry_size {
 	PE_SIZE_PTE = 0,
@@ -943,6 +959,14 @@ static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
+static inline void vm_fault_cleanup(struct vm_fault *vmf)
+{
+	if (vmf->cached_page) {
+		put_page(vmf->cached_page);
+		vmf->cached_page = NULL;
+	}
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
@@ -1405,6 +1429,7 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags);
+extern vm_fault_t handle_mm_fault_cacheable(struct vm_fault *vmf);
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
@@ -1420,6 +1445,12 @@ static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
 	BUG();
 	return VM_FAULT_SIGBUS;
 }
+static inline vm_fault_t handle_mm_fault_cacheable(struct vm_fault *vmf)
+{
+	/* should never happen if there's no MMU */
+	BUG();
+	return VM_FAULT_SIGBUS;
+}
 static inline int fixup_user_fault(struct task_struct *tsk,
 		struct mm_struct *mm, unsigned long address,
 		unsigned int fault_flags, bool *unlocked)
diff --git a/mm/memory.c b/mm/memory.c
index c467102a5cbc..433075f722ea 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4024,36 +4024,34 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
-static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
-		unsigned long address, unsigned int flags)
+static vm_fault_t __handle_mm_fault(struct vm_fault *vmf)
 {
-	struct vm_fault vmf = {
-		.vma = vma,
-		.address = address & PAGE_MASK,
-		.flags = flags,
-		.pgoff = linear_page_index(vma, address),
-		.gfp_mask = __get_fault_gfp_mask(vma),
-	};
-	unsigned int dirty = flags & FAULT_FLAG_WRITE;
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long address = vmf->address;
+	unsigned int dirty = vmf->flags & FAULT_FLAG_WRITE;
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	vm_fault_t ret;
 
+	vmf->address = address & PAGE_MASK;
+	vmf->pgoff = linear_page_index(vma, address);
+	vmf->gfp_mask = __get_fault_gfp_mask(vma);
+
 	pgd = pgd_offset(mm, address);
 	p4d = p4d_alloc(mm, pgd, address);
 	if (!p4d)
 		return VM_FAULT_OOM;
 
-	vmf.pud = pud_alloc(mm, p4d, address);
-	if (!vmf.pud)
+	vmf->pud = pud_alloc(mm, p4d, address);
+	if (!vmf->pud)
 		return VM_FAULT_OOM;
-	if (pud_none(*vmf.pud) && transparent_hugepage_enabled(vma)) {
-		ret = create_huge_pud(&vmf);
+	if (pud_none(*vmf->pud) && transparent_hugepage_enabled(vma)) {
+		ret = create_huge_pud(vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
 	} else {
-		pud_t orig_pud = *vmf.pud;
+		pud_t orig_pud = *vmf->pud;
 
 		barrier();
 		if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) {
@@ -4061,50 +4059,50 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 			/* NUMA case for anonymous PUDs would go here */
 
 			if (dirty && !pud_write(orig_pud)) {
-				ret = wp_huge_pud(&vmf, orig_pud);
+				ret = wp_huge_pud(vmf, orig_pud);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
 			} else {
-				huge_pud_set_accessed(&vmf, orig_pud);
+				huge_pud_set_accessed(vmf, orig_pud);
 				return 0;
 			}
 		}
 	}
 
-	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
-	if (!vmf.pmd)
+	vmf->pmd = pmd_alloc(mm, vmf->pud, address);
+	if (!vmf->pmd)
 		return VM_FAULT_OOM;
-	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
-		ret = create_huge_pmd(&vmf);
+	if (pmd_none(*vmf->pmd) && transparent_hugepage_enabled(vma)) {
+		ret = create_huge_pmd(vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
 	} else {
-		pmd_t orig_pmd = *vmf.pmd;
+		pmd_t orig_pmd = *vmf->pmd;
 
 		barrier();
 		if (unlikely(is_swap_pmd(orig_pmd))) {
 			VM_BUG_ON(thp_migration_supported() &&
 					  !is_pmd_migration_entry(orig_pmd));
 			if (is_pmd_migration_entry(orig_pmd))
-				pmd_migration_entry_wait(mm, vmf.pmd);
+				pmd_migration_entry_wait(mm, vmf->pmd);
 			return 0;
 		}
 		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
-				return do_huge_pmd_numa_page(&vmf, orig_pmd);
+				return do_huge_pmd_numa_page(vmf, orig_pmd);
 
 			if (dirty && !pmd_write(orig_pmd)) {
-				ret = wp_huge_pmd(&vmf, orig_pmd);
+				ret = wp_huge_pmd(vmf, orig_pmd);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
 			} else {
-				huge_pmd_set_accessed(&vmf, orig_pmd);
+				huge_pmd_set_accessed(vmf, orig_pmd);
 				return 0;
 			}
 		}
 	}
 
-	return handle_pte_fault(&vmf);
+	return handle_pte_fault(vmf);
 }
 
 /*
@@ -4113,9 +4111,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
-vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
-		unsigned int flags)
+static vm_fault_t do_handle_mm_fault(struct vm_fault *vmf)
 {
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned int flags = vmf->flags;
 	vm_fault_t ret;
 
 	__set_current_state(TASK_RUNNING);
@@ -4139,9 +4138,9 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		mem_cgroup_enter_user_fault();
 
 	if (unlikely(is_vm_hugetlb_page(vma)))
-		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
+		ret = hugetlb_fault(vma->vm_mm, vma, vmf->address, flags);
 	else
-		ret = __handle_mm_fault(vma, address, flags);
+		ret = __handle_mm_fault(vmf);
 
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
@@ -4157,8 +4156,26 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 
 	return ret;
 }
+
+vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+			   unsigned int flags)
+{
+	struct vm_fault vmf = {};
+	vm_fault_t ret;
+
+	vm_fault_init(&vmf, vma, address, flags);
+	ret = do_handle_mm_fault(&vmf);
+	vm_fault_cleanup(&vmf);
+	return ret;
+}
 EXPORT_SYMBOL_GPL(handle_mm_fault);
 
+vm_fault_t handle_mm_fault_cacheable(struct vm_fault *vmf)
+{
+	return do_handle_mm_fault(vmf);
+}
+EXPORT_SYMBOL_GPL(handle_mm_fault_cacheable);
+
 #ifndef __PAGETABLE_P4D_FOLDED
 /*
  * Allocate p4d page table.
-- 
2.14.3



* [PATCH 2/7] mm: drop mmap_sem for page cache read IO submission
  2018-10-18 20:23 [PATCH 0/7][V3] drop the mmap_sem when doing IO in the fault path Josef Bacik
  2018-10-18 20:23 ` [PATCH 1/7] mm: infrastructure for page fault page caching Josef Bacik
@ 2018-10-18 20:23 ` Josef Bacik
  2018-10-19  3:14   ` Dave Chinner
  2018-10-18 20:23 ` [PATCH 3/7] mm: drop the mmap_sem in all read fault cases Josef Bacik
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2018-10-18 20:23 UTC (permalink / raw)
  To: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-btrfs, riel, linux-mm

From: Johannes Weiner <hannes@cmpxchg.org>

Reads can take a long time, and if anybody needs to take a write lock on
the mmap_sem it'll block any subsequent readers of the mmap_sem while
the read is outstanding, which could cause long delays.  Instead, drop
the mmap_sem if we do any read IO at all.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 mm/filemap.c | 119 ++++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 90 insertions(+), 29 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 52517f28e6f4..1ed35cd99b2c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2366,6 +2366,18 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 EXPORT_SYMBOL(generic_file_read_iter);
 
 #ifdef CONFIG_MMU
+static struct file *maybe_unlock_mmap_for_io(struct vm_area_struct *vma, int flags)
+{
+	if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) == FAULT_FLAG_ALLOW_RETRY) {
+		struct file *file;
+
+		file = get_file(vma->vm_file);
+		up_read(&vma->vm_mm->mmap_sem);
+		return file;
+	}
+	return NULL;
+}
+
 /**
  * page_cache_read - adds requested page to the page cache if not already there
  * @file:	file to read
@@ -2405,23 +2417,28 @@ static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
  * Synchronous readahead happens when we don't even find
  * a page in the page cache at all.
  */
-static void do_sync_mmap_readahead(struct vm_area_struct *vma,
-				   struct file_ra_state *ra,
-				   struct file *file,
-				   pgoff_t offset)
+static int do_sync_mmap_readahead(struct vm_area_struct *vma,
+				  struct file_ra_state *ra,
+				  struct file *file,
+				  pgoff_t offset,
+				  int flags)
 {
 	struct address_space *mapping = file->f_mapping;
+	struct file *fpin;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vma->vm_flags & VM_RAND_READ)
-		return;
+		return 0;
 	if (!ra->ra_pages)
-		return;
+		return 0;
 
 	if (vma->vm_flags & VM_SEQ_READ) {
+		fpin = maybe_unlock_mmap_for_io(vma, flags);
 		page_cache_sync_readahead(mapping, ra, file, offset,
 					  ra->ra_pages);
-		return;
+		if (fpin)
+			fput(fpin);
+		return fpin ? -EAGAIN : 0;
 	}
 
 	/* Avoid banging the cache line if not needed */
@@ -2433,7 +2450,9 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 	 * stop bothering with read-ahead. It will only hurt.
 	 */
 	if (ra->mmap_miss > MMAP_LOTSAMISS)
-		return;
+		return 0;
+
+	fpin = maybe_unlock_mmap_for_io(vma, flags);
 
 	/*
 	 * mmap read-around
@@ -2442,28 +2461,40 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 	ra->size = ra->ra_pages;
 	ra->async_size = ra->ra_pages / 4;
 	ra_submit(ra, mapping, file);
+
+	if (fpin)
+		fput(fpin);
+
+	return fpin ? -EAGAIN : 0;
 }
 
 /*
  * Asynchronous readahead happens when we find the page and PG_readahead,
  * so we want to possibly extend the readahead further..
  */
-static void do_async_mmap_readahead(struct vm_area_struct *vma,
-				    struct file_ra_state *ra,
-				    struct file *file,
-				    struct page *page,
-				    pgoff_t offset)
+static int do_async_mmap_readahead(struct vm_area_struct *vma,
+				   struct file_ra_state *ra,
+				   struct file *file,
+				   struct page *page,
+				   pgoff_t offset,
+				   int flags)
 {
 	struct address_space *mapping = file->f_mapping;
+	struct file *fpin;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vma->vm_flags & VM_RAND_READ)
-		return;
+		return 0;
 	if (ra->mmap_miss > 0)
 		ra->mmap_miss--;
-	if (PageReadahead(page))
-		page_cache_async_readahead(mapping, ra, file,
-					   page, offset, ra->ra_pages);
+	if (!PageReadahead(page))
+		return 0;
+	fpin = maybe_unlock_mmap_for_io(vma, flags);
+	page_cache_async_readahead(mapping, ra, file,
+				   page, offset, ra->ra_pages);
+	if (fpin)
+		fput(fpin);
+	return fpin ? -EAGAIN : 0;
 }
 
 /**
@@ -2479,10 +2510,8 @@ static void do_async_mmap_readahead(struct vm_area_struct *vma,
  *
  * vma->vm_mm->mmap_sem must be held on entry.
  *
- * If our return value has VM_FAULT_RETRY set, it's because
- * lock_page_or_retry() returned 0.
- * The mmap_sem has usually been released in this case.
- * See __lock_page_or_retry() for the exception.
+ * If our return value has VM_FAULT_RETRY set, the mmap_sem has
+ * usually been released.
  *
  * If our return value does not have VM_FAULT_RETRY set, the mmap_sem
  * has not been released.
@@ -2492,11 +2521,13 @@ static void do_async_mmap_readahead(struct vm_area_struct *vma,
 vm_fault_t filemap_fault(struct vm_fault *vmf)
 {
 	int error;
+	struct mm_struct *mm = vmf->vma->vm_mm;
 	struct file *file = vmf->vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
 	pgoff_t offset = vmf->pgoff;
+	int flags = vmf->flags;
 	pgoff_t max_off;
 	struct page *page;
 	vm_fault_t ret = 0;
@@ -2509,27 +2540,44 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 * Do we have something in the page cache already?
 	 */
 	page = find_get_page(mapping, offset);
-	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
+	if (likely(page) && !(flags & FAULT_FLAG_TRIED)) {
 		/*
 		 * We found the page, so try async readahead before
 		 * waiting for the lock.
 		 */
-		do_async_mmap_readahead(vmf->vma, ra, file, page, offset);
+		error = do_async_mmap_readahead(vmf->vma, ra, file, page, offset, vmf->flags);
+		if (error == -EAGAIN)
+			goto out_retry_wait;
 	} else if (!page) {
 		/* No page in the page cache at all */
-		do_sync_mmap_readahead(vmf->vma, ra, file, offset);
-		count_vm_event(PGMAJFAULT);
-		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
+		count_vm_event(PGMAJFAULT);
+		count_memcg_event_mm(mm, PGMAJFAULT);
+		error = do_sync_mmap_readahead(vmf->vma, ra, file, offset, vmf->flags);
+		if (error == -EAGAIN)
+			goto out_retry_wait;
 retry_find:
 		page = find_get_page(mapping, offset);
 		if (!page)
 			goto no_cached_page;
 	}
 
-	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
-		put_page(page);
-		return ret | VM_FAULT_RETRY;
+	if (!trylock_page(page)) {
+		if (flags & FAULT_FLAG_ALLOW_RETRY) {
+			if (flags & FAULT_FLAG_RETRY_NOWAIT)
+				goto out_retry;
+			up_read(&mm->mmap_sem);
+			goto out_retry_wait;
+		}
+		if (flags & FAULT_FLAG_KILLABLE) {
+			int ret = __lock_page_killable(page);
+
+			if (ret) {
+				up_read(&mm->mmap_sem);
+				goto out_retry;
+			}
+		} else
+			__lock_page(page);
 	}
 
 	/* Did it get truncated? */
@@ -2607,6 +2655,19 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	/* Things didn't work out. Return zero to tell the mm layer so. */
 	shrink_readahead_size_eio(file, ra);
 	return VM_FAULT_SIGBUS;
+
+out_retry_wait:
+	if (page) {
+		if (flags & FAULT_FLAG_KILLABLE)
+			wait_on_page_locked_killable(page);
+		else
+			wait_on_page_locked(page);
+	}
+
+out_retry:
+	if (page)
+		put_page(page);
+	return ret | VM_FAULT_RETRY;
 }
 EXPORT_SYMBOL(filemap_fault);
 
-- 
2.14.3



* [PATCH 3/7] mm: drop the mmap_sem in all read fault cases
  2018-10-18 20:23 [PATCH 0/7][V3] drop the mmap_sem when doing IO in the fault path Josef Bacik
  2018-10-18 20:23 ` [PATCH 1/7] mm: infrastructure for page fault page caching Josef Bacik
  2018-10-18 20:23 ` [PATCH 2/7] mm: drop mmap_sem for page cache read IO submission Josef Bacik
@ 2018-10-18 20:23 ` Josef Bacik
  2018-10-19  3:21   ` Dave Chinner
  2018-10-18 20:23 ` [PATCH 4/7] mm: use the cached page for filemap_fault Josef Bacik
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2018-10-18 20:23 UTC (permalink / raw)
  To: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-btrfs, riel, linux-mm

Johannes' patches didn't quite cover all of the IO cases that we need to
drop the mmap_sem for; this patch covers the rest of them.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 mm/filemap.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1ed35cd99b2c..65395ee132a0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2523,6 +2523,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	int error;
 	struct mm_struct *mm = vmf->vma->vm_mm;
 	struct file *file = vmf->vma->vm_file;
+	struct file *fpin = NULL;
 	struct address_space *mapping = file->f_mapping;
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
@@ -2610,11 +2611,15 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	return ret | VM_FAULT_LOCKED;
 
 no_cached_page:
+	fpin = maybe_unlock_mmap_for_io(vmf->vma, vmf->flags);
+
 	/*
 	 * We're only likely to ever get here if MADV_RANDOM is in
 	 * effect.
 	 */
 	error = page_cache_read(file, offset, vmf->gfp_mask);
+	if (fpin)
+		goto out_retry;
 
 	/*
 	 * The page we want has now been added to the page cache.
@@ -2634,6 +2639,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	return VM_FAULT_SIGBUS;
 
 page_not_uptodate:
+	fpin = maybe_unlock_mmap_for_io(vmf->vma, vmf->flags);
+
 	/*
 	 * Umm, take care of errors if the page isn't up-to-date.
 	 * Try to re-read it _once_. We do this synchronously,
@@ -2647,6 +2654,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		if (!PageUptodate(page))
 			error = -EIO;
 	}
+	if (fpin)
+		goto out_retry;
 	put_page(page);
 
 	if (!error || error == AOP_TRUNCATED_PAGE)
@@ -2665,6 +2674,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	}
 
 out_retry:
+	if (fpin)
+		fput(fpin);
 	if (page)
 		put_page(page);
 	return ret | VM_FAULT_RETRY;
-- 
2.14.3



* [PATCH 4/7] mm: use the cached page for filemap_fault
  2018-10-18 20:23 [PATCH 0/7][V3] drop the mmap_sem when doing IO in the fault path Josef Bacik
                   ` (2 preceding siblings ...)
  2018-10-18 20:23 ` [PATCH 3/7] mm: drop the mmap_sem in all read fault cases Josef Bacik
@ 2018-10-18 20:23 ` Josef Bacik
  2018-10-19  3:27   ` Dave Chinner
  2018-10-18 20:23 ` [PATCH 5/7] mm: add a flag to indicate we used a cached page Josef Bacik
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2018-10-18 20:23 UTC (permalink / raw)
  To: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-btrfs, riel, linux-mm

If we drop the mmap_sem we have to redo the vma lookup, which means going
through the fault handler again.  Chances are we will just come back to the
same page, so save this page in vmf->cached_page and reuse it on the next
loop through the fault handler.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 mm/filemap.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 65395ee132a0..5212ab637832 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2530,13 +2530,38 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	pgoff_t offset = vmf->pgoff;
 	int flags = vmf->flags;
 	pgoff_t max_off;
-	struct page *page;
+	struct page *page = NULL;
+	struct page *cached_page = vmf->cached_page;
 	vm_fault_t ret = 0;
 
 	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
 	if (unlikely(offset >= max_off))
 		return VM_FAULT_SIGBUS;
 
+	/*
+	 * We may have read in the page already and have a page from an earlier
+	 * loop.  If so we need to see if this page is still valid, and if not
+	 * do the whole dance over again.
+	 */
+	if (cached_page) {
+		if (flags & FAULT_FLAG_KILLABLE) {
+			error = lock_page_killable(cached_page);
+			if (error) {
+				up_read(&mm->mmap_sem);
+				goto out_retry;
+			}
+		} else
+			lock_page(cached_page);
+		vmf->cached_page = NULL;
+		if (cached_page->mapping == mapping &&
+		    cached_page->index == offset) {
+			page = cached_page;
+			goto have_cached_page;
+		}
+		unlock_page(cached_page);
+		put_page(cached_page);
+	}
+
 	/*
 	 * Do we have something in the page cache already?
 	 */
@@ -2587,6 +2612,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		put_page(page);
 		goto retry_find;
 	}
+have_cached_page:
 	VM_BUG_ON_PAGE(page->index != offset, page);
 
 	/*
@@ -2677,7 +2703,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	if (fpin)
 		fput(fpin);
 	if (page)
-		put_page(page);
+		vmf->cached_page = page;
 	return ret | VM_FAULT_RETRY;
 }
 EXPORT_SYMBOL(filemap_fault);
-- 
2.14.3



* [PATCH 5/7] mm: add a flag to indicate we used a cached page
  2018-10-18 20:23 [PATCH 0/7][V3] drop the mmap_sem when doing IO in the fault path Josef Bacik
                   ` (3 preceding siblings ...)
  2018-10-18 20:23 ` [PATCH 4/7] mm: use the cached page for filemap_fault Josef Bacik
@ 2018-10-18 20:23 ` Josef Bacik
  2018-10-19  3:34   ` Dave Chinner
  2018-10-18 20:23 ` [PATCH 6/7] mm: allow ->page_mkwrite to do retries Josef Bacik
  2018-10-18 20:23 ` [PATCH 7/7] btrfs: drop mmap_sem in mkwrite for btrfs Josef Bacik
  6 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2018-10-18 20:23 UTC (permalink / raw)
  To: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-btrfs, riel, linux-mm

This is preparation for dropping the mmap_sem in page_mkwrite.  We need to
know whether we used our cached page so we can be sure it is the page we
already did the page_mkwrite work on, and avoid redoing all of that work.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 include/linux/mm.h | 6 +++++-
 mm/filemap.c       | 5 ++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4a84ec976dfc..a7305d193c71 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -318,6 +318,9 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
+#define FAULT_FLAG_USED_CACHED	0x200	/* Our vmf->page was from a previous
+					 * loop through the fault handler.
+					 */
 
 #define FAULT_FLAG_TRACE \
 	{ FAULT_FLAG_WRITE,		"WRITE" }, \
@@ -328,7 +331,8 @@ extern pgprot_t protection_map[16];
 	{ FAULT_FLAG_TRIED,		"TRIED" }, \
 	{ FAULT_FLAG_USER,		"USER" }, \
 	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
-	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }
+	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
+	{ FAULT_FLAG_USED_CACHED,	"USED_CACHED" }
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
diff --git a/mm/filemap.c b/mm/filemap.c
index 5212ab637832..e9cb44bd35aa 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2556,6 +2556,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		if (cached_page->mapping == mapping &&
 		    cached_page->index == offset) {
 			page = cached_page;
+			vmf->flags |= FAULT_FLAG_USED_CACHED;
 			goto have_cached_page;
 		}
 		unlock_page(cached_page);
@@ -2619,8 +2620,10 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 * We have a locked page in the page cache, now we need to check
 	 * that it's up-to-date. If not, it is going to be due to an error.
 	 */
-	if (unlikely(!PageUptodate(page)))
+	if (unlikely(!PageUptodate(page))) {
+		vmf->flags &= ~(FAULT_FLAG_USED_CACHED);
 		goto page_not_uptodate;
+	}
 
 	/*
 	 * Found the page and have a reference on it.
-- 
2.14.3



* [PATCH 6/7] mm: allow ->page_mkwrite to do retries
  2018-10-18 20:23 [PATCH 0/7][V3] drop the mmap_sem when doing IO in the fault path Josef Bacik
                   ` (4 preceding siblings ...)
  2018-10-18 20:23 ` [PATCH 5/7] mm: add a flag to indicate we used a cached page Josef Bacik
@ 2018-10-18 20:23 ` Josef Bacik
  2018-10-19  3:36   ` Dave Chinner
  2018-10-18 20:23 ` [PATCH 7/7] btrfs: drop mmap_sem in mkwrite for btrfs Josef Bacik
  6 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2018-10-18 20:23 UTC (permalink / raw)
  To: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-btrfs, riel, linux-mm

Previously we didn't set the retry flag on our vm_fault.  We want to allow
filesystems to drop the mmap_sem if they so choose, so set this flag and
deal with VM_FAULT_RETRY appropriately.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 mm/memory.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 433075f722ea..c5e81edd94f9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2384,11 +2384,13 @@ static vm_fault_t do_page_mkwrite(struct vm_fault *vmf)
 	unsigned int old_flags = vmf->flags;
 
 	vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+	vmf->flags |= old_flags & FAULT_FLAG_ALLOW_RETRY;
 
 	ret = vmf->vma->vm_ops->page_mkwrite(vmf);
 	/* Restore original flags so that caller is not surprised */
 	vmf->flags = old_flags;
-	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
+	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+			    VM_FAULT_RETRY)))
 		return ret;
 	if (unlikely(!(ret & VM_FAULT_LOCKED))) {
 		lock_page(page);
@@ -2683,7 +2685,8 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		tmp = do_page_mkwrite(vmf);
 		if (unlikely(!tmp || (tmp &
-				      (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
+				      (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+				       VM_FAULT_RETRY)))) {
 			put_page(vmf->page);
 			return tmp;
 		}
@@ -3716,7 +3719,8 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 		unlock_page(vmf->page);
 		tmp = do_page_mkwrite(vmf);
 		if (unlikely(!tmp ||
-				(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
+				(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+					VM_FAULT_RETRY)))) {
 			put_page(vmf->page);
 			return tmp;
 		}
-- 
2.14.3



* [PATCH 7/7] btrfs: drop mmap_sem in mkwrite for btrfs
  2018-10-18 20:23 [PATCH 0/7][V3] drop the mmap_sem when doing IO in the fault path Josef Bacik
                   ` (5 preceding siblings ...)
  2018-10-18 20:23 ` [PATCH 6/7] mm: allow ->page_mkwrite to do retries Josef Bacik
@ 2018-10-18 20:23 ` Josef Bacik
  2018-10-19  3:48   ` Dave Chinner
  2018-10-25 13:22   ` Jan Kara
  6 siblings, 2 replies; 19+ messages in thread
From: Josef Bacik @ 2018-10-18 20:23 UTC (permalink / raw)
  To: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-btrfs, riel, linux-mm

->page_mkwrite is extremely expensive in btrfs.  We have to reserve
space, which can take 6 lifetimes, and we could possibly have to wait on
writeback on the page, another several lifetimes.  To avoid this, simply
drop the mmap_sem if we didn't have the cached page, do all of our work,
and return the appropriate retry error.  If we have the cached page we
know we did all the right things to set it up and we can just carry on.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c   | 41 +++++++++++++++++++++++++++++++++++++++--
 include/linux/mm.h | 14 ++++++++++++++
 mm/filemap.c       |  3 ++-
 3 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3ea5339603cf..6b723d29bc0c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8809,7 +8809,9 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 {
 	struct page *page = vmf->page;
-	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct file *file = vmf->vma->vm_file, *fpin;
+	struct mm_struct *mm = vmf->vma->vm_mm;
+	struct inode *inode = file_inode(file);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_ordered_extent *ordered;
@@ -8828,6 +8830,29 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 
 	reserved_space = PAGE_SIZE;
 
+	/*
+	 * We have our cached page from a previous mkwrite, check it to make
+	 * sure it's still dirty and our file size matches when we ran mkwrite
+	 * the last time.  If everything is OK then return VM_FAULT_LOCKED,
+	 * otherwise do the mkwrite again.
+	 */
+	if (vmf->flags & FAULT_FLAG_USED_CACHED) {
+		lock_page(page);
+		if (vmf->cached_size == i_size_read(inode) &&
+		    PageDirty(page))
+			return VM_FAULT_LOCKED;
+		unlock_page(page);
+	}
+
+	/*
+	 * mkwrite is extremely expensive, and we are holding the mmap_sem
+	 * during this, which means we can starve out anybody trying to
+	 * down_write(mmap_sem) for a long while, especially if we throw cgroups
+	 * into the mix.  So just drop the mmap_sem and do all of our work,
+	 * we'll loop back through and verify everything is ok the next time and
+	 * hopefully avoid doing the work twice.
+	 */
+	fpin = maybe_unlock_mmap_for_io(vmf->vma, vmf->flags);
 	sb_start_pagefault(inode->i_sb);
 	page_start = page_offset(page);
 	page_end = page_start + PAGE_SIZE - 1;
@@ -8844,7 +8869,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 	ret2 = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
 					   reserved_space);
 	if (!ret2) {
-		ret2 = file_update_time(vmf->vma->vm_file);
+		ret2 = file_update_time(file);
 		reserved = 1;
 	}
 	if (ret2) {
@@ -8943,6 +8968,14 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, true);
 		sb_end_pagefault(inode->i_sb);
 		extent_changeset_free(data_reserved);
+		if (fpin) {
+			unlock_page(page);
+			fput(fpin);
+			get_page(page);
+			vmf->cached_size = size;
+			vmf->cached_page = page;
+			return VM_FAULT_RETRY;
+		}
 		return VM_FAULT_LOCKED;
 	}
 
@@ -8955,6 +8988,10 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 out_noreserve:
 	sb_end_pagefault(inode->i_sb);
 	extent_changeset_free(data_reserved);
+	if (fpin) {
+		fput(fpin);
+		down_read(&mm->mmap_sem);
+	}
 	return ret;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7305d193c71..02b420be6b06 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -370,6 +370,13 @@ struct vm_fault {
 					 * next time we loop through the fault
 					 * handler for faster lookup.
 					 */
+	loff_t cached_size;		/* ->page_mkwrite handlers may drop
+					 * the mmap_sem to avoid starvation, in
+					 * which case they need to save the
+					 * i_size in order to verify the cached
+					 * page we're using the next loop
+					 * through hasn't changed under us.
+					 */
 	/* These three entries are valid only while holding ptl lock */
 	pte_t *pte;			/* Pointer to pte entry matching
 					 * the 'address'. NULL if the page
@@ -1437,6 +1444,8 @@ extern vm_fault_t handle_mm_fault_cacheable(struct vm_fault *vmf);
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
+extern struct file *maybe_unlock_mmap_for_io(struct vm_area_struct *vma,
+					     int flags);
 void unmap_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t nr, bool even_cows);
 void unmap_mapping_range(struct address_space *mapping,
@@ -1463,6 +1472,11 @@ static inline int fixup_user_fault(struct task_struct *tsk,
 	BUG();
 	return -EFAULT;
 }
+static inline struct file *maybe_unlock_mmap_for_io(struct vm_area_struct *vma,
+						    int flags)
+{
+	return NULL;
+}
 static inline void unmap_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t nr, bool even_cows) { }
 static inline void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index e9cb44bd35aa..8027f082d74f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2366,7 +2366,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 EXPORT_SYMBOL(generic_file_read_iter);
 
 #ifdef CONFIG_MMU
-static struct file *maybe_unlock_mmap_for_io(struct vm_area_struct *vma, int flags)
+struct file *maybe_unlock_mmap_for_io(struct vm_area_struct *vma, int flags)
 {
 	if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) == FAULT_FLAG_ALLOW_RETRY) {
 		struct file *file;
@@ -2377,6 +2377,7 @@ static struct file *maybe_unlock_mmap_for_io(struct vm_area_struct *vma, int fla
 	}
 	return NULL;
 }
+EXPORT_SYMBOL_GPL(maybe_unlock_mmap_for_io);
 
 /**
  * page_cache_read - adds requested page to the page cache if not already there
-- 
2.14.3



* Re: [PATCH 2/7] mm: drop mmap_sem for page cache read IO submission
  2018-10-18 20:23 ` [PATCH 2/7] mm: drop mmap_sem for page cache read IO submission Josef Bacik
@ 2018-10-19  3:14   ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2018-10-19  3:14 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, hannes, linux-kernel, tj, akpm, linux-fsdevel,
	linux-btrfs, riel, linux-mm

On Thu, Oct 18, 2018 at 04:23:13PM -0400, Josef Bacik wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> Reads can take a long time, and if anybody needs to take a write lock on
> the mmap_sem it'll block any subsequent readers to the mmap_sem while
> the read is outstanding, which could cause long delays.  Instead drop
> the mmap_sem if we do any reads at all.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
.....
>  vm_fault_t filemap_fault(struct vm_fault *vmf)
>  {
>  	int error;
> +	struct mm_struct *mm = vmf->vma->vm_mm;
>  	struct file *file = vmf->vma->vm_file;
>  	struct address_space *mapping = file->f_mapping;
>  	struct file_ra_state *ra = &file->f_ra;
>  	struct inode *inode = mapping->host;
>  	pgoff_t offset = vmf->pgoff;
> +	int flags = vmf->flags;

local copy of flags.

>  	pgoff_t max_off;
>  	struct page *page;
>  	vm_fault_t ret = 0;
> @@ -2509,27 +2540,44 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	 * Do we have something in the page cache already?
>  	 */
>  	page = find_get_page(mapping, offset);
> -	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
> +	if (likely(page) && !(flags & FAULT_FLAG_TRIED)) {

Used here.

>  		/*
>  		 * We found the page, so try async readahead before
>  		 * waiting for the lock.
>  		 */
> -		do_async_mmap_readahead(vmf->vma, ra, file, page, offset);
> +		error = do_async_mmap_readahead(vmf->vma, ra, file, page, offset, vmf->flags);

Not here.

> +		if (error == -EAGAIN)
> +			goto out_retry_wait;
>  	} else if (!page) {
>  		/* No page in the page cache at all */
> -		do_sync_mmap_readahead(vmf->vma, ra, file, offset);
> -		count_vm_event(PGMAJFAULT);
> -		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
>  		ret = VM_FAULT_MAJOR;
> +		count_vm_event(PGMAJFAULT);
> +		count_memcg_event_mm(mm, PGMAJFAULT);
> +		error = do_sync_mmap_readahead(vmf->vma, ra, file, offset, vmf->flags);

or here.

(Also, the vmf is passed through to where these flags
are used, so why is it passed as a separate flag parameter?)

> +		if (error == -EAGAIN)
> +			goto out_retry_wait;
>  retry_find:
>  		page = find_get_page(mapping, offset);
>  		if (!page)
>  			goto no_cached_page;
>  	}
>  
> -	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> -		put_page(page);
> -		return ret | VM_FAULT_RETRY;
> +	if (!trylock_page(page)) {
> +		if (flags & FAULT_FLAG_ALLOW_RETRY) {
> +			if (flags & FAULT_FLAG_RETRY_NOWAIT)
> +				goto out_retry;
> +			up_read(&mm->mmap_sem);
> +			goto out_retry_wait;
> +		}
> +		if (flags & FAULT_FLAG_KILLABLE) {

but is used here...

> +			int ret = __lock_page_killable(page);
> +
> +			if (ret) {
> +				up_read(&mm->mmap_sem);
> +				goto out_retry;
> +			}
> +		} else
> +			__lock_page(page);
>  	}
>  
>  	/* Did it get truncated? */
> @@ -2607,6 +2655,19 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	/* Things didn't work out. Return zero to tell the mm layer so. */
>  	shrink_readahead_size_eio(file, ra);
>  	return VM_FAULT_SIGBUS;
> +
> +out_retry_wait:
> +	if (page) {
> +		if (flags & FAULT_FLAG_KILLABLE)

and here.

Any reason for this discrepancy?

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/7] mm: drop the mmap_sem in all read fault cases
  2018-10-18 20:23 ` [PATCH 3/7] mm: drop the mmap_sem in all read fault cases Josef Bacik
@ 2018-10-19  3:21   ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2018-10-19  3:21 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, hannes, linux-kernel, tj, akpm, linux-fsdevel,
	linux-btrfs, riel, linux-mm

On Thu, Oct 18, 2018 at 04:23:14PM -0400, Josef Bacik wrote:
> Johannes' patches didn't quite cover all of the IO cases that we need to
> drop the mmap_sem for, this patch covers the rest of them.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  mm/filemap.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1ed35cd99b2c..65395ee132a0 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2523,6 +2523,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	int error;
>  	struct mm_struct *mm = vmf->vma->vm_mm;
>  	struct file *file = vmf->vma->vm_file;
> +	struct file *fpin = NULL;
>  	struct address_space *mapping = file->f_mapping;
>  	struct file_ra_state *ra = &file->f_ra;
>  	struct inode *inode = mapping->host;
> @@ -2610,11 +2611,15 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	return ret | VM_FAULT_LOCKED;
>  
>  no_cached_page:
> +	fpin = maybe_unlock_mmap_for_io(vmf->vma, vmf->flags);
> +
>  	/*
>  	 * We're only likely to ever get here if MADV_RANDOM is in
>  	 * effect.
>  	 */
>  	error = page_cache_read(file, offset, vmf->gfp_mask);
> +	if (fpin)
> +		goto out_retry;

Please put the unlock after the comment explaining the goto label
so it's clear that the pin is associated only with the read
operations like so:

no_cached_page:
	/*
	 * We're only likely to ever get here if MADV_RANDOM is in
	 * effect.
	 */
	fpin = maybe_unlock_mmap_for_io(vmf->vma, vmf->flags);
	error = page_cache_read(file, offset, vmf->gfp_mask);
	if (fpin)
		goto out_retry;
>  
>  	/*
>  	 * The page we want has now been added to the page cache.
> @@ -2634,6 +2639,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	return VM_FAULT_SIGBUS;
>  
>  page_not_uptodate:
> +	fpin = maybe_unlock_mmap_for_io(vmf->vma, vmf->flags);
> +
>  	/*
>  	 * Umm, take care of errors if the page isn't up-to-date.
>  	 * Try to re-read it _once_. We do this synchronously,

Same here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/7] mm: use the cached page for filemap_fault
  2018-10-18 20:23 ` [PATCH 4/7] mm: use the cached page for filemap_fault Josef Bacik
@ 2018-10-19  3:27   ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2018-10-19  3:27 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, hannes, linux-kernel, tj, akpm, linux-fsdevel,
	linux-btrfs, riel, linux-mm

On Thu, Oct 18, 2018 at 04:23:15PM -0400, Josef Bacik wrote:
> If we drop the mmap_sem we have to redo the vma lookup which requires
> redoing the fault handler.  Chances are we will just come back to the
> same page, so save this page in our vmf->cached_page and reuse it in the
> next loop through the fault handler.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  mm/filemap.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 65395ee132a0..5212ab637832 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2530,13 +2530,38 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	pgoff_t offset = vmf->pgoff;
>  	int flags = vmf->flags;
>  	pgoff_t max_off;
> -	struct page *page;
> +	struct page *page = NULL;
> +	struct page *cached_page = vmf->cached_page;
>  	vm_fault_t ret = 0;
>  
>  	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
>  	if (unlikely(offset >= max_off))
>  		return VM_FAULT_SIGBUS;
>  
> +	/*
> +	 * We may have read in the page already and have a page from an earlier
> +	 * loop.  If so we need to see if this page is still valid, and if not
> +	 * do the whole dance over again.
> +	 */
> +	if (cached_page) {
> +		if (flags & FAULT_FLAG_KILLABLE) {
> +			error = lock_page_killable(cached_page);
> +			if (error) {
> +				up_read(&mm->mmap_sem);
> +				goto out_retry;
> +			}
> +		} else
> +			lock_page(cached_page);
> +		vmf->cached_page = NULL;
> +		if (cached_page->mapping == mapping &&
> +		    cached_page->index == offset) {
> +			page = cached_page;
> +			goto have_cached_page;
> +		}
> +		unlock_page(cached_page);
> +		put_page(cached_page);
> +	}
> +

Can you factor this out so the main code path doesn't get any more
complex than it already is? i.e. something like:

	error = vmf_has_cached_page(vmf, &page);
	if (error)
		goto out_retry;
	if (page)
		goto have_cached_page;
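
For reference, such a helper would basically just lift that block out of
filemap_fault(), something like (rough sketch only):

	static int vmf_has_cached_page(struct vm_fault *vmf, struct page **pagep)
	{
		struct page *page = vmf->cached_page;
		struct mm_struct *mm = vmf->vma->vm_mm;

		*pagep = NULL;
		if (!page)
			return 0;

		if (vmf->flags & FAULT_FLAG_KILLABLE) {
			int error = lock_page_killable(page);

			if (error) {
				up_read(&mm->mmap_sem);
				return error;
			}
		} else
			lock_page(page);
		vmf->cached_page = NULL;
		if (page->mapping == vmf->vma->vm_file->f_mapping &&
		    page->index == vmf->pgoff) {
			*pagep = page;
		} else {
			unlock_page(page);
			put_page(page);
		}
		return 0;
	}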

-dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/7] mm: add a flag to indicate we used a cached page
  2018-10-18 20:23 ` [PATCH 5/7] mm: add a flag to indicate we used a cached page Josef Bacik
@ 2018-10-19  3:34   ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2018-10-19  3:34 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, hannes, linux-kernel, tj, akpm, linux-fsdevel,
	linux-btrfs, riel, linux-mm

On Thu, Oct 18, 2018 at 04:23:16PM -0400, Josef Bacik wrote:
> This is preparation for dropping the mmap_sem in page_mkwrite.  We need
> to know if we used our cached page so we can be sure it is the page we
> already did the page_mkwrite stuff on so we don't have to redo all of
> that work.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  include/linux/mm.h | 6 +++++-
>  mm/filemap.c       | 5 ++++-
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4a84ec976dfc..a7305d193c71 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -318,6 +318,9 @@ extern pgprot_t protection_map[16];
>  #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
>  #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
>  #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
> +#define FAULT_FLAG_USED_CACHED	0x200	/* Our vmf->page was from a previous
> +					 * loop through the fault handler.
> +					 */
>  
>  #define FAULT_FLAG_TRACE \
>  	{ FAULT_FLAG_WRITE,		"WRITE" }, \
> @@ -328,7 +331,8 @@ extern pgprot_t protection_map[16];
>  	{ FAULT_FLAG_TRIED,		"TRIED" }, \
>  	{ FAULT_FLAG_USER,		"USER" }, \
>  	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
> -	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }
> +	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
> +	{ FAULT_FLAG_USED_CACHED,	"USED_CACHED" }
>  
>  /*
>   * vm_fault is filled by the the pagefault handler and passed to the vma's
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 5212ab637832..e9cb44bd35aa 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2556,6 +2556,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  		if (cached_page->mapping == mapping &&
>  		    cached_page->index == offset) {
>  			page = cached_page;
> +			vmf->flags |= FAULT_FLAG_USED_CACHED;

This is really saying the page has been initialised by a prior
fault attempt, not that "we used a cached page". "cached page" is a
horribly overloaded term - I suspect we should not overload it more,
especially as the flag gets cleared if the cached page is not up to
date (i.e. the data on it hasn't been fully initialised).

FAULT_FLAG_PAGE_INITIALISED?

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 6/7] mm: allow ->page_mkwrite to do retries
  2018-10-18 20:23 ` [PATCH 6/7] mm: allow ->page_mkwrite to do retries Josef Bacik
@ 2018-10-19  3:36   ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2018-10-19  3:36 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, hannes, linux-kernel, tj, akpm, linux-fsdevel,
	linux-btrfs, riel, linux-mm

On Thu, Oct 18, 2018 at 04:23:17PM -0400, Josef Bacik wrote:
> Before we didn't set the retry flag on our vm_fault.  We want to allow
> file systems to drop the mmap_sem if they so choose, so set this flag
> and deal with VM_FAULT_RETRY appropriately.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  mm/memory.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 433075f722ea..c5e81edd94f9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2384,11 +2384,13 @@ static vm_fault_t do_page_mkwrite(struct vm_fault *vmf)
>  	unsigned int old_flags = vmf->flags;
>  
>  	vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
> +	vmf->flags |= old_flags & FAULT_FLAG_ALLOW_RETRY;
>  
>  	ret = vmf->vma->vm_ops->page_mkwrite(vmf);
>  	/* Restore original flags so that caller is not surprised */
>  	vmf->flags = old_flags;


> -	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
> +	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
> +			    VM_FAULT_RETRY)))

Mess.

#define __FAIL_FLAGS	(VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)
	if (ret & __FAIL_FLAGS)

Should kill the unlikely() at the same time.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 7/7] btrfs: drop mmap_sem in mkwrite for btrfs
  2018-10-18 20:23 ` [PATCH 7/7] btrfs: drop mmap_sem in mkwrite for btrfs Josef Bacik
@ 2018-10-19  3:48   ` Dave Chinner
  2018-10-22 17:56     ` Josef Bacik
  2018-10-25 13:22   ` Jan Kara
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2018-10-19  3:48 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, hannes, linux-kernel, tj, akpm, linux-fsdevel,
	linux-btrfs, riel, linux-mm

On Thu, Oct 18, 2018 at 04:23:18PM -0400, Josef Bacik wrote:
> ->page_mkwrite is extremely expensive in btrfs.  We have to reserve
> space, which can take 6 lifetimes, and we could possibly have to wait on
> writeback on the page, another several lifetimes.  To avoid this simply
> drop the mmap_sem if we didn't have the cached page and do all of our
> work and return the appropriate retry error.  If we have the cached page
> we know we did all the right things to set this page up and we can just
> carry on.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/inode.c   | 41 +++++++++++++++++++++++++++++++++++++++--
>  include/linux/mm.h | 14 ++++++++++++++
>  mm/filemap.c       |  3 ++-
>  3 files changed, 55 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 3ea5339603cf..6b723d29bc0c 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8809,7 +8809,9 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>  vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>  {
>  	struct page *page = vmf->page;
> -	struct inode *inode = file_inode(vmf->vma->vm_file);
> +	struct file *file = vmf->vma->vm_file, *fpin;
> +	struct mm_struct *mm = vmf->vma->vm_mm;
> +	struct inode *inode = file_inode(file);
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
>  	struct btrfs_ordered_extent *ordered;
> @@ -8828,6 +8830,29 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>  
>  	reserved_space = PAGE_SIZE;
>  
> +	/*
> +	 * We have our cached page from a previous mkwrite, check it to make
> +	 * sure it's still dirty and our file size matches when we ran mkwrite
> +	 * the last time.  If everything is OK then return VM_FAULT_LOCKED,
> +	 * otherwise do the mkwrite again.
> +	 */
> +	if (vmf->flags & FAULT_FLAG_USED_CACHED) {
> +		lock_page(page);
> +		if (vmf->cached_size == i_size_read(inode) &&
> +		    PageDirty(page))
> +			return VM_FAULT_LOCKED;
> +		unlock_page(page);
> +	}

What does the file size have to do with whether we can use the
initialised page or not? The file can be extended by other
data operations (like write()) while a page fault is in progress,
so I'm not sure how or why this check makes any sense.

I also don't see anything btrfs specific here, so....

> +	/*
> +	 * mkwrite is extremely expensive, and we are holding the mmap_sem
> +	 * during this, which means we can starve out anybody trying to
> +	 * down_write(mmap_sem) for a long while, especially if we throw cgroups
> +	 * into the mix.  So just drop the mmap_sem and do all of our work,
> +	 * we'll loop back through and verify everything is ok the next time and
> +	 * hopefully avoid doing the work twice.
> +	 */
> +	fpin = maybe_unlock_mmap_for_io(vmf->vma, vmf->flags);

why can't all this be done by the caller of ->page_mkwrite?

>  	sb_start_pagefault(inode->i_sb);
>  	page_start = page_offset(page);
>  	page_end = page_start + PAGE_SIZE - 1;
> @@ -8844,7 +8869,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>  	ret2 = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
>  					   reserved_space);
>  	if (!ret2) {
> -		ret2 = file_update_time(vmf->vma->vm_file);
> +		ret2 = file_update_time(file);
>  		reserved = 1;
>  	}
>  	if (ret2) {
> @@ -8943,6 +8968,14 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>  		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, true);
>  		sb_end_pagefault(inode->i_sb);
>  		extent_changeset_free(data_reserved);
> +		if (fpin) {
> +			unlock_page(page);
> +			fput(fpin);
> +			get_page(page);
> +			vmf->cached_size = size;
> +			vmf->cached_page = page;
> +			return VM_FAULT_RETRY;
> +		}
>  		return VM_FAULT_LOCKED;

And this can all be done by the caller, too.

I'm not seeing anything btrfs specific here - maybe I'm missing
something, but this all looks like it should be in the high level
mm/ code, not in the filesystem code.

>  	}
>  
> @@ -8955,6 +8988,10 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>  out_noreserve:
>  	sb_end_pagefault(inode->i_sb);
>  	extent_changeset_free(data_reserved);
> +	if (fpin) {
> +		fput(fpin);
> +		down_read(&mm->mmap_sem);
> +	}
>  	return ret;
>  }
>  
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a7305d193c71..02b420be6b06 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -370,6 +370,13 @@ struct vm_fault {
>  					 * next time we loop through the fault
>  					 * handler for faster lookup.
>  					 */
> +	loff_t cached_size;		/* ->page_mkwrite handlers may drop
> +					 * the mmap_sem to avoid starvation, in
> +					 * which case they need to save the
> +					 * i_size in order to verify the cached
> +					 * page we're using the next loop
> +					 * through hasn't changed under us.
> +					 */

You still haven't explained what the size verification is actually
required for.

>  	/* These three entries are valid only while holding ptl lock */
>  	pte_t *pte;			/* Pointer to pte entry matching
>  					 * the 'address'. NULL if the page
> @@ -1437,6 +1444,8 @@ extern vm_fault_t handle_mm_fault_cacheable(struct vm_fault *vmf);
>  extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>  			    unsigned long address, unsigned int fault_flags,
>  			    bool *unlocked);
> +extern struct file *maybe_unlock_mmap_for_io(struct vm_area_struct *vma,
> +					     int flags);
>  void unmap_mapping_pages(struct address_space *mapping,
>  		pgoff_t start, pgoff_t nr, bool even_cows);
>  void unmap_mapping_range(struct address_space *mapping,
> @@ -1463,6 +1472,11 @@ static inline int fixup_user_fault(struct task_struct *tsk,
>  	BUG();
>  	return -EFAULT;
>  }
> +static inline struct file *maybe_unlock_mmap_for_io(struct vm_area_struct *vma,
> +						    int flags)
> +{
> +	return NULL;
> +}
>  static inline void unmap_mapping_pages(struct address_space *mapping,
>  		pgoff_t start, pgoff_t nr, bool even_cows) { }
>  static inline void unmap_mapping_range(struct address_space *mapping,
> diff --git a/mm/filemap.c b/mm/filemap.c
> index e9cb44bd35aa..8027f082d74f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2366,7 +2366,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  EXPORT_SYMBOL(generic_file_read_iter);
>  
>  #ifdef CONFIG_MMU
> -static struct file *maybe_unlock_mmap_for_io(struct vm_area_struct *vma, int flags)
> +struct file *maybe_unlock_mmap_for_io(struct vm_area_struct *vma, int flags)
>  {
>  	if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) == FAULT_FLAG_ALLOW_RETRY) {
>  		struct file *file;
> @@ -2377,6 +2377,7 @@ static struct file *maybe_unlock_mmap_for_io(struct vm_area_struct *vma, int fla
>  	}
>  	return NULL;
>  }
> +EXPORT_SYMBOL_GPL(maybe_unlock_mmap_for_io);

These API mods (if this functionality remains in the filesystem
code) belong in whatever patch introduced this function.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/7] btrfs: drop mmap_sem in mkwrite for btrfs
  2018-10-19  3:48   ` Dave Chinner
@ 2018-10-22 17:56     ` Josef Bacik
  2018-10-22 21:31       ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2018-10-22 17:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Josef Bacik, kernel-team, hannes, linux-kernel, tj, akpm,
	linux-fsdevel, linux-btrfs, riel, linux-mm

On Fri, Oct 19, 2018 at 02:48:47PM +1100, Dave Chinner wrote:
> On Thu, Oct 18, 2018 at 04:23:18PM -0400, Josef Bacik wrote:
> > ->page_mkwrite is extremely expensive in btrfs.  We have to reserve
> > space, which can take 6 lifetimes, and we could possibly have to wait on
> > writeback on the page, another several lifetimes.  To avoid this simply
> > drop the mmap_sem if we didn't have the cached page and do all of our
> > work and return the appropriate retry error.  If we have the cached page
> > we know we did all the right things to set this page up and we can just
> > carry on.
> > 
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >  fs/btrfs/inode.c   | 41 +++++++++++++++++++++++++++++++++++++++--
> >  include/linux/mm.h | 14 ++++++++++++++
> >  mm/filemap.c       |  3 ++-
> >  3 files changed, 55 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 3ea5339603cf..6b723d29bc0c 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -8809,7 +8809,9 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
> >  vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
> >  {
> >  	struct page *page = vmf->page;
> > -	struct inode *inode = file_inode(vmf->vma->vm_file);
> > +	struct file *file = vmf->vma->vm_file, *fpin;
> > +	struct mm_struct *mm = vmf->vma->vm_mm;
> > +	struct inode *inode = file_inode(file);
> >  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >  	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> >  	struct btrfs_ordered_extent *ordered;
> > @@ -8828,6 +8830,29 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
> >  
> >  	reserved_space = PAGE_SIZE;
> >  
> > +	/*
> > +	 * We have our cached page from a previous mkwrite, check it to make
> > +	 * sure it's still dirty and our file size matches when we ran mkwrite
> > +	 * the last time.  If everything is OK then return VM_FAULT_LOCKED,
> > +	 * otherwise do the mkwrite again.
> > +	 */
> > +	if (vmf->flags & FAULT_FLAG_USED_CACHED) {
> > +		lock_page(page);
> > +		if (vmf->cached_size == i_size_read(inode) &&
> > +		    PageDirty(page))
> > +			return VM_FAULT_LOCKED;
> > +		unlock_page(page);
> > +	}
> 
> What does the file size have to do with whether we can use the
> initialised page or not? The file can be extended by other
> data operations (like write()) while a page fault is in progress,
> so I'm not sure how or why this check makes any sense.
> 
> I also don't see anything btrfs specific here, so....
> 

First thanks for the review, I've gone through and addressed everything you
mentioned, however this one is subtle.

The problem is the vmf->vma->vm_file access.  Once we drop the mmap_sem we can
no longer safely go into vmf->vma, so I'd have to fix all the page_mkwrite()'s
to not touch vma, and add a vmf->fpin instead to mess with. Plus I didn't want
to miss some subtlety in other fs's page_mkwrite()'s and inadvertently break them.
If I break btrfs I can fix it; I'm not as good with xfs.

If you want this in the generic layer and not in the fs I can go back and add a
vmf->fpin and vmf->mm, and then fix up all the mkwrite()'s to use that instead
of vmf->vma.  I think that's the only things we care about so it wouldn't be
hard.  Does that sound reasonable to you?
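
Roughly what I have in mind, purely as a sketch (the helper name and the
new struct vm_fault fields below are invented for illustration, they're
not in the posted series):

	/*
	 * Hypothetical: pin everything ->page_mkwrite needs from the vma up
	 * front, while the mmap_sem is still held, so the handler can later
	 * drop the mmap_sem without ever dereferencing vmf->vma again.
	 * ->fpin and ->mm would be new struct vm_fault members.
	 */
	static void vmf_pin_fault_state(struct vm_fault *vmf)
	{
		vmf->fpin = get_file(vmf->vma->vm_file);
		vmf->mm = vmf->vma->vm_mm;
	}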

Also the size thing was just a paranoid way to make sure everything had stayed
exactly the same, but you're right; I'll just keep it consistent with all of our
other "is this page ok" checks.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/7] btrfs: drop mmap_sem in mkwrite for btrfs
  2018-10-22 17:56     ` Josef Bacik
@ 2018-10-22 21:31       ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2018-10-22 21:31 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, hannes, linux-kernel, tj, akpm, linux-fsdevel,
	linux-btrfs, riel, linux-mm

On Mon, Oct 22, 2018 at 01:56:54PM -0400, Josef Bacik wrote:
> On Fri, Oct 19, 2018 at 02:48:47PM +1100, Dave Chinner wrote:
> > On Thu, Oct 18, 2018 at 04:23:18PM -0400, Josef Bacik wrote:
> > > ->page_mkwrite is extremely expensive in btrfs.  We have to reserve
> > > space, which can take 6 lifetimes, and we could possibly have to wait on
> > > writeback on the page, another several lifetimes.  To avoid this simply
> > > drop the mmap_sem if we didn't have the cached page and do all of our
> > > work and return the appropriate retry error.  If we have the cached page
> > > we know we did all the right things to set this page up and we can just
> > > carry on.
> > > 
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > ---
> > >  fs/btrfs/inode.c   | 41 +++++++++++++++++++++++++++++++++++++++--
> > >  include/linux/mm.h | 14 ++++++++++++++
> > >  mm/filemap.c       |  3 ++-
> > >  3 files changed, 55 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 3ea5339603cf..6b723d29bc0c 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -8809,7 +8809,9 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
> > >  vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
> > >  {
> > >  	struct page *page = vmf->page;
> > > -	struct inode *inode = file_inode(vmf->vma->vm_file);
> > > +	struct file *file = vmf->vma->vm_file, *fpin;
> > > +	struct mm_struct *mm = vmf->vma->vm_mm;
> > > +	struct inode *inode = file_inode(file);
> > >  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > >  	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > >  	struct btrfs_ordered_extent *ordered;
> > > @@ -8828,6 +8830,29 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
> > >  
> > >  	reserved_space = PAGE_SIZE;
> > >  
> > > +	/*
> > > +	 * We have our cached page from a previous mkwrite, check it to make
> > > +	 * sure it's still dirty and our file size matches when we ran mkwrite
> > > +	 * the last time.  If everything is OK then return VM_FAULT_LOCKED,
> > > +	 * otherwise do the mkwrite again.
> > > +	 */
> > > +	if (vmf->flags & FAULT_FLAG_USED_CACHED) {
> > > +		lock_page(page);
> > > +		if (vmf->cached_size == i_size_read(inode) &&
> > > +		    PageDirty(page))
> > > +			return VM_FAULT_LOCKED;
> > > +		unlock_page(page);
> > > +	}
> > 
> > What does the file size have to do with whether we can use the
> > initialised page or not? The file can be extended by other
> > data operations (like write()) while a page fault is in progress,
> > so I'm not sure how or why this check makes any sense.
> > 
> > I also don't see anything btrfs specific here, so....
> > 
> 
> First thanks for the review, I've gone through and addressed everything you
> mentioned, however this one is subtle.
> 
> The problem is the vmf->vma->vm_file access.  Once we drop the mmap_sem we can
> no longer safely go into vmf->vma, so I'd have to fix all the page_mkwrite()'s
> to not touch vma, and add a vmf->fpin instead to mess with.

Adding a vmf->file pointer seems pretty easy - making /everything/
use it instead of just special casing page_mkwrite also seems like
it would be a good idea - set it up in your new init function and
it's done. Pinning and unpinning could be done unconditionally, too
- that doesn't look expensive - and it would pull a lot of the
complexity out of the patchset for the cases where unlocking the
mmap_sem is done....

> Plus I didn't want
> to miss some subtlety in other fs's page_mkwrite()'s and inavertedly break them.
> If I break btrfs I can fix it, I'm not as good with xfs.

Sure, but if you make it a generic helper then you don't have to
worry about that. i.e.

generic_page_mkwrite_nommapsem(vmf, page_mkwrite_cb)
{
	/* use cached page if valid */

	/* unlock mmap_sem */

	/* do filesystem page_mkwrite callback */
	ret = page_mkwrite_cb(vmf);

	/* handle page caching */

	/* return retry fault indicator */
}

and that allows all filesystems to easily opt in to dropping the
mmap_sem and you don't have to worry about it too much. It means
filesystems with dax fault paths (like XFS) can convert their non-dax
paths immediately, and worry about the more tricky stuff
later...
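
Fleshed out a little, still only a sketch built from the pieces already
in this series (maybe_unlock_mmap_for_io(), FAULT_FLAG_USED_CACHED,
vmf->cached_page), so the structure and error handling are illustrative
rather than tested code:

static vm_fault_t generic_page_mkwrite_nommapsem(struct vm_fault *vmf,
			vm_fault_t (*page_mkwrite_cb)(struct vm_fault *))
{
	struct mm_struct *mm = vmf->vma->vm_mm;	/* saved before any unlock */
	struct file *fpin;
	vm_fault_t ret;

	/* Retry pass: the cached page was already prepared last time. */
	if (vmf->flags & FAULT_FLAG_USED_CACHED) {
		lock_page(vmf->page);
		if (PageDirty(vmf->page))
			return VM_FAULT_LOCKED;
		unlock_page(vmf->page);
	}

	/* Pin the file and drop the mmap_sem if a retry is allowed. */
	fpin = maybe_unlock_mmap_for_io(vmf->vma, vmf->flags);

	/*
	 * Do the expensive filesystem work.  If fpin is set the mmap_sem is
	 * no longer held, so the callback must not look at vmf->vma (hence
	 * the vmf->fpin discussion above).
	 */
	ret = page_mkwrite_cb(vmf);

	if (fpin) {
		if (ret & VM_FAULT_LOCKED) {
			/* Stash the prepared page and ask for a retry. */
			unlock_page(vmf->page);
			get_page(vmf->page);
			vmf->cached_page = vmf->page;
			ret = VM_FAULT_RETRY;
		} else {
			/* Failure: re-take the mmap_sem so the fault
			 * path's unlock stays balanced. */
			down_read(&mm->mmap_sem);
		}
		fput(fpin);
	}
	return ret;
}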

> vmf->fpin and vmf->mm, and then fix up all the mkwrite()'s to use that instead

Ignoring DAX (because I haven't checked), what filesystem
page_mkwrite implementation requires access to the mm_struct? They
should all just be doing file operations on a page, not anything to
do with the way the page is mapped into the task memory....

> of vmf->vma.  I think that's the only things we care about so it wouldn't be
> hard.  Does that sound reasonable to you?

I think with a little bit of tweaking it will be easy for
everyone to opt in to this behaviour....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/7] btrfs: drop mmap_sem in mkwrite for btrfs
  2018-10-18 20:23 ` [PATCH 7/7] btrfs: drop mmap_sem in mkwrite for btrfs Josef Bacik
  2018-10-19  3:48   ` Dave Chinner
@ 2018-10-25 13:22   ` Jan Kara
  2018-10-25 13:58     ` Josef Bacik
  1 sibling, 1 reply; 19+ messages in thread
From: Jan Kara @ 2018-10-25 13:22 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-btrfs, riel, linux-mm

On Thu 18-10-18 16:23:18, Josef Bacik wrote:
> ->page_mkwrite is extremely expensive in btrfs.  We have to reserve
> space, which can take 6 lifetimes, and we could possibly have to wait on
> writeback on the page, another several lifetimes.  To avoid this simply
> drop the mmap_sem if we didn't have the cached page and do all of our
> work and return the appropriate retry error.  If we have the cached page
> we know we did all the right things to set this page up and we can just
> carry on.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
...
> @@ -8828,6 +8830,29 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>  
>  	reserved_space = PAGE_SIZE;
>  
> +	/*
> +	 * We have our cached page from a previous mkwrite, check it to make
> +	 * sure it's still dirty and our file size matches when we ran mkwrite
> +	 * the last time.  If everything is OK then return VM_FAULT_LOCKED,
> +	 * otherwise do the mkwrite again.
> +	 */
> +	if (vmf->flags & FAULT_FLAG_USED_CACHED) {
> +		lock_page(page);
> +		if (vmf->cached_size == i_size_read(inode) &&
> +		    PageDirty(page))
> +			return VM_FAULT_LOCKED;
> +		unlock_page(page);
> +	}

I guess this is similar to Dave's comment: Why is i_size so special? What
makes sure that the file didn't get modified between the time you prepared
cached_page and now, such that you need to do the preparation again?
And if indeed metadata prepared for a page cannot change, what's so special
about it being that particular cached_page?

Maybe to phrase my objections differently: Your preparations in
btrfs_page_mkwrite() are obviously related to your filesystem metadata. So
why cannot you infer from that metadata (extent tree, whatever - I'd use
extent status tree in ext4) whether that particular file+offset is already
prepared for writing and just bail out with success in that case?
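
I.e. something roughly like this in btrfs_page_mkwrite(), where
btrfs_range_has_delalloc_reservation() is an invented name standing in
for whatever extent tree / delalloc query btrfs already has:

	/*
	 * Hypothetical early-out based on existing metadata instead of a
	 * cached page: if the range already carries a delalloc reservation
	 * and the page is still dirty, the expensive preparation was done
	 * on a previous pass and there is nothing left to reserve.
	 */
	lock_page(page);
	if (PageDirty(page) &&
	    btrfs_range_has_delalloc_reservation(BTRFS_I(inode),
						 page_start, page_end))
		return VM_FAULT_LOCKED;
	unlock_page(page);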

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/7] btrfs: drop mmap_sem in mkwrite for btrfs
  2018-10-25 13:22   ` Jan Kara
@ 2018-10-25 13:58     ` Josef Bacik
  2018-10-26  9:47       ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2018-10-25 13:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Josef Bacik, kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-btrfs, riel, linux-mm

On Thu, Oct 25, 2018 at 03:22:30PM +0200, Jan Kara wrote:
> On Thu 18-10-18 16:23:18, Josef Bacik wrote:
> > ->page_mkwrite is extremely expensive in btrfs.  We have to reserve
> > space, which can take 6 lifetimes, and we could possibly have to wait on
> > writeback on the page, another several lifetimes.  To avoid this simply
> > drop the mmap_sem if we didn't have the cached page and do all of our
> > work and return the appropriate retry error.  If we have the cached page
> > we know we did all the right things to set this page up and we can just
> > carry on.
> > 
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ...
> > @@ -8828,6 +8830,29 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
> >  
> >  	reserved_space = PAGE_SIZE;
> >  
> > +	/*
> > +	 * We have our cached page from a previous mkwrite, check it to make
> > +	 * sure it's still dirty and our file size matches when we ran mkwrite
> > +	 * the last time.  If everything is OK then return VM_FAULT_LOCKED,
> > +	 * otherwise do the mkwrite again.
> > +	 */
> > +	if (vmf->flags & FAULT_FLAG_USED_CACHED) {
> > +		lock_page(page);
> > +		if (vmf->cached_size == i_size_read(inode) &&
> > +		    PageDirty(page))
> > +			return VM_FAULT_LOCKED;
> > +		unlock_page(page);
> > +	}
> 
> I guess this is similar to Dave's comment: Why is i_size so special? What
> makes sure that file didn't get modified between time you've prepared
> cached_page and now such that you need to do the preparation again?
> And if indeed metadata prepared for a page cannot change, what's so special
> about it being that particular cached_page?
> 
> Maybe to phrase my objections differently: Your preparations in
> btrfs_page_mkwrite() are obviously related to your filesystem metadata. So
> why cannot you infer from that metadata (extent tree, whatever - I'd use
> extent status tree in ext4) whether that particular file+offset is already
> prepared for writing and just bail out with success in that case?
> 

I was just being overly paranoid.  I was afraid of the case where we
truncate and then extend in between, but now that I actually think about it,
that would end up with the page no longer being on the mapping, so we would catch
that case.  I've dropped this part from my current version.  I'm getting some
testing on these patches in production and I'll post them sometime next week
once I'm happy with them.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/7] btrfs: drop mmap_sem in mkwrite for btrfs
  2018-10-25 13:58     ` Josef Bacik
@ 2018-10-26  9:47       ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2018-10-26  9:47 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, kernel-team, hannes, linux-kernel, tj, david, akpm,
	linux-fsdevel, linux-btrfs, riel, linux-mm

On Thu 25-10-18 09:58:51, Josef Bacik wrote:
> On Thu, Oct 25, 2018 at 03:22:30PM +0200, Jan Kara wrote:
> > On Thu 18-10-18 16:23:18, Josef Bacik wrote:
> > > ->page_mkwrite is extremely expensive in btrfs.  We have to reserve
> > > space, which can take 6 lifetimes, and we could possibly have to wait on
> > > writeback on the page, another several lifetimes.  To avoid this simply
> > > drop the mmap_sem if we didn't have the cached page and do all of our
> > > work and return the appropriate retry error.  If we have the cached page
> > > we know we did all the right things to set this page up and we can just
> > > carry on.
> > > 
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ...
> > > @@ -8828,6 +8830,29 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
> > >  
> > >  	reserved_space = PAGE_SIZE;
> > >  
> > > +	/*
> > > +	 * We have our cached page from a previous mkwrite, check it to make
> > > +	 * sure it's still dirty and our file size matches when we ran mkwrite
> > > +	 * the last time.  If everything is OK then return VM_FAULT_LOCKED,
> > > +	 * otherwise do the mkwrite again.
> > > +	 */
> > > +	if (vmf->flags & FAULT_FLAG_USED_CACHED) {
> > > +		lock_page(page);
> > > +		if (vmf->cached_size == i_size_read(inode) &&
> > > +		    PageDirty(page))
> > > +			return VM_FAULT_LOCKED;
> > > +		unlock_page(page);
> > > +	}
> > 
> > I guess this is similar to Dave's comment: Why is i_size so special? What
> > makes sure that file didn't get modified between time you've prepared
> > cached_page and now such that you need to do the preparation again?
> > And if indeed metadata prepared for a page cannot change, what's so special
> > about it being that particular cached_page?
> > 
> > Maybe to phrase my objections differently: Your preparations in
> > btrfs_page_mkwrite() are obviously related to your filesystem metadata. So
> > why cannot you infer from that metadata (extent tree, whatever - I'd use
> > extent status tree in ext4) whether that particular file+offset is already
> > prepared for writing and just bail out with success in that case?
> > 
> 
> I was just being overly paranoid, I was afraid of the case where we would
> truncate and then extend in between, but now that I actually think about it that
> would end up with the page not being on the mapping anymore so we would catch
> that case.  I've dropped this part from my current version.  I'm getting some
> testing on these patches in production and I'll post them sometime next week
> once I'm happy with them.  Thanks,

OK, but do you still need the vmf->cached_page stuff? Because I don't see
why even that is necessary...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-10-26 11:05 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-18 20:23 [PATCH 0/7][V3] drop the mmap_sem when doing IO in the fault path Josef Bacik
2018-10-18 20:23 ` [PATCH 1/7] mm: infrastructure for page fault page caching Josef Bacik
2018-10-18 20:23 ` [PATCH 2/7] mm: drop mmap_sem for page cache read IO submission Josef Bacik
2018-10-19  3:14   ` Dave Chinner
2018-10-18 20:23 ` [PATCH 3/7] mm: drop the mmap_sem in all read fault cases Josef Bacik
2018-10-19  3:21   ` Dave Chinner
2018-10-18 20:23 ` [PATCH 4/7] mm: use the cached page for filemap_fault Josef Bacik
2018-10-19  3:27   ` Dave Chinner
2018-10-18 20:23 ` [PATCH 5/7] mm: add a flag to indicate we used a cached page Josef Bacik
2018-10-19  3:34   ` Dave Chinner
2018-10-18 20:23 ` [PATCH 6/7] mm: allow ->page_mkwrite to do retries Josef Bacik
2018-10-19  3:36   ` Dave Chinner
2018-10-18 20:23 ` [PATCH 7/7] btrfs: drop mmap_sem in mkwrite for btrfs Josef Bacik
2018-10-19  3:48   ` Dave Chinner
2018-10-22 17:56     ` Josef Bacik
2018-10-22 21:31       ` Dave Chinner
2018-10-25 13:22   ` Jan Kara
2018-10-25 13:58     ` Josef Bacik
2018-10-26  9:47       ` Jan Kara
