linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
@ 2023-05-01 17:50 Suren Baghdasaryan
  2023-05-01 17:50 ` [PATCH 2/3] mm: drop VMA lock before waiting for migration Suren Baghdasaryan
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Suren Baghdasaryan @ 2023-05-01 17:50 UTC (permalink / raw)
  To: akpm
  Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, surenb, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

When a page fault is handled under VMA lock protection, all swap page
faults are currently retried with mmap_lock because the
folio_lock_or_retry() implementation has to drop and reacquire mmap_lock
if the folio could not be locked immediately.
Instead of retrying all swap page faults, retry only when folio locking
fails.
Drivers implementing ops->migrate_to_ram might still rely on mmap_lock,
therefore fall back to mmap_lock in that case.
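
For context, the mmap_lock fallback that such a retry triggers lives in
the per-architecture fault handlers. A condensed sketch, modeled on the
existing x86 do_user_addr_fault() flow (not part of this patch; other
architectures follow the same pattern):

	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
	vma_end_read(vma);
	if (!(fault & VM_FAULT_RETRY)) {
		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
		goto done;		/* handled entirely under the VMA lock */
	}
	count_vm_vma_lock_event(VMA_LOCK_RETRY);
	/* fall through: retake the fault with mmap_lock held */
lock_mmap:
	...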

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/filemap.c |  6 ++++++
 mm/memory.c  | 14 +++++++++-----
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index a34abfe8c654..84f39114d4de 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1706,6 +1706,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
  *     mmap_lock has been released (mmap_read_unlock(), unless flags had both
  *     FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
  *     which case mmap_lock is still held.
+ *     If flags had FAULT_FLAG_VMA_LOCK set, meaning the operation is performed
+ *     with VMA lock only, the VMA lock is still held.
  *
  * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
  * with the folio locked and the mmap_lock unperturbed.
@@ -1713,6 +1715,10 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
 bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
 			 unsigned int flags)
 {
+	/* Can't do this if not holding mmap_lock */
+	if (flags & FAULT_FLAG_VMA_LOCK)
+		return false;
+
 	if (fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_lock is not released
diff --git a/mm/memory.c b/mm/memory.c
index f69fbc251198..41f45819a923 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!pte_unmap_same(vmf))
 		goto out;
 
-	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-		ret = VM_FAULT_RETRY;
-		goto out;
-	}
-
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
@@ -3725,6 +3720,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = remove_device_exclusive_entry(vmf);
 		} else if (is_device_private_entry(entry)) {
+			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+				/*
+				 * migrate_to_ram is not yet ready to operate
+				 * under VMA lock.
+				 */
+				ret |= VM_FAULT_RETRY;
+				goto out;
+			}
+
 			vmf->page = pfn_swap_entry_to_page(entry);
 			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 					vmf->address, &vmf->ptl);
-- 
2.40.1.495.gc816e09b53d-goog


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/3] mm: drop VMA lock before waiting for migration
  2023-05-01 17:50 [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended Suren Baghdasaryan
@ 2023-05-01 17:50 ` Suren Baghdasaryan
  2023-05-02 13:21   ` Alistair Popple
  2023-05-02 14:28   ` Matthew Wilcox
  2023-05-01 17:50 ` [PATCH 3/3] mm: implement folio wait under VMA lock Suren Baghdasaryan
  2023-05-02  2:02 ` [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended Matthew Wilcox
  2 siblings, 2 replies; 24+ messages in thread
From: Suren Baghdasaryan @ 2023-05-01 17:50 UTC (permalink / raw)
  To: akpm
  Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, surenb, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

migration_entry_wait() does not need the VMA lock, therefore the lock can
be dropped before waiting. Introduce VM_FAULT_VMA_UNLOCKED to indicate
that the VMA lock was dropped while in handle_mm_fault().
Note that once the VMA lock is dropped, the VMA reference can't be used
as there are no guarantees it was not freed.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/arm64/mm/fault.c    |  3 ++-
 arch/powerpc/mm/fault.c  |  3 ++-
 arch/s390/mm/fault.c     |  3 ++-
 arch/x86/mm/fault.c      |  3 ++-
 include/linux/mm_types.h |  6 +++++-
 mm/memory.c              | 12 ++++++++++--
 6 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 9e0db5c387e3..8fa281f49d61 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -602,7 +602,8 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	}
 	fault = handle_mm_fault(vma, addr & PAGE_MASK,
 				mm_flags | FAULT_FLAG_VMA_LOCK, regs);
-	vma_end_read(vma);
+	if (!(fault & VM_FAULT_VMA_UNLOCKED))
+		vma_end_read(vma);
 
 	if (!(fault & VM_FAULT_RETRY)) {
 		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 531177a4ee08..b27730f07141 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -494,7 +494,8 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 	}
 
 	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
-	vma_end_read(vma);
+	if (!(fault & VM_FAULT_VMA_UNLOCKED))
+		vma_end_read(vma);
 
 	if (!(fault & VM_FAULT_RETRY)) {
 		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index b65144c392b0..cc923dbb0821 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -418,7 +418,8 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 		goto lock_mmap;
 	}
 	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
-	vma_end_read(vma);
+	if (!(fault & VM_FAULT_VMA_UNLOCKED))
+		vma_end_read(vma);
 	if (!(fault & VM_FAULT_RETRY)) {
 		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
 		goto out;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e4399983c50c..ef62ab2fd211 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1347,7 +1347,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 		goto lock_mmap;
 	}
 	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
-	vma_end_read(vma);
+	if (!(fault & VM_FAULT_VMA_UNLOCKED))
+		vma_end_read(vma);
 
 	if (!(fault & VM_FAULT_RETRY)) {
 		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..b3b57c6da0e1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1030,6 +1030,7 @@ typedef __bitwise unsigned int vm_fault_t;
  *				fsync() to complete (for synchronous page faults
  *				in DAX)
  * @VM_FAULT_COMPLETED:		->fault completed, meanwhile mmap lock released
+ * @VM_FAULT_VMA_UNLOCKED:	VMA lock was released
  * @VM_FAULT_HINDEX_MASK:	mask HINDEX value
  *
  */
@@ -1047,6 +1048,7 @@ enum vm_fault_reason {
 	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
 	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
 	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
+	VM_FAULT_VMA_UNLOCKED   = (__force vm_fault_t)0x008000,
 	VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
 };
 
@@ -1070,7 +1072,9 @@ enum vm_fault_reason {
 	{ VM_FAULT_RETRY,               "RETRY" },	\
 	{ VM_FAULT_FALLBACK,            "FALLBACK" },	\
 	{ VM_FAULT_DONE_COW,            "DONE_COW" },	\
-	{ VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" }
+	{ VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" },	\
+	{ VM_FAULT_COMPLETED,           "COMPLETED" },	\
+	{ VM_FAULT_VMA_UNLOCKED,        "VMA_UNLOCKED" }
 
 struct vm_special_mapping {
 	const char *name;	/* The name, e.g. "[vdso]". */
diff --git a/mm/memory.c b/mm/memory.c
index 41f45819a923..8222acf74fd3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3714,8 +3714,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
-			migration_entry_wait(vma->vm_mm, vmf->pmd,
-					     vmf->address);
+			/* Save mm in case VMA lock is dropped */
+			struct mm_struct *mm = vma->vm_mm;
+
+			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+				/* No need to hold VMA lock for migration */
+				vma_end_read(vma);
+				/* CAUTION! VMA can't be used after this */
+				ret |= VM_FAULT_VMA_UNLOCKED;
+			}
+			migration_entry_wait(mm, vmf->pmd, vmf->address);
 		} else if (is_device_exclusive_entry(entry)) {
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = remove_device_exclusive_entry(vmf);
-- 
2.40.1.495.gc816e09b53d-goog


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/3] mm: implement folio wait under VMA lock
  2023-05-01 17:50 [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended Suren Baghdasaryan
  2023-05-01 17:50 ` [PATCH 2/3] mm: drop VMA lock before waiting for migration Suren Baghdasaryan
@ 2023-05-01 17:50 ` Suren Baghdasaryan
  2023-05-02  2:02 ` [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended Matthew Wilcox
  2 siblings, 0 replies; 24+ messages in thread
From: Suren Baghdasaryan @ 2023-05-01 17:50 UTC (permalink / raw)
  To: akpm
  Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, surenb, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

Follow the same pattern as with mmap_lock when waiting for a folio: drop
the VMA lock before the wait and retry once the folio is available.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/pagemap.h | 14 ++++++++++----
 mm/filemap.c            | 43 ++++++++++++++++++++++-------------------
 mm/memory.c             | 13 +++++++++----
 3 files changed, 42 insertions(+), 28 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a56308a9d1a4..6c9493314c21 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -896,8 +896,8 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
 
 void __folio_lock(struct folio *folio);
 int __folio_lock_killable(struct folio *folio);
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
-				unsigned int flags);
+bool __folio_lock_or_retry(struct folio *folio, struct vm_area_struct *vma,
+			   unsigned int flags, bool *lock_dropped);
 void unlock_page(struct page *page);
 void folio_unlock(struct folio *folio);
 
@@ -1002,10 +1002,16 @@ static inline int folio_lock_killable(struct folio *folio)
  * __folio_lock_or_retry().
  */
 static inline bool folio_lock_or_retry(struct folio *folio,
-		struct mm_struct *mm, unsigned int flags)
+		struct vm_area_struct *vma, unsigned int flags,
+		bool *lock_dropped)
 {
 	might_sleep();
-	return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags);
+	if (folio_trylock(folio)) {
+		*lock_dropped = false;
+		return true;
+	}
+
+	return __folio_lock_or_retry(folio, vma, flags, lock_dropped);
 }
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 84f39114d4de..9c0fa8578b2f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1701,37 +1701,35 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
 
 /*
  * Return values:
- * true - folio is locked; mmap_lock is still held.
+ * true - folio is locked.
  * false - folio is not locked.
- *     mmap_lock has been released (mmap_read_unlock(), unless flags had both
- *     FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
- *     which case mmap_lock is still held.
- *     If flags had FAULT_FLAG_VMA_LOCK set, meaning the operation is performed
- *     with VMA lock only, the VMA lock is still held.
+ *
+ * lock_dropped indicates whether mmap_lock/VMA lock got dropped.
+ *     mmap_lock/VMA lock is dropped when function fails to lock the folio,
+ *     unless flags had both FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT
+ *     set, in which case mmap_lock/VMA lock is still held.
  *
  * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
- * with the folio locked and the mmap_lock unperturbed.
+ * with the folio locked and the mmap_lock/VMA lock unperturbed.
  */
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
-			 unsigned int flags)
+bool __folio_lock_or_retry(struct folio *folio, struct vm_area_struct *vma,
+			 unsigned int flags, bool *lock_dropped)
 {
-	/* Can't do this if not holding mmap_lock */
-	if (flags & FAULT_FLAG_VMA_LOCK)
-		return false;
-
 	if (fault_flag_allow_retry_first(flags)) {
-		/*
-		 * CAUTION! In this case, mmap_lock is not released
-		 * even though return 0.
-		 */
-		if (flags & FAULT_FLAG_RETRY_NOWAIT)
+		if (flags & FAULT_FLAG_RETRY_NOWAIT) {
+			*lock_dropped = false;
 			return false;
+		}
 
-		mmap_read_unlock(mm);
+		if (flags & FAULT_FLAG_VMA_LOCK)
+			vma_end_read(vma);
+		else
+			mmap_read_unlock(vma->vm_mm);
 		if (flags & FAULT_FLAG_KILLABLE)
 			folio_wait_locked_killable(folio);
 		else
 			folio_wait_locked(folio);
+		*lock_dropped = true;
 		return false;
 	}
 	if (flags & FAULT_FLAG_KILLABLE) {
@@ -1739,13 +1737,18 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
 
 		ret = __folio_lock_killable(folio);
 		if (ret) {
-			mmap_read_unlock(mm);
+			if (flags & FAULT_FLAG_VMA_LOCK)
+				vma_end_read(vma);
+			else
+				mmap_read_unlock(vma->vm_mm);
+			*lock_dropped = true;
 			return false;
 		}
 	} else {
 		__folio_lock(folio);
 	}
 
+	*lock_dropped = false;
 	return true;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 8222acf74fd3..e1cd39f00756 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3568,6 +3568,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	struct folio *folio = page_folio(vmf->page);
 	struct vm_area_struct *vma = vmf->vma;
 	struct mmu_notifier_range range;
+	bool lock_dropped;
 
 	/*
 	 * We need a reference to lock the folio because we don't hold
@@ -3580,8 +3581,10 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	if (!folio_try_get(folio))
 		return 0;
 
-	if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
+	if (!folio_lock_or_retry(folio, vma, vmf->flags, &lock_dropped)) {
 		folio_put(folio);
+		if (lock_dropped && vmf->flags & FAULT_FLAG_VMA_LOCK)
+			return VM_FAULT_VMA_UNLOCKED | VM_FAULT_RETRY;
 		return VM_FAULT_RETRY;
 	}
 	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
@@ -3704,7 +3707,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	bool exclusive = false;
 	swp_entry_t entry;
 	pte_t pte;
-	int locked;
+	bool locked;
+	bool lock_dropped;
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
 
@@ -3837,9 +3841,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_release;
 	}
 
-	locked = folio_lock_or_retry(folio, vma->vm_mm, vmf->flags);
-
+	locked = folio_lock_or_retry(folio, vma, vmf->flags, &lock_dropped);
 	if (!locked) {
+		if (lock_dropped && vmf->flags & FAULT_FLAG_VMA_LOCK)
+			ret |= VM_FAULT_VMA_UNLOCKED;
 		ret |= VM_FAULT_RETRY;
 		goto out_release;
 	}
-- 
2.40.1.495.gc816e09b53d-goog


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-01 17:50 [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended Suren Baghdasaryan
  2023-05-01 17:50 ` [PATCH 2/3] mm: drop VMA lock before waiting for migration Suren Baghdasaryan
  2023-05-01 17:50 ` [PATCH 3/3] mm: implement folio wait under VMA lock Suren Baghdasaryan
@ 2023-05-02  2:02 ` Matthew Wilcox
  2023-05-02  2:30   ` Suren Baghdasaryan
  2 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2023-05-02  2:02 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> +++ b/mm/memory.c
> @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (!pte_unmap_same(vmf))
>  		goto out;
>  
> -	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> -		ret = VM_FAULT_RETRY;
> -		goto out;
> -	}
> -
>  	entry = pte_to_swp_entry(vmf->orig_pte);
>  	if (unlikely(non_swap_entry(entry))) {
>  		if (is_migration_entry(entry)) {

You're missing the necessary fallback in the (!folio) case.
swap_readpage() is synchronous and will sleep.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-02  2:02 ` [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended Matthew Wilcox
@ 2023-05-02  2:30   ` Suren Baghdasaryan
  2023-05-02  3:22     ` Matthew Wilcox
  0 siblings, 1 reply; 24+ messages in thread
From: Suren Baghdasaryan @ 2023-05-02  2:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > +++ b/mm/memory.c
> > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       if (!pte_unmap_same(vmf))
> >               goto out;
> >
> > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > -             ret = VM_FAULT_RETRY;
> > -             goto out;
> > -     }
> > -
> >       entry = pte_to_swp_entry(vmf->orig_pte);
> >       if (unlikely(non_swap_entry(entry))) {
> >               if (is_migration_entry(entry)) {
>
> You're missing the necessary fallback in the (!folio) case.
> swap_readpage() is synchronous and will sleep.

True, but is it unsafe to do that under the VMA lock, such that it has
to be done under mmap_lock?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-02  2:30   ` Suren Baghdasaryan
@ 2023-05-02  3:22     ` Matthew Wilcox
  2023-05-02  5:04       ` Suren Baghdasaryan
  0 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2023-05-02  3:22 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > +++ b/mm/memory.c
> > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >       if (!pte_unmap_same(vmf))
> > >               goto out;
> > >
> > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > -             ret = VM_FAULT_RETRY;
> > > -             goto out;
> > > -     }
> > > -
> > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > >       if (unlikely(non_swap_entry(entry))) {
> > >               if (is_migration_entry(entry)) {
> >
> > You're missing the necessary fallback in the (!folio) case.
> > swap_readpage() is synchronous and will sleep.
> 
> True, but is it unsafe to do that under VMA lock and has to be done
> under mmap_lock?

... you were the one arguing that we didn't want to wait for I/O with
the VMA lock held?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-02  3:22     ` Matthew Wilcox
@ 2023-05-02  5:04       ` Suren Baghdasaryan
  2023-05-02 15:03         ` Matthew Wilcox
  0 siblings, 1 reply; 24+ messages in thread
From: Suren Baghdasaryan @ 2023-05-02  5:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > +++ b/mm/memory.c
> > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > >       if (!pte_unmap_same(vmf))
> > > >               goto out;
> > > >
> > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > -             ret = VM_FAULT_RETRY;
> > > > -             goto out;
> > > > -     }
> > > > -
> > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > >       if (unlikely(non_swap_entry(entry))) {
> > > >               if (is_migration_entry(entry)) {
> > >
> > > You're missing the necessary fallback in the (!folio) case.
> > > swap_readpage() is synchronous and will sleep.
> >
> > True, but is it unsafe to do that under VMA lock and has to be done
> > under mmap_lock?
>
> ... you were the one arguing that we didn't want to wait for I/O with
> the VMA lock held?

Well, that discussion was about waiting in folio_lock_or_retry() with
the lock being held. I argued against it because currently we drop
mmap_lock before waiting, so if we don't drop VMA lock we would
be changing the current behavior which might introduce new
regressions. In the case of swap_readpage and swapin_readahead we
already wait with mmap_lock held, so waiting with VMA lock held does
not introduce new problems (unless there is a need to hold mmap_lock).

That said, you are absolutely correct that this situation can be
improved by dropping the lock in these cases too. I just didn't want
to attack everything at once. I believe after we agree on the approach
implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
for dropping the VMA lock before waiting, these cases can be added
more easily. Does that make sense?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] mm: drop VMA lock before waiting for migration
  2023-05-01 17:50 ` [PATCH 2/3] mm: drop VMA lock before waiting for migration Suren Baghdasaryan
@ 2023-05-02 13:21   ` Alistair Popple
  2023-05-02 16:39     ` Suren Baghdasaryan
  2023-05-02 14:28   ` Matthew Wilcox
  1 sibling, 1 reply; 24+ messages in thread
From: Alistair Popple @ 2023-05-02 13:21 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, linux-mm, linux-fsdevel,
	linux-kernel, kernel-team


Suren Baghdasaryan <surenb@google.com> writes:

[...]

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 306a3d1a0fa6..b3b57c6da0e1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1030,6 +1030,7 @@ typedef __bitwise unsigned int vm_fault_t;
>   *				fsync() to complete (for synchronous page faults
>   *				in DAX)
>   * @VM_FAULT_COMPLETED:		->fault completed, meanwhile mmap lock released
> + * @VM_FAULT_VMA_UNLOCKED:	VMA lock was released

A note here saying vmf->vma should no longer be accessed would be nice.

>   * @VM_FAULT_HINDEX_MASK:	mask HINDEX value
>   *
>   */
> @@ -1047,6 +1048,7 @@ enum vm_fault_reason {
>  	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
>  	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
>  	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
> +	VM_FAULT_VMA_UNLOCKED   = (__force vm_fault_t)0x008000,
>  	VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
>  };
>  
> @@ -1070,7 +1072,9 @@ enum vm_fault_reason {
>  	{ VM_FAULT_RETRY,               "RETRY" },	\
>  	{ VM_FAULT_FALLBACK,            "FALLBACK" },	\
>  	{ VM_FAULT_DONE_COW,            "DONE_COW" },	\
> -	{ VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" }
> +	{ VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" },	\
> +	{ VM_FAULT_COMPLETED,           "COMPLETED" },	\

VM_FAULT_COMPLETED isn't used in this patch, guessing that's snuck in
from one of the other patches in the series?

> +	{ VM_FAULT_VMA_UNLOCKED,        "VMA_UNLOCKED" }
>  
>  struct vm_special_mapping {
>  	const char *name;	/* The name, e.g. "[vdso]". */
> diff --git a/mm/memory.c b/mm/memory.c
> index 41f45819a923..8222acf74fd3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3714,8 +3714,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	entry = pte_to_swp_entry(vmf->orig_pte);
>  	if (unlikely(non_swap_entry(entry))) {
>  		if (is_migration_entry(entry)) {
> -			migration_entry_wait(vma->vm_mm, vmf->pmd,
> -					     vmf->address);
> +			/* Save mm in case VMA lock is dropped */
> +			struct mm_struct *mm = vma->vm_mm;
> +
> +			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> +				/* No need to hold VMA lock for migration */
> +				vma_end_read(vma);
> +				/* CAUTION! VMA can't be used after this */
> +				ret |= VM_FAULT_VMA_UNLOCKED;
> +			}
> +			migration_entry_wait(mm, vmf->pmd, vmf->address);
>  		} else if (is_device_exclusive_entry(entry)) {
>  			vmf->page = pfn_swap_entry_to_page(entry);
>  			ret = remove_device_exclusive_entry(vmf);


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] mm: drop VMA lock before waiting for migration
  2023-05-01 17:50 ` [PATCH 2/3] mm: drop VMA lock before waiting for migration Suren Baghdasaryan
  2023-05-02 13:21   ` Alistair Popple
@ 2023-05-02 14:28   ` Matthew Wilcox
  2023-05-02 16:41     ` Suren Baghdasaryan
  1 sibling, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2023-05-02 14:28 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Mon, May 01, 2023 at 10:50:24AM -0700, Suren Baghdasaryan wrote:
> migration_entry_wait does not need VMA lock, therefore it can be dropped
> before waiting. Introduce VM_FAULT_VMA_UNLOCKED to indicate that VMA
> lock was dropped while in handle_mm_fault().
> Note that once VMA lock is dropped, the VMA reference can't be used as
> there are no guarantees it was not freed.

How about we introduce:

void vmf_end_read(struct vm_fault *vmf)
{
	if (!vmf->vma)
		return;
	vma_end_read(vmf->vma);
	vmf->vma = NULL;
}

Now we don't need a new flag, and calling vmf_end_read() is idempotent.

Oh, argh, we create the vmf too late.  We really need to hoist the
creation of vm_fault to the callers of handle_mm_fault().
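
Setting the hoisting problem aside, a hypothetical call site in
do_swap_page() using such a helper (a sketch only, assuming the
vmf_end_read() above) could look like:

	if (is_migration_entry(entry)) {
		/* mm stays valid even after the VMA is unlocked */
		struct mm_struct *mm = vmf->vma->vm_mm;

		if (vmf->flags & FAULT_FLAG_VMA_LOCK)
			vmf_end_read(vmf);	/* clears vmf->vma; later calls are no-ops */
		migration_entry_wait(mm, vmf->pmd, vmf->address);
	}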


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-02  5:04       ` Suren Baghdasaryan
@ 2023-05-02 15:03         ` Matthew Wilcox
  2023-05-02 16:36           ` Suren Baghdasaryan
  0 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2023-05-02 15:03 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > +++ b/mm/memory.c
> > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > >       if (!pte_unmap_same(vmf))
> > > > >               goto out;
> > > > >
> > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > -             ret = VM_FAULT_RETRY;
> > > > > -             goto out;
> > > > > -     }
> > > > > -
> > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > >               if (is_migration_entry(entry)) {
> > > >
> > > > You're missing the necessary fallback in the (!folio) case.
> > > > swap_readpage() is synchronous and will sleep.
> > >
> > > True, but is it unsafe to do that under VMA lock and has to be done
> > > under mmap_lock?
> >
> > ... you were the one arguing that we didn't want to wait for I/O with
> > the VMA lock held?
> 
> Well, that discussion was about waiting in folio_lock_or_retry() with
> the lock being held. I argued against it because currently we drop
> mmap_lock lock before waiting, so if we don't drop VMA lock we would
> be changing the current behavior which might introduce new
> regressions. In the case of swap_readpage and swapin_readahead we
> already wait with mmap_lock held, so waiting with VMA lock held does
> not introduce new problems (unless there is a need to hold mmap_lock).
> 
> That said, you are absolutely correct that this situation can be
> improved by dropping the lock in these cases too. I just didn't want
> to attack everything at once. I believe after we agree on the approach
> implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> for dropping the VMA lock before waiting, these cases can be added
> easier. Does that make sense?

OK, I looked at this path some more, and I think we're fine.  This
patch is only called for SWP_SYNCHRONOUS_IO which is only set for
QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
(both btt and pmem).  So the answer is that we don't sleep in this
path, and there's no need to drop the lock.
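
For reference, the path in question looks roughly like this (abridged
from do_swap_page() in mm/memory.c around this time; error handling and
details omitted):

	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	    __swap_count(entry) == 1) {
		/* skip swapcache: allocate a folio and read into it directly */
		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
					vma, vmf->address, false);
		...
		swap_readpage(&folio->page, true, NULL);	/* synchronous */
	} else {
		/* readahead path: reads are submitted asynchronously */
		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
		...
	}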

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-02 15:03         ` Matthew Wilcox
@ 2023-05-02 16:36           ` Suren Baghdasaryan
  2023-05-02 22:31             ` Matthew Wilcox
  0 siblings, 1 reply; 24+ messages in thread
From: Suren Baghdasaryan @ 2023-05-02 16:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > >       if (!pte_unmap_same(vmf))
> > > > > >               goto out;
> > > > > >
> > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > -             goto out;
> > > > > > -     }
> > > > > > -
> > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > >               if (is_migration_entry(entry)) {
> > > > >
> > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > swap_readpage() is synchronous and will sleep.
> > > >
> > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > under mmap_lock?
> > >
> > > ... you were the one arguing that we didn't want to wait for I/O with
> > > the VMA lock held?
> >
> > Well, that discussion was about waiting in folio_lock_or_retry() with
> > the lock being held. I argued against it because currently we drop
> > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > be changing the current behavior which might introduce new
> > regressions. In the case of swap_readpage and swapin_readahead we
> > already wait with mmap_lock held, so waiting with VMA lock held does
> > not introduce new problems (unless there is a need to hold mmap_lock).
> >
> > That said, you are absolutely correct that this situation can be
> > improved by dropping the lock in these cases too. I just didn't want
> > to attack everything at once. I believe after we agree on the approach
> > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > for dropping the VMA lock before waiting, these cases can be added
> > easier. Does that make sense?
>
> OK, I looked at this path some more, and I think we're fine.  This
> patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> (both btt and pmem).  So the answer is that we don't sleep in this
> path, and there's no need to drop the lock.

Yes but swapin_readahead does sleep, so I'll have to handle that case
too after this.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] mm: drop VMA lock before waiting for migration
  2023-05-02 13:21   ` Alistair Popple
@ 2023-05-02 16:39     ` Suren Baghdasaryan
  2023-05-03 13:03       ` Alistair Popple
  0 siblings, 1 reply; 24+ messages in thread
From: Suren Baghdasaryan @ 2023-05-02 16:39 UTC (permalink / raw)
  To: Alistair Popple
  Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, linux-mm, linux-fsdevel,
	linux-kernel, kernel-team

On Tue, May 2, 2023 at 6:26 AM 'Alistair Popple' via kernel-team
<kernel-team@android.com> wrote:
>
>
> Suren Baghdasaryan <surenb@google.com> writes:
>
> [...]
>
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 306a3d1a0fa6..b3b57c6da0e1 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -1030,6 +1030,7 @@ typedef __bitwise unsigned int vm_fault_t;
> >   *                           fsync() to complete (for synchronous page faults
> >   *                           in DAX)
> >   * @VM_FAULT_COMPLETED:              ->fault completed, meanwhile mmap lock released
> > + * @VM_FAULT_VMA_UNLOCKED:   VMA lock was released
>
> A note here saying vmf->vma should no longer be accessed would be nice.

Good idea. Will add in the next version. Thanks!

>
> >   * @VM_FAULT_HINDEX_MASK:    mask HINDEX value
> >   *
> >   */
> > @@ -1047,6 +1048,7 @@ enum vm_fault_reason {
> >       VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
> >       VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
> >       VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
> > +     VM_FAULT_VMA_UNLOCKED   = (__force vm_fault_t)0x008000,
> >       VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
> >  };
> >
> > @@ -1070,7 +1072,9 @@ enum vm_fault_reason {
> >       { VM_FAULT_RETRY,               "RETRY" },      \
> >       { VM_FAULT_FALLBACK,            "FALLBACK" },   \
> >       { VM_FAULT_DONE_COW,            "DONE_COW" },   \
> > -     { VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" }
> > +     { VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" },  \
> > +     { VM_FAULT_COMPLETED,           "COMPLETED" },  \
>
> VM_FAULT_COMPLETED isn't used in this patch, guessing that's snuck in
> from one of the other patches in the series?

I noticed that an entry for VM_FAULT_COMPLETED was missing and wanted
to fix that... Should I drop that?

>
> > +     { VM_FAULT_VMA_UNLOCKED,        "VMA_UNLOCKED" }
> >
> >  struct vm_special_mapping {
> >       const char *name;       /* The name, e.g. "[vdso]". */
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 41f45819a923..8222acf74fd3 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3714,8 +3714,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       entry = pte_to_swp_entry(vmf->orig_pte);
> >       if (unlikely(non_swap_entry(entry))) {
> >               if (is_migration_entry(entry)) {
> > -                     migration_entry_wait(vma->vm_mm, vmf->pmd,
> > -                                          vmf->address);
> > +                     /* Save mm in case VMA lock is dropped */
> > +                     struct mm_struct *mm = vma->vm_mm;
> > +
> > +                     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > +                             /* No need to hold VMA lock for migration */
> > +                             vma_end_read(vma);
> > +                             /* CAUTION! VMA can't be used after this */
> > +                             ret |= VM_FAULT_VMA_UNLOCKED;
> > +                     }
> > +                     migration_entry_wait(mm, vmf->pmd, vmf->address);
> >               } else if (is_device_exclusive_entry(entry)) {
> >                       vmf->page = pfn_swap_entry_to_page(entry);
> >                       ret = remove_device_exclusive_entry(vmf);
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] mm: drop VMA lock before waiting for migration
  2023-05-02 14:28   ` Matthew Wilcox
@ 2023-05-02 16:41     ` Suren Baghdasaryan
  0 siblings, 0 replies; 24+ messages in thread
From: Suren Baghdasaryan @ 2023-05-02 16:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Tue, May 2, 2023 at 7:28 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, May 01, 2023 at 10:50:24AM -0700, Suren Baghdasaryan wrote:
> > migration_entry_wait does not need VMA lock, therefore it can be dropped
> > before waiting. Introduce VM_FAULT_VMA_UNLOCKED to indicate that VMA
> > lock was dropped while in handle_mm_fault().
> > Note that once VMA lock is dropped, the VMA reference can't be used as
> > there are no guarantees it was not freed.
>
> How about we introduce:
>
> void vmf_end_read(struct vm_fault *vmf)
> {
>         if (!vmf->vma)
>                 return;
>         vma_end_read(vmf->vma);
>         vmf->vma = NULL;
> }
>
> Now we don't need a new flag, and calling vmf_end_read() is idempotent.
>
> Oh, argh, we create the vmf too late.  We really need to hoist the
> creation of vm_fault to the callers of handle_mm_fault().

Yeah, unfortunately vmf does not propagate all the way up to
do_user_addr_fault which needs to know that we dropped the lock.

>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-02 16:36           ` Suren Baghdasaryan
@ 2023-05-02 22:31             ` Matthew Wilcox
  2023-05-02 23:04               ` Suren Baghdasaryan
  0 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2023-05-02 22:31 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > > +++ b/mm/memory.c
> > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > >       if (!pte_unmap_same(vmf))
> > > > > > >               goto out;
> > > > > > >
> > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > > -             goto out;
> > > > > > > -     }
> > > > > > > -
> > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > > >               if (is_migration_entry(entry)) {
> > > > > >
> > > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > > swap_readpage() is synchronous and will sleep.
> > > > >
> > > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > > under mmap_lock?
> > > >
> > > > ... you were the one arguing that we didn't want to wait for I/O with
> > > > the VMA lock held?
> > >
> > > Well, that discussion was about waiting in folio_lock_or_retry() with
> > > the lock being held. I argued against it because currently we drop
> > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > > be changing the current behavior which might introduce new
> > > regressions. In the case of swap_readpage and swapin_readahead we
> > > already wait with mmap_lock held, so waiting with VMA lock held does
> > > not introduce new problems (unless there is a need to hold mmap_lock).
> > >
> > > That said, you are absolutely correct that this situation can be
> > > improved by dropping the lock in these cases too. I just didn't want
> > > to attack everything at once. I believe after we agree on the approach
> > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > > for dropping the VMA lock before waiting, these cases can be added
> > > easier. Does that make sense?
> >
> > OK, I looked at this path some more, and I think we're fine.  This
> > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> > (both btt and pmem).  So the answer is that we don't sleep in this
> > path, and there's no need to drop the lock.
> 
> Yes but swapin_readahead does sleep, so I'll have to handle that case
> too after this.

Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
in swapin_readahead()?  It all looks like async I/O to me.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-02 22:31             ` Matthew Wilcox
@ 2023-05-02 23:04               ` Suren Baghdasaryan
  2023-05-02 23:40                 ` Matthew Wilcox
  2023-05-03  8:34                 ` Yosry Ahmed
  0 siblings, 2 replies; 24+ messages in thread
From: Suren Baghdasaryan @ 2023-05-02 23:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > > > +++ b/mm/memory.c
> > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > >       if (!pte_unmap_same(vmf))
> > > > > > > >               goto out;
> > > > > > > >
> > > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > > > -             goto out;
> > > > > > > > -     }
> > > > > > > > -
> > > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > > > >               if (is_migration_entry(entry)) {
> > > > > > >
> > > > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > > > swap_readpage() is synchronous and will sleep.
> > > > > >
> > > > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > > > under mmap_lock?
> > > > >
> > > > > ... you were the one arguing that we didn't want to wait for I/O with
> > > > > the VMA lock held?
> > > >
> > > > Well, that discussion was about waiting in folio_lock_or_retry() with
> > > > the lock being held. I argued against it because currently we drop
> > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > > > be changing the current behavior which might introduce new
> > > > regressions. In the case of swap_readpage and swapin_readahead we
> > > > already wait with mmap_lock held, so waiting with VMA lock held does
> > > > not introduce new problems (unless there is a need to hold mmap_lock).
> > > >
> > > > That said, you are absolutely correct that this situation can be
> > > > improved by dropping the lock in these cases too. I just didn't want
> > > > to attack everything at once. I believe after we agree on the approach
> > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > > > for dropping the VMA lock before waiting, these cases can be added
> > > > easier. Does that make sense?
> > >
> > > OK, I looked at this path some more, and I think we're fine.  This
> > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> > > (both btt and pmem).  So the answer is that we don't sleep in this
> > > path, and there's no need to drop the lock.
> >
> > Yes but swapin_readahead does sleep, so I'll have to handle that case
> > too after this.
>
> Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
> in swapin_readahead()?  It all looks like async I/O to me.

Hmm. I thought that we have synchronous I/O in the following paths:
    swapin_readahead()->swap_cluster_readahead()->swap_readpage()
    swapin_readahead()->swap_vma_readahead()->swap_readpage()
but just noticed that in both cases swap_readpage() is called with the
synchronous parameter being false. So you are probably right here...
Does that mean swapin_readahead() might return a page which does not
have its content swapped-in yet?
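
Both readahead paths seem to submit the reads asynchronously; abridged
from the swap_cluster_readahead()/swap_vma_readahead() loops in
mm/swap_state.c (details may differ):

	page = __read_swap_cache_async(entry, gfp_mask, vma, addr,
				       &page_allocated);
	if (page_allocated)
		/* async read; the page stays locked until the bio completes */
		swap_readpage(page, false, &splug);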

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-02 23:04               ` Suren Baghdasaryan
@ 2023-05-02 23:40                 ` Matthew Wilcox
  2023-05-03  1:05                   ` Suren Baghdasaryan
  2023-05-03  8:34                 ` Yosry Ahmed
  1 sibling, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2023-05-02 23:40 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Tue, May 02, 2023 at 04:04:59PM -0700, Suren Baghdasaryan wrote:
> On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > > > > +++ b/mm/memory.c
> > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > > >       if (!pte_unmap_same(vmf))
> > > > > > > > >               goto out;
> > > > > > > > >
> > > > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > > > > -             goto out;
> > > > > > > > > -     }
> > > > > > > > > -
> > > > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > > > > >               if (is_migration_entry(entry)) {
> > > > > > > >
> > > > > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > > > > swap_readpage() is synchronous and will sleep.
> > > > > > >
> > > > > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > > > > under mmap_lock?
> > > > > >
> > > > > > ... you were the one arguing that we didn't want to wait for I/O with
> > > > > > the VMA lock held?
> > > > >
> > > > > Well, that discussion was about waiting in folio_lock_or_retry() with
> > > > > the lock being held. I argued against it because currently we drop
> > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > > > > be changing the current behavior which might introduce new
> > > > > regressions. In the case of swap_readpage and swapin_readahead we
> > > > > already wait with mmap_lock held, so waiting with VMA lock held does
> > > > > not introduce new problems (unless there is a need to hold mmap_lock).
> > > > >
> > > > > That said, you are absolutely correct that this situation can be
> > > > > improved by dropping the lock in these cases too. I just didn't want
> > > > > to attack everything at once. I believe after we agree on the approach
> > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > > > > for dropping the VMA lock before waiting, these cases can be added
> > > > > easier. Does that make sense?
> > > >
> > > > OK, I looked at this path some more, and I think we're fine.  This
> > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> > > > (both btt and pmem).  So the answer is that we don't sleep in this
> > > > path, and there's no need to drop the lock.
> > >
> > > Yes but swapin_readahead does sleep, so I'll have to handle that case
> > > too after this.
> >
> > Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
> > in swapin_readahead()?  It all looks like async I/O to me.
> 
> Hmm. I thought that we have synchronous I/O in the following paths:
>     swapin_readahead()->swap_cluster_readahead()->swap_readpage()
>     swapin_readahead()->swap_vma_readahead()->swap_readpage()
> but just noticed that in both cases swap_readpage() is called with the
> synchronous parameter being false. So you are probably right here...
> Does that mean swapin_readahead() might return a page which does not
> have its content swapped-in yet?

That's my understanding.  In that case it's !uptodate and still locked.
The folio_lock_or_retry() will wait for the read to complete unless
we've told it we'd rather retry.
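
The read completion is what unlocks the folio, so waiting on the folio
lock is effectively waiting for the I/O to finish. Abridged from memory
of the swap read completion in mm/page_io.c (details may differ):

	static void end_swap_bio_read(struct bio *bio)
	{
		struct page *page = bio_first_page_all(bio);
		...
		SetPageUptodate(page);
		unlock_page(page);	/* wakes folio_lock_or_retry() waiters */
		...
	}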

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-02 23:40                 ` Matthew Wilcox
@ 2023-05-03  1:05                   ` Suren Baghdasaryan
  0 siblings, 0 replies; 24+ messages in thread
From: Suren Baghdasaryan @ 2023-05-03  1:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Tue, May 2, 2023 at 4:40 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, May 02, 2023 at 04:04:59PM -0700, Suren Baghdasaryan wrote:
> > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > > > > > +++ b/mm/memory.c
> > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > > > >       if (!pte_unmap_same(vmf))
> > > > > > > > > >               goto out;
> > > > > > > > > >
> > > > > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > > > > > -             goto out;
> > > > > > > > > > -     }
> > > > > > > > > > -
> > > > > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > > > > > >               if (is_migration_entry(entry)) {
> > > > > > > > >
> > > > > > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > > > > > swap_readpage() is synchronous and will sleep.
> > > > > > > >
> > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > > > > > under mmap_lock?
> > > > > > >
> > > > > > > ... you were the one arguing that we didn't want to wait for I/O with
> > > > > > > the VMA lock held?
> > > > > >
> > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with
> > > > > > the lock being held. I argued against it because currently we drop
> > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > > > > > be changing the current behavior which might introduce new
> > > > > > regressions. In the case of swap_readpage and swapin_readahead we
> > > > > > already wait with mmap_lock held, so waiting with VMA lock held does
> > > > > > not introduce new problems (unless there is a need to hold mmap_lock).
> > > > > >
> > > > > > That said, you are absolutely correct that this situation can be
> > > > > > improved by dropping the lock in these cases too. I just didn't want
> > > > > > to attack everything at once. I believe after we agree on the approach
> > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > > > > > for dropping the VMA lock before waiting, these cases can be added
> > > > > > easier. Does that make sense?
> > > > >
> > > > > OK, I looked at this path some more, and I think we're fine.  This
> > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> > > > > (both btt and pmem).  So the answer is that we don't sleep in this
> > > > > path, and there's no need to drop the lock.
> > > >
> > > > Yes but swapin_readahead does sleep, so I'll have to handle that case
> > > > too after this.
> > >
> > > Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
> > > in swapin_readahead()?  It all looks like async I/O to me.
> >
> > Hmm. I thought that we have synchronous I/O in the following paths:
> >     swapin_readahead()->swap_cluster_readahead()->swap_readpage()
> >     swapin_readahead()->swap_vma_readahead()->swap_readpage()
> > but just noticed that in both cases swap_readpage() is called with the
> > synchronous parameter being false. So you are probably right here...
> > Does that mean swapin_readahead() might return a page which does not
> > have its content swapped-in yet?
>
> That's my understanding.  In that case it's !uptodate and still locked.
> The folio_lock_or_retry() will wait for the read to complete unless
> we've told it we'd rather retry.

Ok, and we already drop the locks in folio_lock_or_retry() when
needed. Sounds like we cover this case with this patchset?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-02 23:04               ` Suren Baghdasaryan
  2023-05-02 23:40                 ` Matthew Wilcox
@ 2023-05-03  8:34                 ` Yosry Ahmed
  2023-05-03 19:57                   ` Suren Baghdasaryan
  1 sibling, 1 reply; 24+ messages in thread
From: Yosry Ahmed @ 2023-05-03  8:34 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Matthew Wilcox, akpm, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > > > > +++ b/mm/memory.c
> > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > > >       if (!pte_unmap_same(vmf))
> > > > > > > > >               goto out;
> > > > > > > > >
> > > > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > > > > -             goto out;
> > > > > > > > > -     }
> > > > > > > > > -
> > > > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > > > > >               if (is_migration_entry(entry)) {
> > > > > > > >
> > > > > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > > > > swap_readpage() is synchronous and will sleep.
> > > > > > >
> > > > > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > > > > under mmap_lock?
> > > > > >
> > > > > > ... you were the one arguing that we didn't want to wait for I/O with
> > > > > > the VMA lock held?
> > > > >
> > > > > Well, that discussion was about waiting in folio_lock_or_retry() with
> > > > > the lock being held. I argued against it because currently we drop
> > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > > > > be changing the current behavior which might introduce new
> > > > > regressions. In the case of swap_readpage and swapin_readahead we
> > > > > already wait with mmap_lock held, so waiting with VMA lock held does
> > > > > not introduce new problems (unless there is a need to hold mmap_lock).
> > > > >
> > > > > That said, you are absolutely correct that this situation can be
> > > > > improved by dropping the lock in these cases too. I just didn't want
> > > > > to attack everything at once. I believe after we agree on the approach
> > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > > > > for dropping the VMA lock before waiting, these cases can be added
> > > > > easier. Does that make sense?
> > > >
> > > > OK, I looked at this path some more, and I think we're fine.  This
> > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> > > > (both btt and pmem).  So the answer is that we don't sleep in this
> > > > path, and there's no need to drop the lock.
> > >
> > > Yes but swapin_readahead does sleep, so I'll have to handle that case
> > > too after this.
> >
> > Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
> > in swapin_readahead()?  It all looks like async I/O to me.
>
> Hmm. I thought that we have synchronous I/O in the following paths:
>     swapin_readahead()->swap_cluster_readahead()->swap_readpage()
>     swapin_readahead()->swap_vma_readahead()->swap_readpage()
> but just noticed that in both cases swap_readpage() is called with the
> synchronous parameter being false. So you are probably right here...

In both swap_cluster_readahead() and swap_vma_readahead() it looks
like if the readahead window is 1 (aka we are not reading ahead), then
we jump to directly calling read_swap_cache_async() passing do_poll =
true, which means we may end up calling swap_readpage() passing
synchronous = true.

I am not familiar with readahead heuristics, so I am not sure how
common this is, but it's something to think about.
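
To make that concrete, here is a trimmed sketch of the flow I mean
(illustrative only, not the exact kernel source; argument lists and error
handling are pruned, and "readahead_window" is just a stand-in for the
win/mask logic of the two callers):

	/* In swap_cluster_readahead() / swap_vma_readahead() (sketch): */
	if (readahead_window == 1)	/* not actually reading ahead */
		goto skip;
	/* ... queue asynchronous reads for the whole readahead window ... */
skip:
	return read_swap_cache_async(entry, gfp_mask, vma, addr,
				     /* do_poll = */ true /* , ... */);

	/* In read_swap_cache_async() (sketch), do_poll is forwarded: */
	if (page_was_allocated)
		swap_readpage(page, /* synchronous = */ do_poll /* , ... */);

So the no-readahead case is the one that can end up reading synchronously.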

> Does that mean swapin_readahead() might return a page which does not
> have its content swapped-in yet?
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] mm: drop VMA lock before waiting for migration
  2023-05-02 16:39     ` Suren Baghdasaryan
@ 2023-05-03 13:03       ` Alistair Popple
  2023-05-03 19:42         ` Suren Baghdasaryan
  0 siblings, 1 reply; 24+ messages in thread
From: Alistair Popple @ 2023-05-03 13:03 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, linux-mm, linux-fsdevel,
	linux-kernel, kernel-team


Suren Baghdasaryan <surenb@google.com> writes:

> On Tue, May 2, 2023 at 6:26 AM 'Alistair Popple' via kernel-team
> <kernel-team@android.com> wrote:
>>
>>
>> Suren Baghdasaryan <surenb@google.com> writes:
>>
>> [...]
>>
>> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> > index 306a3d1a0fa6..b3b57c6da0e1 100644
>> > --- a/include/linux/mm_types.h
>> > +++ b/include/linux/mm_types.h
>> > @@ -1030,6 +1030,7 @@ typedef __bitwise unsigned int vm_fault_t;
>> >   *                           fsync() to complete (for synchronous page faults
>> >   *                           in DAX)
>> >   * @VM_FAULT_COMPLETED:              ->fault completed, meanwhile mmap lock released
>> > + * @VM_FAULT_VMA_UNLOCKED:   VMA lock was released
>>
>> A note here saying vmf->vma should no longer be accessed would be nice.
>
> Good idea. Will add in the next version. Thanks!
>
>>
>> >   * @VM_FAULT_HINDEX_MASK:    mask HINDEX value
>> >   *
>> >   */
>> > @@ -1047,6 +1048,7 @@ enum vm_fault_reason {
>> >       VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
>> >       VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
>> >       VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
>> > +     VM_FAULT_VMA_UNLOCKED   = (__force vm_fault_t)0x008000,
>> >       VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
>> >  };
>> >
>> > @@ -1070,7 +1072,9 @@ enum vm_fault_reason {
>> >       { VM_FAULT_RETRY,               "RETRY" },      \
>> >       { VM_FAULT_FALLBACK,            "FALLBACK" },   \
>> >       { VM_FAULT_DONE_COW,            "DONE_COW" },   \
>> > -     { VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" }
>> > +     { VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" },  \
>> > +     { VM_FAULT_COMPLETED,           "COMPLETED" },  \
>>
>> VM_FAULT_COMPLETED isn't used in this patch, guessing that's snuck in
>> from one of the other patches in the series?
>
> I noticed that an entry for VM_FAULT_COMPLETED was missing and wanted
> to fix that... Should I drop that?

Oh ok. It would certainly be good to add but really it should be its
own patch.

>>
>> > +     { VM_FAULT_VMA_UNLOCKED,        "VMA_UNLOCKED" }
>> >
>> >  struct vm_special_mapping {
>> >       const char *name;       /* The name, e.g. "[vdso]". */
>> > diff --git a/mm/memory.c b/mm/memory.c
>> > index 41f45819a923..8222acf74fd3 100644
>> > --- a/mm/memory.c
>> > +++ b/mm/memory.c
>> > @@ -3714,8 +3714,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >       entry = pte_to_swp_entry(vmf->orig_pte);
>> >       if (unlikely(non_swap_entry(entry))) {
>> >               if (is_migration_entry(entry)) {
>> > -                     migration_entry_wait(vma->vm_mm, vmf->pmd,
>> > -                                          vmf->address);
>> > +                     /* Save mm in case VMA lock is dropped */
>> > +                     struct mm_struct *mm = vma->vm_mm;
>> > +
>> > +                     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
>> > +                             /* No need to hold VMA lock for migration */
>> > +                             vma_end_read(vma);
>> > +                             /* CAUTION! VMA can't be used after this */
>> > +                             ret |= VM_FAULT_VMA_UNLOCKED;
>> > +                     }
>> > +                     migration_entry_wait(mm, vmf->pmd, vmf->address);
>> >               } else if (is_device_exclusive_entry(entry)) {
>> >                       vmf->page = pfn_swap_entry_to_page(entry);
>> >                       ret = remove_device_exclusive_entry(vmf);
>>
>> --
>> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] mm: drop VMA lock before waiting for migration
  2023-05-03 13:03       ` Alistair Popple
@ 2023-05-03 19:42         ` Suren Baghdasaryan
  0 siblings, 0 replies; 24+ messages in thread
From: Suren Baghdasaryan @ 2023-05-03 19:42 UTC (permalink / raw)
  To: Alistair Popple
  Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, linux-mm, linux-fsdevel,
	linux-kernel, kernel-team

On Wed, May 3, 2023 at 6:05 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Suren Baghdasaryan <surenb@google.com> writes:
>
> > On Tue, May 2, 2023 at 6:26 AM 'Alistair Popple' via kernel-team
> > <kernel-team@android.com> wrote:
> >>
> >>
> >> Suren Baghdasaryan <surenb@google.com> writes:
> >>
> >> [...]
> >>
> >> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >> > index 306a3d1a0fa6..b3b57c6da0e1 100644
> >> > --- a/include/linux/mm_types.h
> >> > +++ b/include/linux/mm_types.h
> >> > @@ -1030,6 +1030,7 @@ typedef __bitwise unsigned int vm_fault_t;
> >> >   *                           fsync() to complete (for synchronous page faults
> >> >   *                           in DAX)
> >> >   * @VM_FAULT_COMPLETED:              ->fault completed, meanwhile mmap lock released
> >> > + * @VM_FAULT_VMA_UNLOCKED:   VMA lock was released
> >>
> >> A note here saying vmf->vma should no longer be accessed would be nice.
> >
> > Good idea. Will add in the next version. Thanks!
> >
> >>
> >> >   * @VM_FAULT_HINDEX_MASK:    mask HINDEX value
> >> >   *
> >> >   */
> >> > @@ -1047,6 +1048,7 @@ enum vm_fault_reason {
> >> >       VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
> >> >       VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
> >> >       VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
> >> > +     VM_FAULT_VMA_UNLOCKED   = (__force vm_fault_t)0x008000,
> >> >       VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
> >> >  };
> >> >
> >> > @@ -1070,7 +1072,9 @@ enum vm_fault_reason {
> >> >       { VM_FAULT_RETRY,               "RETRY" },      \
> >> >       { VM_FAULT_FALLBACK,            "FALLBACK" },   \
> >> >       { VM_FAULT_DONE_COW,            "DONE_COW" },   \
> >> > -     { VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" }
> >> > +     { VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" },  \
> >> > +     { VM_FAULT_COMPLETED,           "COMPLETED" },  \
> >>
> >> VM_FAULT_COMPLETED isn't used in this patch, guessing that's snuck in
> >> from one of the other patches in the series?
> >
> > I noticed that an entry for VM_FAULT_COMPLETED was missing and wanted
> > to fix that... Should I drop that?
>
> Oh ok. It would certainly be good to add but really it should be its
> own patch.

Ack. Will split in the next version. Thanks!
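
For what it's worth, the split-out patch would then just carry the missing
trace entry, roughly:

-	{ VM_FAULT_NEEDDSYNC,		"NEEDDSYNC" }
+	{ VM_FAULT_NEEDDSYNC,		"NEEDDSYNC" },	\
+	{ VM_FAULT_COMPLETED,		"COMPLETED" }

while the VM_FAULT_VMA_UNLOCKED entry stays in this patch.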

>
> >>
> >> > +     { VM_FAULT_VMA_UNLOCKED,        "VMA_UNLOCKED" }
> >> >
> >> >  struct vm_special_mapping {
> >> >       const char *name;       /* The name, e.g. "[vdso]". */
> >> > diff --git a/mm/memory.c b/mm/memory.c
> >> > index 41f45819a923..8222acf74fd3 100644
> >> > --- a/mm/memory.c
> >> > +++ b/mm/memory.c
> >> > @@ -3714,8 +3714,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >       entry = pte_to_swp_entry(vmf->orig_pte);
> >> >       if (unlikely(non_swap_entry(entry))) {
> >> >               if (is_migration_entry(entry)) {
> >> > -                     migration_entry_wait(vma->vm_mm, vmf->pmd,
> >> > -                                          vmf->address);
> >> > +                     /* Save mm in case VMA lock is dropped */
> >> > +                     struct mm_struct *mm = vma->vm_mm;
> >> > +
> >> > +                     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> >> > +                             /* No need to hold VMA lock for migration */
> >> > +                             vma_end_read(vma);
> >> > +                             /* CAUTION! VMA can't be used after this */
> >> > +                             ret |= VM_FAULT_VMA_UNLOCKED;
> >> > +                     }
> >> > +                     migration_entry_wait(mm, vmf->pmd, vmf->address);
> >> >               } else if (is_device_exclusive_entry(entry)) {
> >> >                       vmf->page = pfn_swap_entry_to_page(entry);
> >> >                       ret = remove_device_exclusive_entry(vmf);
> >>
> >> --
> >> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> >>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-03  8:34                 ` Yosry Ahmed
@ 2023-05-03 19:57                   ` Suren Baghdasaryan
  2023-05-03 20:57                     ` Yosry Ahmed
  0 siblings, 1 reply; 24+ messages in thread
From: Suren Baghdasaryan @ 2023-05-03 19:57 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Matthew Wilcox, akpm, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Wed, May 3, 2023 at 1:34 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > > > > > +++ b/mm/memory.c
> > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > > > >       if (!pte_unmap_same(vmf))
> > > > > > > > > >               goto out;
> > > > > > > > > >
> > > > > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > > > > > -             goto out;
> > > > > > > > > > -     }
> > > > > > > > > > -
> > > > > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > > > > > >               if (is_migration_entry(entry)) {
> > > > > > > > >
> > > > > > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > > > > > swap_readpage() is synchronous and will sleep.
> > > > > > > >
> > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > > > > > under mmap_lock?
> > > > > > >
> > > > > > > ... you were the one arguing that we didn't want to wait for I/O with
> > > > > > > the VMA lock held?
> > > > > >
> > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with
> > > > > > the lock being held. I argued against it because currently we drop
> > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > > > > > be changing the current behavior which might introduce new
> > > > > > regressions. In the case of swap_readpage and swapin_readahead we
> > > > > > already wait with mmap_lock held, so waiting with VMA lock held does
> > > > > > not introduce new problems (unless there is a need to hold mmap_lock).
> > > > > >
> > > > > > That said, you are absolutely correct that this situation can be
> > > > > > improved by dropping the lock in these cases too. I just didn't want
> > > > > > to attack everything at once. I believe after we agree on the approach
> > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > > > > > for dropping the VMA lock before waiting, these cases can be added
> > > > > > easier. Does that make sense?
> > > > >
> > > > > OK, I looked at this path some more, and I think we're fine.  This
> > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> > > > > (both btt and pmem).  So the answer is that we don't sleep in this
> > > > > path, and there's no need to drop the lock.
> > > >
> > > > Yes but swapin_readahead does sleep, so I'll have to handle that case
> > > > too after this.
> > >
> > > Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
> > > in swapin_readahead()?  It all looks like async I/O to me.
> >
> > Hmm. I thought that we have synchronous I/O in the following paths:
> >     swapin_readahead()->swap_cluster_readahead()->swap_readpage()
> >     swapin_readahead()->swap_vma_readahead()->swap_readpage()
> > but just noticed that in both cases swap_readpage() is called with the
> > synchronous parameter being false. So you are probably right here...
>
> In both swap_cluster_readahead() and swap_vma_readahead() it looks
> like if the readahead window is 1 (aka we are not reading ahead), then
> we jump to directly calling read_swap_cache_async() passing do_poll =
> true, which means we may end up calling swap_readpage() passing
> synchronous = true.
>
> I am not familiar with readahead heuristics, so I am not sure how
> common this is, but it's something to think about.

Uh, you are correct. If this branch is common, we could use the same
"drop the lock and retry" pattern inside read_swap_cache_async(). That
would be quite easy to implement.
Thanks for checking on it!
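
For the record, one possible shape of that, seen from the do_swap_page()
side and mirroring the migration-entry handling earlier in this series
(hypothetical sketch only, not part of this patchset;
"swap_read_would_be_sync" is a made-up placeholder for "the readahead
window collapsed to a single page"):

	if ((vmf->flags & FAULT_FLAG_VMA_LOCK) && swap_read_would_be_sync) {
		/*
		 * Same pattern as the migration case: drop the VMA lock
		 * and let the caller retry under mmap_lock.
		 */
		vma_end_read(vma);
		ret |= VM_FAULT_VMA_UNLOCKED | VM_FAULT_RETRY;
		goto out;
	}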

>
> > Does that mean swapin_readahead() might return a page which does not
> > have its content swapped-in yet?
> >

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-03 19:57                   ` Suren Baghdasaryan
@ 2023-05-03 20:57                     ` Yosry Ahmed
  2023-05-05  5:02                       ` Huang, Ying
  0 siblings, 1 reply; 24+ messages in thread
From: Yosry Ahmed @ 2023-05-03 20:57 UTC (permalink / raw)
  To: Suren Baghdasaryan, Ying
  Cc: Matthew Wilcox, akpm, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Wed, May 3, 2023 at 12:57 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, May 3, 2023 at 1:34 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> > > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > > > > > > +++ b/mm/memory.c
> > > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > > > > >       if (!pte_unmap_same(vmf))
> > > > > > > > > > >               goto out;
> > > > > > > > > > >
> > > > > > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > > > > > > -             goto out;
> > > > > > > > > > > -     }
> > > > > > > > > > > -
> > > > > > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > > > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > > > > > > >               if (is_migration_entry(entry)) {
> > > > > > > > > >
> > > > > > > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > > > > > > swap_readpage() is synchronous and will sleep.
> > > > > > > > >
> > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > > > > > > under mmap_lock?
> > > > > > > >
> > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with
> > > > > > > > the VMA lock held?
> > > > > > >
> > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with
> > > > > > > the lock being held. I argued against it because currently we drop
> > > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > > > > > > be changing the current behavior which might introduce new
> > > > > > > regressions. In the case of swap_readpage and swapin_readahead we
> > > > > > > already wait with mmap_lock held, so waiting with VMA lock held does
> > > > > > > not introduce new problems (unless there is a need to hold mmap_lock).
> > > > > > >
> > > > > > > That said, you are absolutely correct that this situation can be
> > > > > > > improved by dropping the lock in these cases too. I just didn't want
> > > > > > > to attack everything at once. I believe after we agree on the approach
> > > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > > > > > > for dropping the VMA lock before waiting, these cases can be added
> > > > > > > easier. Does that make sense?
> > > > > >
> > > > > > OK, I looked at this path some more, and I think we're fine.  This
> > > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> > > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> > > > > > (both btt and pmem).  So the answer is that we don't sleep in this
> > > > > > path, and there's no need to drop the lock.
> > > > >
> > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case
> > > > > too after this.
> > > >
> > > > Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
> > > > in swapin_readahead()?  It all looks like async I/O to me.
> > >
> > > Hmm. I thought that we have synchronous I/O in the following paths:
> > >     swapin_readahead()->swap_cluster_readahead()->swap_readpage()
> > >     swapin_readahead()->swap_vma_readahead()->swap_readpage()
> > > but just noticed that in both cases swap_readpage() is called with the
> > > synchronous parameter being false. So you are probably right here...
> >
> > In both swap_cluster_readahead() and swap_vma_readahead() it looks
> > like if the readahead window is 1 (aka we are not reading ahead), then
> > we jump to directly calling read_swap_cache_async() passing do_poll =
> > true, which means we may end up calling swap_readpage() passing
> > synchronous = true.
> >
> > I am not familiar with readahead heuristics, so I am not sure how
> > common this is, but it's something to think about.
>
> Uh, you are correct. If this branch is common, we could use the same
> "drop the lock and retry" pattern inside read_swap_cache_async(). That
> would be quite easy to implement.
> Thanks for checking on it!


I am honestly not sure how common this is.

+Ying who might have a better idea.


>
>
> >
> > > Does that mean swapin_readahead() might return a page which does not
> > > have its content swapped-in yet?
> > >

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-03 20:57                     ` Yosry Ahmed
@ 2023-05-05  5:02                       ` Huang, Ying
  2023-05-05 22:30                         ` Suren Baghdasaryan
  0 siblings, 1 reply; 24+ messages in thread
From: Huang, Ying @ 2023-05-05  5:02 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Suren Baghdasaryan, Matthew Wilcox, akpm, hannes, mhocko, josef,
	jack, ldufour, laurent.dufour, michel, liam.howlett, jglisse,
	vbabka, minchan, dave, punit.agrawal, lstoakes, hdanton, apopple,
	linux-mm, linux-fsdevel, linux-kernel, kernel-team, Ming Lei

Yosry Ahmed <yosryahmed@google.com> writes:

> On Wed, May 3, 2023 at 12:57 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> On Wed, May 3, 2023 at 1:34 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>> >
>> > On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote:
>> > >
>> > > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote:
>> > > >
>> > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
>> > > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
>> > > > > >
>> > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
>> > > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
>> > > > > > > >
>> > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
>> > > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
>> > > > > > > > > >
>> > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
>> > > > > > > > > > > +++ b/mm/memory.c
>> > > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > > > > > > > > > >       if (!pte_unmap_same(vmf))
>> > > > > > > > > > >               goto out;
>> > > > > > > > > > >
>> > > > > > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
>> > > > > > > > > > > -             ret = VM_FAULT_RETRY;
>> > > > > > > > > > > -             goto out;
>> > > > > > > > > > > -     }
>> > > > > > > > > > > -
>> > > > > > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
>> > > > > > > > > > >       if (unlikely(non_swap_entry(entry))) {
>> > > > > > > > > > >               if (is_migration_entry(entry)) {
>> > > > > > > > > >
>> > > > > > > > > > You're missing the necessary fallback in the (!folio) case.
>> > > > > > > > > > swap_readpage() is synchronous and will sleep.
>> > > > > > > > >
>> > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done
>> > > > > > > > > under mmap_lock?
>> > > > > > > >
>> > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with
>> > > > > > > > the VMA lock held?
>> > > > > > >
>> > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with
>> > > > > > > the lock being held. I argued against it because currently we drop
>> > > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
>> > > > > > > be changing the current behavior which might introduce new
>> > > > > > > regressions. In the case of swap_readpage and swapin_readahead we
>> > > > > > > already wait with mmap_lock held, so waiting with VMA lock held does
>> > > > > > > not introduce new problems (unless there is a need to hold mmap_lock).
>> > > > > > >
>> > > > > > > That said, you are absolutely correct that this situation can be
>> > > > > > > improved by dropping the lock in these cases too. I just didn't want
>> > > > > > > to attack everything at once. I believe after we agree on the approach
>> > > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
>> > > > > > > for dropping the VMA lock before waiting, these cases can be added
>> > > > > > > easier. Does that make sense?
>> > > > > >
>> > > > > > OK, I looked at this path some more, and I think we're fine.  This
>> > > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
>> > > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
>> > > > > > (both btt and pmem).  So the answer is that we don't sleep in this
>> > > > > > path, and there's no need to drop the lock.
>> > > > >
>> > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case
>> > > > > too after this.
>> > > >
>> > > > Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
>> > > > in swapin_readahead()?  It all looks like async I/O to me.
>> > >
>> > > Hmm. I thought that we have synchronous I/O in the following paths:
>> > >     swapin_readahead()->swap_cluster_readahead()->swap_readpage()
>> > >     swapin_readahead()->swap_vma_readahead()->swap_readpage()
>> > > but just noticed that in both cases swap_readpage() is called with the
>> > > synchronous parameter being false. So you are probably right here...
>> >
>> > In both swap_cluster_readahead() and swap_vma_readahead() it looks
>> > like if the readahead window is 1 (aka we are not reading ahead), then
>> > we jump to directly calling read_swap_cache_async() passing do_poll =
>> > true, which means we may end up calling swap_readpage() passing
>> > synchronous = true.
>> >
>> > I am not familiar with readahead heuristics, so I am not sure how
>> > common this is, but it's something to think about.
>>
>> Uh, you are correct. If this branch is common, we could use the same
>> "drop the lock and retry" pattern inside read_swap_cache_async(). That
>> would be quite easy to implement.
>> Thanks for checking on it!
>
>
> I am honestly not sure how common this is.
>
> +Ying who might have a better idea.

Checked the code and related history.  It seems that we can just pass
"synchronous = false" to swap_readpage() in read_swap_cache_async().
"synchronous = true" was introduced in commit 23955622ff8d ("swap: add
block io poll in swapin path") to reduce swap read latency for block
devices that can be polled.  But in commit 9650b453a3d4 ("block: ignore
RWF_HIPRI hint for sync dio"), the polling is deleted.  So, we don't
need to pass "synchronous = true" to swap_readpage() during
swapin_readahead(), because we will wait for the IO to complete in
folio_lock_or_retry().
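
In other words (rough sketch only, trailing arguments elided and context
approximate), the swap_readpage() call in read_swap_cache_async() would
simply become:

	if (page_was_allocated)
		swap_readpage(retpage, false /* was: do_poll */, ...);

and the do_poll plumbing could presumably be removed as a follow-up.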

Best Regards,
Huang, Ying

>>
>>
>> >
>> > > Does that mean swapin_readahead() might return a page which does not
>> > > have its content swapped-in yet?
>> > >

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
  2023-05-05  5:02                       ` Huang, Ying
@ 2023-05-05 22:30                         ` Suren Baghdasaryan
  0 siblings, 0 replies; 24+ messages in thread
From: Suren Baghdasaryan @ 2023-05-05 22:30 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Yosry Ahmed, Matthew Wilcox, akpm, hannes, mhocko, josef, jack,
	ldufour, laurent.dufour, michel, liam.howlett, jglisse, vbabka,
	minchan, dave, punit.agrawal, lstoakes, hdanton, apopple,
	linux-mm, linux-fsdevel, linux-kernel, kernel-team, Ming Lei

On Thu, May 4, 2023 at 10:03 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Wed, May 3, 2023 at 12:57 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> On Wed, May 3, 2023 at 1:34 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >> >
> >> > On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >> > >
> >> > > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote:
> >> > > >
> >> > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> >> > > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> >> > > > > >
> >> > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> >> > > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> >> > > > > > > >
> >> > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> >> > > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> >> > > > > > > > > > > +++ b/mm/memory.c
> >> > > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > > > > > > > > > >       if (!pte_unmap_same(vmf))
> >> > > > > > > > > > >               goto out;
> >> > > > > > > > > > >
> >> > > > > > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> >> > > > > > > > > > > -             ret = VM_FAULT_RETRY;
> >> > > > > > > > > > > -             goto out;
> >> > > > > > > > > > > -     }
> >> > > > > > > > > > > -
> >> > > > > > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> >> > > > > > > > > > >       if (unlikely(non_swap_entry(entry))) {
> >> > > > > > > > > > >               if (is_migration_entry(entry)) {
> >> > > > > > > > > >
> >> > > > > > > > > > You're missing the necessary fallback in the (!folio) case.
> >> > > > > > > > > > swap_readpage() is synchronous and will sleep.
> >> > > > > > > > >
> >> > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done
> >> > > > > > > > > under mmap_lock?
> >> > > > > > > >
> >> > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with
> >> > > > > > > > the VMA lock held?
> >> > > > > > >
> >> > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with
> >> > > > > > > the lock being held. I argued against it because currently we drop
> >> > > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> >> > > > > > > be changing the current behavior which might introduce new
> >> > > > > > > regressions. In the case of swap_readpage and swapin_readahead we
> >> > > > > > > already wait with mmap_lock held, so waiting with VMA lock held does
> >> > > > > > > not introduce new problems (unless there is a need to hold mmap_lock).
> >> > > > > > >
> >> > > > > > > That said, you are absolutely correct that this situation can be
> >> > > > > > > improved by dropping the lock in these cases too. I just didn't want
> >> > > > > > > to attack everything at once. I believe after we agree on the approach
> >> > > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> >> > > > > > > for dropping the VMA lock before waiting, these cases can be added
> >> > > > > > > easier. Does that make sense?
> >> > > > > >
> >> > > > > > OK, I looked at this path some more, and I think we're fine.  This
> >> > > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> >> > > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> >> > > > > > (both btt and pmem).  So the answer is that we don't sleep in this
> >> > > > > > path, and there's no need to drop the lock.
> >> > > > >
> >> > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case
> >> > > > > too after this.
> >> > > >
> >> > > > Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
> >> > > > in swapin_readahead()?  It all looks like async I/O to me.
> >> > >
> >> > > Hmm. I thought that we have synchronous I/O in the following paths:
> >> > >     swapin_readahead()->swap_cluster_readahead()->swap_readpage()
> >> > >     swapin_readahead()->swap_vma_readahead()->swap_readpage()
> >> > > but just noticed that in both cases swap_readpage() is called with the
> >> > > synchronous parameter being false. So you are probably right here...
> >> >
> >> > In both swap_cluster_readahead() and swap_vma_readahead() it looks
> >> > like if the readahead window is 1 (aka we are not reading ahead), then
> >> > we jump to directly calling read_swap_cache_async() passing do_poll =
> >> > true, which means we may end up calling swap_readpage() passing
> >> > synchronous = true.
> >> >
> >> > I am not familiar with readahead heuristics, so I am not sure how
> >> > common this is, but it's something to think about.
> >>
> >> Uh, you are correct. If this branch is common, we could use the same
> >> "drop the lock and retry" pattern inside read_swap_cache_async(). That
> >> would be quite easy to implement.
> >> Thanks for checking on it!
> >
> >
> > I am honestly not sure how common this is.
> >
> > +Ying who might have a better idea.
>
> Checked the code and related history.  It seems that we can just pass
> "synchronous = false" to swap_readpage() in read_swap_cache_async().
> "synchronous = true" was introduced in commit 23955622ff8d ("swap: add
> block io poll in swapin path") to reduce swap read latency for block
> devices that can be polled.  But in commit 9650b453a3d4 ("block: ignore
> RWF_HIPRI hint for sync dio"), the polling is deleted.  So, we don't
> need to pass "synchronous = true" to swap_readpage() during
> swapin_readahead(), because we will wait for the IO to complete in
> folio_lock_or_retry().

Thanks for investigating, Ying! It sounds like we can make some
simplifications here. I'll double-check and if I don't find anything
else, will change to "synchronous = false" in the next version of the
patchset.

>
> Best Regards,
> Huang, Ying
>
> >>
> >>
> >> >
> >> > > Does that mean swapin_readahead() might return a page which does not
> >> > > have its content swapped-in yet?
> >> > >
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-05-05 22:30 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-01 17:50 [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended Suren Baghdasaryan
2023-05-01 17:50 ` [PATCH 2/3] mm: drop VMA lock before waiting for migration Suren Baghdasaryan
2023-05-02 13:21   ` Alistair Popple
2023-05-02 16:39     ` Suren Baghdasaryan
2023-05-03 13:03       ` Alistair Popple
2023-05-03 19:42         ` Suren Baghdasaryan
2023-05-02 14:28   ` Matthew Wilcox
2023-05-02 16:41     ` Suren Baghdasaryan
2023-05-01 17:50 ` [PATCH 3/3] mm: implement folio wait under VMA lock Suren Baghdasaryan
2023-05-02  2:02 ` [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended Matthew Wilcox
2023-05-02  2:30   ` Suren Baghdasaryan
2023-05-02  3:22     ` Matthew Wilcox
2023-05-02  5:04       ` Suren Baghdasaryan
2023-05-02 15:03         ` Matthew Wilcox
2023-05-02 16:36           ` Suren Baghdasaryan
2023-05-02 22:31             ` Matthew Wilcox
2023-05-02 23:04               ` Suren Baghdasaryan
2023-05-02 23:40                 ` Matthew Wilcox
2023-05-03  1:05                   ` Suren Baghdasaryan
2023-05-03  8:34                 ` Yosry Ahmed
2023-05-03 19:57                   ` Suren Baghdasaryan
2023-05-03 20:57                     ` Yosry Ahmed
2023-05-05  5:02                       ` Huang, Ying
2023-05-05 22:30                         ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).