From: Hugh Dickins <hughd@google.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Will Deacon <will@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Jan Kara <jack@suse.cz>, Minchan Kim <minchan@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vinayak Menon <vinmenon@codeaurora.org>,
	Android Kernel Team <kernel-team@android.com>
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting
Date: Wed, 23 Dec 2020 20:04:54 -0800 (PST)
Message-ID: <alpine.LSU.2.11.2012231905300.5723@eggly.anvils>
In-Reply-To: <20201222100047.p5zdb4ghagncq2oe@box>

On Tue, 22 Dec 2020, Kirill A. Shutemov wrote:
> 
> Updated patch is below.
> 
> From 0ec1bc1fe95587350ac4f4c866d6482383740b36 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Sat, 19 Dec 2020 15:19:23 +0300
> Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths
> 
> alloc_set_pte() has two users with different requirements: in the
> faultaround code, it is called from an atomic context and the PTE page
> table has to be preallocated. finish_fault() can sleep and allocate the
> page table as needed.
> 
> PTL locking rules are also strange, hard to follow and overkill for
> finish_fault().
> 
> Let's untangle the mess. alloc_set_pte() has gone now. All locking is
> explicit.
> 
> The price is some code duplication to handle huge pages in the
> faultaround path, but it should be fine given the overall improvement
> in readability.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

It's not ready yet.

I won't pretend to have reviewed it, but I did try applying and running
with it: mostly it seems to work fine, but it turned out to be leaking
huge pages (with vmstat's thp_split_page_failed growing bigger and
bigger as page reclaim cannot get rid of them).

Aside from the actual bug, filemap_map_pmd() seems suboptimal at
present: comments below (plus one comment in do_anonymous_page()).

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 0b2067b3c328..f8fdbe079375 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2831,10 +2832,74 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  }
>  EXPORT_SYMBOL(filemap_fault);
>  
> +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page,
> +				  struct xa_state *xas)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +
> +	/* Huge page is mapped? No need to proceed. */
> +	if (pmd_trans_huge(*vmf->pmd))
> +		return true;
> +
> +	if (xa_is_value(page))
> +		goto nohuge;

I think it would be easier to follow if filemap_map_pages() never
passed an xa_is_value(page) into this function: probably just skip
them in its initial xas_next_entry() loop, as sketched below.
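
A minimal sketch of what I mean, with a made-up helper name (nothing in
the patch is called this), just to show the shape of the loop:

	static struct page *next_real_page(struct xa_state *xas, pgoff_t end_pgoff)
	{
		struct page *page;

		/* Skip shadow/swap/DAX value entries so that only real
		 * pages ever reach the mapping code. */
		do {
			page = xas_next_entry(xas, end_pgoff);
		} while (page && xa_is_value(page));

		return page;
	}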

> +
> +	if (!pmd_none(*vmf->pmd))
> +		goto nohuge;

Then at nohuge it unconditionally takes pmd_lock(), finds !pmd_none,
and unlocks again: unnecessary overhead I believe we did not have before.
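
Untested sketch of how the fallback path could avoid that lock when the
pmd is already populated (only to illustrate the point, not a tested fix):

	nohuge:
		/* Only take the lock if there is still a chance of
		 * installing the preallocated page table. */
		if (pmd_none(*vmf->pmd)) {
			vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
			if (likely(pmd_none(*vmf->pmd))) {
				mm_inc_nr_ptes(vma->vm_mm);
				pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
				vmf->prealloc_pte = NULL;
			}
			spin_unlock(vmf->ptl);
		}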

> +
> +	if (!PageTransHuge(page) || PageLocked(page))
> +		goto nohuge;

So if the page is PageTransHuge, but someone else temporarily holds the
page lock, we insert a page table at nohuge, sadly preventing the page
from being mapped here later by a huge pmd.

> +
> +	if (!page_cache_get_speculative(page))
> +		goto nohuge;
> +
> +	if (page != xas_reload(xas))
> +		goto unref;
> +
> +	if (!PageTransHuge(page))
> +		goto unref;
> +
> +	if (!PageUptodate(page) || PageReadahead(page) || PageHWPoison(page))
> +		goto unref;
> +
> +	if (!trylock_page(page))
> +		goto unref;
> +
> +	if (page->mapping != mapping || !PageUptodate(page))
> +		goto unlock;
> +
> +	if (xas->xa_index >= DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE))
> +		goto unlock;
> +
> +	do_set_pmd(vmf, page);

Here is the source of the huge page leak: do_set_pmd() can fail
(and we would do better to have skipped most of its failure cases long
before getting this far).  It worked without leaking once I patched it:

-	do_set_pmd(vmf, page);
-	unlock_page(page);
-	return true;
+	if (do_set_pmd(vmf, page) == 0) {
+		unlock_page(page);
+		return true;
+	}

> +	unlock_page(page);
> +	return true;
> +unlock:
> +	unlock_page(page);
> +unref:
> +	put_page(page);
> +nohuge:
> +	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> +	if (likely(pmd_none(*vmf->pmd))) {
> +		mm_inc_nr_ptes(vma->vm_mm);
> +		pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
> +		vmf->prealloc_pte = NULL;
> +	}
> +	spin_unlock(vmf->ptl);

I think it's a bit weird to hide this page table insertion inside
filemap_map_pmd() (I guess you're thinking that this function deals
with the pmd level, but I'd find it easier to follow a filemap_map_huge()
that deals only with the huge mapping).  Better to do the insertion on
return into filemap_map_pages(), something like the sketch below; maybe
filemap_map_pmd() or filemap_map_huge() would then need to return
vm_fault_t rather than bool, I didn't try.
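
Very roughly, the shape I have in mind back in filemap_map_pages()
(filemap_map_huge() and the "out" label are illustrative names, and I
have not thought through the vm_fault_t question):

	if (filemap_map_huge(vmf, page, &xas))
		goto out;	/* mapped with a huge pmd, nothing more to do */

	/* Otherwise install the preallocated page table here, in the
	 * caller, rather than hiding it inside the pmd helper. */
	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
	if (likely(pmd_none(*vmf->pmd))) {
		mm_inc_nr_ptes(vma->vm_mm);
		pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
		vmf->prealloc_pte = NULL;
	}
	spin_unlock(vmf->ptl);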

> +
> +	/* See comment in handle_pte_fault() */
> +	if (pmd_devmap_trans_unstable(vmf->pmd))
> +		return true;
> +
> +	return false;
> +}
...
> diff --git a/mm/memory.c b/mm/memory.c
> index c48f8df6e502..96d62774096a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	if (pte_alloc(vma->vm_mm, vmf->pmd))
>  		return VM_FAULT_OOM;
>  
> -	/* See the comment in pte_alloc_one_map() */
> +	/* See the comment in map_set_pte() */

No, there is no such function: it should probably match the others and say
	/* See comment in handle_pte_fault() */

Hugh

