From: Peter Xu <peterx@redhat.com>
To: Tiberiu Georgescu <tiberiu.georgescu@nutanix.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Alistair Popple <apopple@nvidia.com>,
	Ivan Teterevkov <ivan.teterevkov@nutanix.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Hugh Dickins <hughd@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mike Kravetz <mike.kravetz@oracle.com>
Subject: Re: [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages
Date: Fri, 13 Aug 2021 12:01:42 -0400	[thread overview]
Message-ID: <YRaXZgBuoYX0sPhS@xz-m1.local> (raw)
In-Reply-To: <C0DB3FED-F779-4838-9697-D05BE96C3514@nutanix.com>

On Fri, Aug 13, 2021 at 03:18:22PM +0000, Tiberiu Georgescu wrote:
> Hello Peter,

Hi, Tiberiu,

> 
> Sorry for commenting so late.

No worries on that.  I really appreciate your follow-up on this problem.

> 
> > On 7 Aug 2021, at 04:25, Peter Xu <peterx@redhat.com> wrote:
> > 
> > When shmem pages are swapped out, instead of clearing the pte entry, we leave a
> > marker pte showing that this page is swapped out as a hint for pagemap.  A new
> > TTU flag is introduced to identify this case.
> > 
> > This can be useful for detecting swapped-out cold shmem pages.  Then, after
> > some background memory scanning work (which would fault in the shmem page and
> > confuse page reclaim), we can do MADV_PAGEOUT explicitly on this page to swap
> > it out again, as we know it was cold.
> > 
> > For pagemap, we don't need to explicitly set the PM_SWAP bit, because by nature
> > SWP_PTE_MARKER ptes are already counted as PM_SWAP due to their swap pte format.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > fs/proc/task_mmu.c   |  1 +
> > include/linux/rmap.h |  1 +
> > mm/rmap.c            | 19 +++++++++++++++++++
> > mm/vmscan.c          |  2 +-
> > 4 files changed, 22 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index eb97468dfe4c..21b8594abc1d 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1384,6 +1384,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> > 		if (pm->show_pfn)
> > 			frame = swp_type(entry) |
> > 				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> > +		/* NOTE: this covers PTE_MARKER_PAGEOUT too */
> > 		flags |= PM_SWAP;
> > 		if (is_pfn_swap_entry(entry))
> > 			page = pfn_swap_entry_to_page(entry);
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index c976cc6de257..318a0e95c7fb 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -95,6 +95,7 @@ enum ttu_flags {
> > 					 * do a final flush if necessary */
> > 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
> > 					 * caller holds it */
> > +	TTU_HINT_PAGEOUT	= 0x100, /* Hint for pageout operation */
> > };
> > 
> > #ifdef CONFIG_MMU
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index b9eb5c12f3fe..24a70b36b6da 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1384,6 +1384,22 @@ void page_remove_rmap(struct page *page, bool compound)
> > 	unlock_page_memcg(page);
> > }
> > 
> > +static inline void
> > +pte_marker_install(struct vm_area_struct *vma, pte_t *pte,
> > +		   struct page *page, unsigned long address)
> > +{
> > +#ifdef CONFIG_PTE_MARKER_PAGEOUT
> > +	swp_entry_t entry;
> > +	pte_t pteval;
> > +
> > +	if (vma_is_shmem(vma) && !PageAnon(page) && pte_none(*pte)) {
> > +		entry = make_pte_marker_entry(PTE_MARKER_PAGEOUT);
> > +		pteval = swp_entry_to_pte(entry);
> > +		set_pte_at(vma->vm_mm, address, pte, pteval);
> > +	}
> > +#endif
> > +}
> > +
> > /*
> >  * @arg: enum ttu_flags will be passed to this argument
> >  */
> > @@ -1628,6 +1644,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > 			 */
> > 			dec_mm_counter(mm, mm_counter_file(page));
> > 		}
> > +
> > +		if (flags & TTU_HINT_PAGEOUT)
> > +			pte_marker_install(vma, pvmw.pte, page, address);
> > discard:
> > 		/*
> > 		 * No need to call mmu_notifier_invalidate_range() it has be
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4620df62f0ff..4754af6fa24b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1493,7 +1493,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
> > 		 * processes. Try to unmap it here.
> > 		 */
> > 		if (page_mapped(page)) {
> > -			enum ttu_flags flags = TTU_BATCH_FLUSH;
> > +			enum ttu_flags flags = TTU_BATCH_FLUSH | TTU_HINT_PAGEOUT;
> > 			bool was_swapbacked = PageSwapBacked(page);
> > 
> > 			if (unlikely(PageTransHuge(page)))
> > -- 
> > 2.32.0
> > 
> The RFC looks good to me. It makes it much cleaner to verify whether a page
> is in swap or not, and there is a great performance benefit compared to my patch.
> 
> Currently, your patch does not have a way of putting the swap offset onto the
> pagemap entry when a shared page is swapped out, so it does not cover all use
> cases IMO.

Could you remind me why you need the swap offset?  I was trying to understand
your scenario, and I hope I summarized it right: what we want is to be able to
MADV_PAGEOUT pages that were swapped out.  Then IIUC the PM_SWAP bit should be
enough for that.  Did I miss something important?
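
Just to make sure we are talking about the same flow, below is a minimal
userspace sketch of what I'm assuming (the helper name, bookkeeping and error
handling are all mine, not from either patch; it only relies on the documented
pagemap format and on MADV_PAGEOUT, which needs a 5.4+ kernel).  Note that
pass 1 only sees PM_SWAP for shmem once the marker ptes from this series are
in place:

/*
 * Hypothetical sketch: remember which pages of a shmem buffer are
 * swapped out (PM_SWAP, bit 62 of a pagemap entry), then, after some
 * scanning work has faulted them back in, push the known-cold ones
 * out again with MADV_PAGEOUT.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define PM_SWAP (1ULL << 62)	/* pagemap: page is swapped out */

static void repageout_cold(void *buf, size_t len)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t pages = len / psz;
	uint8_t *cold = calloc(pages, 1);
	int fd = open("/proc/self/pagemap", O_RDONLY);
	uint64_t ent;

	if (!cold || fd < 0)
		goto out;

	/* pass 1: record which pages currently sit in swap (cold) */
	for (size_t i = 0; i < pages; i++) {
		off_t off = ((uintptr_t)buf / psz + i) * sizeof(ent);

		if (pread(fd, &ent, sizeof(ent), off) == sizeof(ent))
			cold[i] = !!(ent & PM_SWAP);
	}

	/* ... background scanning work faults some pages back in ... */

	/* pass 2: push the known-cold pages back out to swap */
	for (size_t i = 0; i < pages; i++)
		if (cold[i])
			madvise((char *)buf + i * psz, psz, MADV_PAGEOUT);
out:
	if (fd >= 0)
		close(fd);
	free(cold);
}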

> 
> To counter that, I added my patch on top of yours. By making use of your
> mechanism, we don't need to read the page cache xarray when the ptes are
> definitely none. This removes a lot of unnecessary overhead.
> 
> Main diff in fs/proc/task_mmu.c:pte_to_pagemap_entry():
>         } else if (is_swap_pte(pte)) {
>                 swp_entry_t entry;
>                 ...
>                 entry = pte_to_swp_entry(pte);
> +               if (is_pte_marker_entry(entry)) {
> +                       void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
>                 ...
> 
> For reference: https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/.
> 
> After some preliminary performance tests on VMs, I noticed a significant
> performance improvement in my patch when yours is added. Here is the
> most interesting result:
> 
> Program I tested it on: Same as https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
> 
> Specs:
>     Memory: 32GB memory, 64GB swap
>     Params: 16GB allocated shmem, 50% dirty, 4GB cgroup, no batching
> 
> In short, the pagemap read deals with:
>     * ~4GB of physical memory (present pages),
>     * ~4GB in swap (swapped pages), and
>     * ~8GB non-dirty (none pages).
> 
> Results (in seconds; rows: without/with your patch, columns: without/with mine):
> +------------+-------+-------+
> | Peter\Tibi |  Pre  | Post  |
> +------------+-------+-------+
> | Pre        | 11.82 | 38.44 |
> | Post       | 13.28 | 22.34 |
> +------------+-------+-------+
> 
> With your patch applied, mine yields the results in roughly twice the baseline
> time, compared to ~4x without it. I am aware it is not ideal, but until it is
> decided whether a whole different system is preferred over PTE markers, our
> solutions together currently deliver the best performance for correctness.
> Thank you for finding this solution.

Thanks for trying that out!  It's great to know these test results.

> 
> I am happy to introduce a configuration variable to enable my pagemap
> patch only if necessary.

Right.  We can use a config for that.  I didn't mention it when replying to
your previous patchset because I thought that, with or without the config, it
should still be better to use the pte markers, as they should be more efficient.

However, I think I need to double-check that I didn't miss anything you're
looking for, e.g. I need to understand why the swp offset matters here, as I
asked above.  I must confess that this cannot be trivially done with pte
markers yet - keeping a swap hint is definitely not the same story as keeping
a full swp entry.
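
For the record, here is my rough guess (a hedged sketch only - the body below
is my assumption, not your actual code) at what a helper like
get_xa_entry_at_vma_addr() from your diff above could do for shmem; it needs
linux/pagemap.h and linux/xarray.h:

/*
 * Hypothetical body: once a shmem page is swapped out, the mapping's
 * xarray slot holds the swap entry as an xa value, so pagemap could
 * recover the swp offset from there instead of from the (marker) pte.
 */
static void *get_xa_entry_at_vma_addr(struct vm_area_struct *vma,
				      unsigned long addr)
{
	struct address_space *mapping = file_inode(vma->vm_file)->i_mapping;
	pgoff_t pgoff = linear_page_index(vma, addr);

	return xa_load(&mapping->i_pages, pgoff);
}

If the returned xa value is indeed a swap entry, radix_to_swp_entry() should
hand back the type and offset that pagemap wants, at the cost of the extra
page cache lookup we are discussing.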

> Either way, if performance is a must, batching is still the best way to
> access multiple pagemap entries.

I agree, especially since with pmd pgtable locks things can happen
concurrently.

It's just a pity that the major overhead, compared to the old way, is the page
cache lookup, especially since, as you pointed out, the capability to
distinguish used ptes from empty ptes matters.  That's kind of orthogonal to
batching to me.
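
E.g. something like the sketch below on the userspace side (the helper name is
mine): a single pread() pulls in a whole run of pagemap entries, so the kernel
walks the pgtables for the entire range in one go instead of paying the
syscall and walk setup cost once per entry:

/*
 * Sketch of batched pagemap access: fetch up to 'n' 8-byte pagemap
 * entries in one pread().  Returns the number of entries read, or -1
 * on error.
 */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static ssize_t pagemap_read_batch(int pm_fd, uintptr_t vaddr, long psz,
				  uint64_t *ents, size_t n)
{
	off_t off = (off_t)(vaddr / psz) * sizeof(uint64_t);
	ssize_t ret = pread(pm_fd, ents, n * sizeof(uint64_t), off);

	return ret < 0 ? -1 : ret / (ssize_t)sizeof(uint64_t);
}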

Thanks,

-- 
Peter Xu



Thread overview: 21+ messages
2021-08-07  3:25 [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER Peter Xu
2021-08-07  3:25 ` [PATCH RFC 1/4] mm: Introduce PTE_MARKER swap entry Peter Xu
2021-08-07  3:25 ` [PATCH RFC 2/4] mm: Check against orig_pte for finish_fault() Peter Xu
2021-08-07  3:25 ` [PATCH RFC 3/4] mm: Handle PTE_MARKER page faults Peter Xu
2021-08-07  3:25 ` [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages Peter Xu
2021-08-13 15:18   ` Tiberiu Georgescu
2021-08-13 16:01     ` Peter Xu [this message]
2021-08-18 18:02       ` Tiberiu Georgescu
2021-08-17  9:04 ` [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER David Hildenbrand
2021-08-17 17:09   ` Peter Xu
2021-08-17 18:46     ` David Hildenbrand
2021-08-17 20:24       ` Peter Xu
2021-08-18  8:24         ` David Hildenbrand
2021-08-18 17:52           ` Tiberiu Georgescu
2021-08-18 18:13             ` David Hildenbrand
2021-08-19 14:54               ` Tiberiu Georgescu
2021-08-19 17:26                 ` David Hildenbrand
2021-08-20 16:49                   ` Tiberiu Georgescu
2021-08-20 19:12                     ` Peter Xu
2021-08-25 13:40                       ` Tiberiu Georgescu
2021-08-25 14:59                         ` Peter Xu
