All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Tiberiu Georgescu <tiberiu.georgescu@nutanix.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Alistair Popple <apopple@nvidia.com>,
	Ivan Teterevkov <ivan.teterevkov@nutanix.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Hugh Dickins <hughd@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mike Kravetz <mike.kravetz@oracle.com>
Subject: Re: [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages
Date: Fri, 13 Aug 2021 12:01:42 -0400	[thread overview]
Message-ID: <YRaXZgBuoYX0sPhS@xz-m1.local> (raw)
In-Reply-To: <C0DB3FED-F779-4838-9697-D05BE96C3514@nutanix.com>

On Fri, Aug 13, 2021 at 03:18:22PM +0000, Tiberiu Georgescu wrote:
> Hello Peter,

Hi, Tiberiu,

> 
> Sorry for commenting so late.

No worry on that.  I appreciate a lot your follow up on this problem.

> 
> > On 7 Aug 2021, at 04:25, Peter Xu <peterx@redhat.com> wrote:
> > 
> > When shmem pages are swapped out, instead of clearing the pte entry, we leave a
> > marker pte showing that this page is swapped out as a hint for pagemap.  A new
> > TTU flag is introduced to identify this case.
> > 
> > This can be useful for detecting swapped out cold shmem pages.  Then after some
> > memory background scanning work (which will fault in the shmem page and
> > confusing page reclaim), we can do MADV_PAGEOUT explicitly on this page to swap
> > it out again as we know it was cold.
> > 
> > For pagemap, we don't need to explicitly set PM_SWAP bit, because by nature
> > SWP_PTE_MARKER ptes are already counted as PM_SWAP due to it's format as swap.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > fs/proc/task_mmu.c   |  1 +
> > include/linux/rmap.h |  1 +
> > mm/rmap.c            | 19 +++++++++++++++++++
> > mm/vmscan.c          |  2 +-
> > 4 files changed, 22 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index eb97468dfe4c..21b8594abc1d 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1384,6 +1384,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> > 		if (pm->show_pfn)
> > 			frame = swp_type(entry) |
> > 				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> > +		/* NOTE: this covers PTE_MARKER_PAGEOUT too */
> > 		flags |= PM_SWAP;
> > 		if (is_pfn_swap_entry(entry))
> > 			page = pfn_swap_entry_to_page(entry);
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index c976cc6de257..318a0e95c7fb 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -95,6 +95,7 @@ enum ttu_flags {
> > 					 * do a final flush if necessary */
> > 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
> > 					 * caller holds it */
> > +	TTU_HINT_PAGEOUT	= 0x100, /* Hint for pageout operation */
> > };
> > 
> > #ifdef CONFIG_MMU
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index b9eb5c12f3fe..24a70b36b6da 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1384,6 +1384,22 @@ void page_remove_rmap(struct page *page, bool compound)
> > 	unlock_page_memcg(page);
> > }
> > 
> > +static inline void
> > +pte_marker_install(struct vm_area_struct *vma, pte_t *pte,
> > +		   struct page *page, unsigned long address)
> > +{
> > +#ifdef CONFIG_PTE_MARKER_PAGEOUT
> > +	swp_entry_t entry;
> > +	pte_t pteval;
> > +
> > +	if (vma_is_shmem(vma) && !PageAnon(page) && pte_none(*pte)) {
> > +		entry = make_pte_marker_entry(PTE_MARKER_PAGEOUT);
> > +		pteval = swp_entry_to_pte(entry);
> > +		set_pte_at(vma->vm_mm, address, pte, pteval);
> > +	}
> > +#endif
> > +}
> > +
> > /*
> >  * @arg: enum ttu_flags will be passed to this argument
> >  */
> > @@ -1628,6 +1644,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > 			 */
> > 			dec_mm_counter(mm, mm_counter_file(page));
> > 		}
> > +
> > +		if (flags & TTU_HINT_PAGEOUT)
> > +			pte_marker_install(vma, pvmw.pte, page, address);
> > discard:
> > 		/*
> > 		 * No need to call mmu_notifier_invalidate_range() it has be
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4620df62f0ff..4754af6fa24b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1493,7 +1493,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
> > 		 * processes. Try to unmap it here.
> > 		 */
> > 		if (page_mapped(page)) {
> > -			enum ttu_flags flags = TTU_BATCH_FLUSH;
> > +			enum ttu_flags flags = TTU_BATCH_FLUSH | TTU_HINT_PAGEOUT;
> > 			bool was_swapbacked = PageSwapBacked(page);
> > 
> > 			if (unlikely(PageTransHuge(page)))
> > -- 
> > 2.32.0
> > 
> The RFC looks good to me. It makes it much cleaner to verify whether the page
> is in swap or not, and there is great performance benefit compared to my patch.
> 
> Currently, your patch does not have a way of putting the swap offset onto the
> entry when a shared page is swapped, so it does not cover all use cases
> IMO.

Could you remind me on why you need the swap offset?  I was trying to
understand your scenaior and I hope I summarized it right: what we want to do
is being able to MADV_PAGEOUT pages that was swapped out.  Then IIUC the
PM_SWAP bit should be enough for it.  Did I miss something important?

> 
> To counter that, I added my patch on top of yours. By making use of your
> mechanism, we don't need to read the page cache xarray when we pages
> are definitely none. This reduces much unnecessary overhead.
> 
> Main diff in fs/proc/task_mmu.c:pte_to_pagemap_entry():
>         } else if (is_swap_pte(pte)) {
>                 swp_entry_t entry;
>                 ...
>                 entry = pte_to_swp_entry(pte);
> +               if (is_pte_marker_entry(entry)) {
> +                       void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
>                 ...
> 
> For reference: https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/.
> 
> After some preliminary performance tests on VMs, I noticed a significant
> performance improvement in my patch, when yours is added. Here is the
> most interesting result:
> 
> Program I tested it on: Same as https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
> 
> Specs:
>     Memory: 32GB memory, 64GB swap
>     Params: 16GB allocated shmem, 50% dirty, 4GB cgroup, no batching
> 
> In short, the pagemap read deals with 
>     * ~4GB of physical memory (present pages), 
>     * ~4GB in swap (swapped pages)
>     * 8GB non-dirty (none pages).
> 
> Results:
> +------------+-------+-------+
> | Peter\Tibi |  Pre  | Post  | (in seconds)
> +------------+-------+-------+
> | Pre        | 11.82 | 38.44 |
> | Post       | 13.28 | 22.34 |
> +------------+-------+-------+
> 
> With your patch, mine yields the results in twice the time, compared to 4x.
> I am aware it is not ideal, but until it is decided whether a whole different
> system is preferred to PTE markers, our solutions together currently deliver
> the best performance for correctness. Thank you for finding this solution.

Thanks for trying that out!  It's great to know these test results.

> 
> I am happy to introduce a configuration variable to enable my pagemap
> patch only if necessary.

Right.  We can use a config for that, but I didn't mention when replying to
your previous patchset because I thought w/ or w/o the config it should still
be better to use the pte markers because it should be more efficient.

However I think I need to double check I didn't miss anything that you're
looking for. E.g. I need to understand why swp offset matters here as I asked.
I must confess that cannot be trivially done with pte markers yet - keeping a
swap hint is definitely not the same story as keeping a swp entry.

> Either way, if performance is a must, batching is still the best way to
> access multiple pagemap entries.

I agree, especially when we have pmd pgtable locks things can happen
concurrently.

It's just that it's a pity the major overhead comparing to the old way is at
page cache look up, especially as you pointed out - the capability to identify
used ptes with empty ptes matters.  That's kind of orthogonal to batching to me.

Thanks,

-- 
Peter Xu


  reply	other threads:[~2021-08-13 16:01 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-07  3:25 [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER Peter Xu
2021-08-07  3:25 ` [PATCH RFC 1/4] mm: Introduce PTE_MARKER swap entry Peter Xu
2021-08-07  3:25 ` [PATCH RFC 2/4] mm: Check against orig_pte for finish_fault() Peter Xu
2021-08-07  3:25 ` [PATCH RFC 3/4] mm: Handle PTE_MARKER page faults Peter Xu
2021-08-07  3:25 ` [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages Peter Xu
2021-08-13 15:18   ` Tiberiu Georgescu
2021-08-13 16:01     ` Peter Xu [this message]
2021-08-18 18:02       ` Tiberiu Georgescu
2021-08-17  9:04 ` [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER David Hildenbrand
2021-08-17 17:09   ` Peter Xu
2021-08-17 18:46     ` David Hildenbrand
2021-08-17 20:24       ` Peter Xu
2021-08-18  8:24         ` David Hildenbrand
2021-08-18 17:52           ` Tiberiu Georgescu
2021-08-18 18:13             ` David Hildenbrand
2021-08-19 14:54               ` Tiberiu Georgescu
2021-08-19 17:26                 ` David Hildenbrand
2021-08-20 16:49                   ` Tiberiu Georgescu
2021-08-20 19:12                     ` Peter Xu
2021-08-25 13:40                       ` Tiberiu Georgescu
2021-08-25 14:59                         ` Peter Xu
2021-08-26 15:01                           ` Tiberiu Georgescu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YRaXZgBuoYX0sPhS@xz-m1.local \
    --to=peterx@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=apopple@nvidia.com \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=ivan.teterevkov@nutanix.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=tiberiu.georgescu@nutanix.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.