Re: [PATCH 2/2] mm: Remember young bit for page migrations

From: Nadav Amit <nadav.amit@gmail.com>
To: Peter Xu <peterx@redhat.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, Andrea Arcangeli <aarcange@redhat.com>,
	Andi Kleen <andi.kleen@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Hugh Dickins <hughd@google.com>,
	Huang Ying <ying.huang@intel.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH 2/2] mm: Remember young bit for page migrations
Date: Wed, 3 Aug 2022 00:42:54 -0700	[thread overview]
Message-ID: <EEBB9DDB-3D67-42BD-BD16-4264F40760CD@gmail.com> (raw)
In-Reply-To: <20220803012159.36551-3-peterx@redhat.com>

On Aug 2, 2022, at 6:21 PM, Peter Xu <peterx@redhat.com> wrote:

> When page migration happens, we always ignore the young bit settings in the
> old pgtable, and marking the page as old in the new page table using either
> pte_mkold() or pmd_mkold().
> 
> That's fine from functional-wise, but that's not friendly to page reclaim
> because the moving page can be actively accessed within the procedure.  Not
> to mention hardware setting the young bit can bring quite some overhead on
> some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> 
> Actually we can easily remember the young bit configuration and recover the
> information after the page is migrated.  To achieve it, define a new bit in
> the migration swap offset field to show whether the old pte has young bit
> set or not.  Then when removing/recovering the migration entry, we can
> recover the young bit even if the page changed.
> 
> One thing to mention is that here we used max_swapfile_size() to detect how
> many swp offset bits we have, and we'll only enable this feature if we know
> the swp offset can be big enough to store both the PFN value and the young
> bit.  Otherwise the young bit is dropped like before.

I gave it some more thought and I am less confident whether this is the best
solution. Not sure it is not either, so I am raising an alternative with
pros and cons.

An alternative would be to propagate the access bit into the page (i.e.,
using folio_set_young()) and then set it back into the PTE later (i.e.,
based on folio_test_young()). It might even seem that in general it is
better to always set the page access bit if folio_test_young().

This can be simpler and more performant. Setting the access-bit would not
impact reclaim decisions (as the page is already considered young), would
not induce overheads on clearing the access-bit (no TLB flush is needed at
least on x86), and would save the time the CPU takes to set the access bit
if the page is ever accessed (on x86).

It may also improve the preciseness of page-idle mechanisms and the
interaction with it. IIUC, page-idle does not consider migration entries, so
the user would not get indication that pages under migration are not idle.
When page-idle is reset, migrated pages might be later reinstated as
“accessed”, giving wrong indication that the pages are not-idle, when in
fact they are.

On the negative side, I am not sure whether other archs, that might require
a TLB flush for resetting the access-bit, and the overhead of doing atomic
operation to clear the access-bit, would not induce more overhead than they
would save.

One more unrelated point - note that remove_migration_pte() would always set
a clean PTE even when the old one was dirty…