linux-kernel.vger.kernel.org archive mirror
From: Yang Shi <shy828301@gmail.com>
To: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>, Shakeel Butt <shakeelb@google.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Zi Yan <ziy@nvidia.com>, Matthew Wilcox <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux MM <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	"George G. Davis" <davis.george@siemens.com>
Subject: Re: [PATCH] mm: split thp synchronously on MADV_DONTNEED
Date: Mon, 29 Nov 2021 14:00:40 -0800	[thread overview]
Message-ID: <CAHbLzkpT3cU0PfmD2LeM9LuYjUhZhArRHxaP7JAU0koeDSXHvw@mail.gmail.com> (raw)
In-Reply-To: <d068c48d-897d-238b-010c-056951c9e3f1@redhat.com>

On Fri, Nov 26, 2021 at 1:04 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 26.11.21 03:52, Peter Xu wrote:
> > On Thu, Nov 25, 2021 at 11:32:08AM +0100, David Hildenbrand wrote:
> >> On 25.11.21 11:24, Peter Xu wrote:
> >>> On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote:
> >>>>> Do we have a performance evaluation of how much overhead is added, e.g., for
> >>>>> a single 4k MADV_DONTNEED call on a THP, or for a MADV_DONTNEED call that
> >>>>> covers the whole THP?
> >>>>
> >>>> I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on
> >>>> x86 for both settings you suggested. I don't see any statistically
> >>>> significant difference with and without the patch. Let me know if you
> >>>> want me to try something else.
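
(For reference, a measurement along those lines could look like the minimal
sketch below. The THP count, the one-4k-zap-per-THP loop, and the
alignment/MADV_HUGEPAGE setup are assumptions about the benchmark, not the
exact code that was run:)

    /* thp-dontneed.c: time MADV_DONTNEED on 4k out of each of N THPs.
     * Illustrative sketch only; needs ~20G of RAM and THP enabled
     * (at least "madvise") to be meaningful.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>

    #define THP_SIZE (2UL << 20)    /* assumes 2M THPs (x86-64) */
    #define NR_THPS  10000UL

    int main(void)
    {
            size_t len = NR_THPS * THP_SIZE;
            char *buf = aligned_alloc(THP_SIZE, len);
            struct timespec t0, t1;
            size_t i;

            if (!buf)
                    return 1;
            madvise(buf, len, MADV_HUGEPAGE);  /* hint the fault path/khugepaged */
            memset(buf, 1, len);               /* fault in, hopefully as THPs */

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (i = 0; i < NR_THPS; i++)      /* zap one 4k page per THP */
                    madvise(buf + i * THP_SIZE, 4096, MADV_DONTNEED);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            printf("%.3f ms\n", (t1.tv_sec - t0.tv_sec) * 1e3 +
                                (t1.tv_nsec - t0.tv_nsec) / 1e6);
            return 0;
    }

Running the same loop with a length of THP_SIZE instead of 4096 would cover
the whole-THP case asked about above.
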
> >>>
> >>> I'm a bit surprised that sync split thp didn't bring any extra overhead.
> >>>
> >>> "unmap whole thp" is understandable from that pov, because afaict that won't
> >>> even trigger any thp split anyway even delayed, if this is the simplest case
> >>> that only this process mapped this thp, and it mapped once.
> >>>
> >>> For "unmap 4k upon thp" IIUC that's the worst case and zapping 4k should be
> >>> fast; while what I don't understand since thp split requires all hand-made work
> >>> for copying thp flags into small pages and so on, so I thought there should at
> >>> least be some overhead measured.  Shakeel, could there be something overlooked
> >>> in the test, or maybe it's me that overlooked?
> >>>
> >>> I had the same concern as what Kirill/Matthew raised in the other thread - I'm
> >>> worried that proactively splitting simply because any 4k page is zapped might
> >>> quickly break up the 2m thps in the system, and I'm not sure whether it'll
> >>> exacerbate the fragmentation of system memory in general.  I'm also not sure
> >>> whether that's ideal for some very common workloads that frequently use
> >>> DONTNEED to proactively drop some pages.
> >>
> >> The pageblock corresponding to the THP is movable. So (unless we start
> >> spilling unmovable allocations into movable pageblocks) we'd only place
> >> movable allocations in there. Compaction will be able to migrate them to
> >> re-create a free THP.
> >>
> >> In contrast, I think compaction will happily skip over the THP and
> >> ignore it, because it has no clue that the THP could be repurposed by
> >> split+migrate (at least I am not aware of code that does that).
> >>
> >> Unless I am missing something, with the above in mind it could make
> >> sense to split as soon as possible, even before we're under memory
> >> pressure -- for example, for proactive compaction.
> >>
> >> [proactive compaction could try splitting first as well I think]
> >
> > But we can't rely on proactive compaction for rapid operations, because it's
> > still adding overhead to the overall system by split+merge, right?
>
> Yes, but there is also direct compaction that can be triggered without
> the shrinker getting involved. I think we can summarize it as "there might
> not be a right or wrong time to split". An application that
> MADV_DONTNEEDs/munmaps sub-THP memory has told us that it doesn't want to
> consume that memory, yet it looks like it's still consuming it.
>
> I do wonder how THPs on the deferred split queue behave with respect to
> page migration -- memory offlining, alloc_contig_range(). I saw reports

IIUC the old page that is on the deferred split queue would be freed and
taken off the list once the migration is done. But the new page is *NOT*
added to the deferred split list at all. This would not cause any severe
bugs, but the partially unmapped pages can no longer be split by the
deferred split shrinker under memory pressure.
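
Just to illustrate that last point, a rough sketch, not a tested patch: the
helper name and call site below are made up, and the real thing would need
the split_queue_lock and memcg-aware queue selection, but carrying the state
over during migration could look roughly like:

    #include <linux/mm.h>
    #include <linux/huge_mm.h>

    /*
     * Hypothetical helper, to be called from the THP migration path while
     * both pages are still held stable by the migration code: if the source
     * THP sits on a deferred split queue, queue the destination page too so
     * the shrinker can still split it under memory pressure.
     */
    static void transfer_deferred_split(struct page *newpage, struct page *page)
    {
            if (PageTransHuge(page) && !list_empty(page_deferred_list(page)))
                    deferred_split_huge_page(newpage);
    }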

> that there are some cases where THP can be problematic when
> stress-testing THP:
> https://lkml.kernel.org/r/20210903162102.GA10039@mam-gdavis-dt
>
> But not sure if that's related to deferred splitting. Most probably not.
>
> >
> > +compaction_proactiveness
> > +========================
> > + ...
> > +Note that compaction has a non-trivial system-wide impact as pages
> > +belonging to different processes are moved around, which could also lead
> > +to latency spikes in unsuspecting applications. The kernel employs
> > +various heuristics to avoid wasting CPU cycles if it detects that
> > +proactive compaction is not being effective.
> >
> > Delaying the split makes sense to me because, after all, the kernel is not aware
> > of the userspace's preference, so the best thing is to do nothing until necessary.
> >
> > Proactively splitting thps in dontneed/unmap adds an assumption that the userspace
> > wants to break up the pages by default.  It's 100% true for Shakeel's use case,
> > but I'm not sure whether it will always be true.  That's why I thought maybe a
> > new interface would be more proper, so we at least won't break anyone by accident.
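
(Purely hypothetical illustration of that idea -- no such advice value exists
today, and both the name and the number below are made up -- an explicit
opt-in could look like this from userspace:)

    #include <sys/mman.h>

    /* Hypothetical, not in any kernel: an madvise advice value that both
     * zaps the range and synchronously splits any THP backing it.
     * 0x100 is an entirely made-up value for illustration only. */
    #define MADV_DONTNEED_SPLIT 0x100

    static inline int discard_and_split(void *addr, size_t len)
    {
            return madvise(addr, len, MADV_DONTNEED_SPLIT);
    }
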
>
> Well, we already broke the PMD into PTEs. So the performance gain at
> least for that user is really gone until we "fix that" again via
> khugepaged -- which might just be configured to not "fix" if there are
> empty PTEs.
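
(The knob alluded to here is presumably
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none; with it set
to 0, khugepaged won't re-collapse a range that still contains empty PTEs,
so a deliberately zapped 4k hole stays zapped. A trivial sketch, assuming
the caller has permission to write that sysfs file:)

    #include <stdio.h>

    /* Sketch: ask khugepaged not to collapse ranges containing empty PTEs,
     * e.g. set_max_ptes_none(0).  Returns 0 on success, -1 on error. */
    static int set_max_ptes_none(unsigned long val)
    {
            FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/"
                            "khugepaged/max_ptes_none", "w");

            if (!f)
                    return -1;
            fprintf(f, "%lu\n", val);
            return fclose(f);
    }
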
>
> It for sure is interesting if you have a COW huge page and only one
> party zaps/unmaps some part.
>
>
> --
> Thanks,
>
> David / dhildenb
>

Thread overview: 32+ messages
2021-11-20 20:12 [PATCH] mm: split thp synchronously on MADV_DONTNEED Shakeel Butt
2021-11-21  4:35 ` Matthew Wilcox
2021-11-21  5:25   ` Shakeel Butt
2021-11-22  0:50 ` Kirill A. Shutemov
2021-11-22  3:42   ` Shakeel Butt
2021-11-22  4:56 ` Matthew Wilcox
2021-11-22  9:19   ` David Hildenbrand
2021-12-08 13:23     ` Pankaj Gupta
2021-11-22  8:32 ` David Hildenbrand
2021-11-22 18:40   ` Shakeel Butt
2021-11-22 18:59     ` David Hildenbrand
2021-11-23  1:20       ` Shakeel Butt
2021-11-23 16:56         ` David Hildenbrand
2021-11-23 17:17           ` Shakeel Butt
2021-11-23 17:20             ` David Hildenbrand
2021-11-23 17:24               ` Shakeel Butt
2021-11-23 17:26                 ` David Hildenbrand
2021-11-23 17:28                   ` Shakeel Butt
2021-11-25 10:09                     ` Peter Xu
2021-11-25 17:14                       ` Shakeel Butt
2021-11-26  0:00                         ` Peter Xu
2021-11-25 10:24     ` Peter Xu
2021-11-25 10:32       ` David Hildenbrand
2021-11-26  2:52         ` Peter Xu
2021-11-26  9:04           ` David Hildenbrand
2021-11-29 22:00             ` Yang Shi [this message]
2021-11-26  3:21       ` Shakeel Butt
2021-11-26  4:12         ` Peter Xu
2021-11-26  9:16           ` David Hildenbrand
2021-11-26  9:39             ` Peter Xu
2021-11-29 21:32             ` Yang Shi
2022-01-24 18:48           ` David Rientjes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save this message as an mbox file, import it into your mail client,
  and reply-to-all from there.

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAHbLzkpT3cU0PfmD2LeM9LuYjUhZhArRHxaP7JAU0koeDSXHvw@mail.gmail.com \
    --to=shy828301@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=davis.george@siemens.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=peterx@redhat.com \
    --cc=shakeelb@google.com \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header via
  mailto: links, reply that way. Be sure your reply has a Subject:
  header at the top and a blank line before the message body.