Re: + mm-introduce-reported-pages.patch added to -mm tree

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>,
	David Hildenbrand <david@redhat.com>,
	 akpm@linux-foundation.org, aarcange@redhat.com,
	dan.j.williams@intel.com,  dave.hansen@intel.com,
	konrad.wilk@oracle.com, lcapitulino@redhat.com,
	 mm-commits@vger.kernel.org, mst@redhat.com, osalvador@suse.de,
	pagupta@redhat.com,  pbonzini@redhat.com, riel@surriel.com,
	vbabka@suse.cz, wei.w.wang@intel.com,  willy@infradead.org,
	yang.zhang.wz@gmail.com, linux-mm@kvack.org
Subject: Re: + mm-introduce-reported-pages.patch added to -mm tree
Date: Wed, 06 Nov 2019 16:20:56 -0800	[thread overview]
Message-ID: <673862eb7f0425f638ea3fc507d0e8049ee4133c.camel@linux.intel.com> (raw)
In-Reply-To: <20191106221150.GR3016@techsingularity.net>

On Wed, 2019-11-06 at 22:11 +0000, Mel Gorman wrote:
> On Wed, Nov 06, 2019 at 09:48:14AM -0800, Alexander Duyck wrote:
> > On Wed, 2019-11-06 at 17:54 +0100, Michal Hocko wrote:
> > > On Wed 06-11-19 08:35:43, Alexander Duyck wrote:
> > > > On Wed, 2019-11-06 at 15:09 +0100, David Hildenbrand wrote:
> > > > > > Am 06.11.2019 um 13:16 schrieb Michal Hocko <mhocko@kernel.org>:
> > > > > > 
> > > > > > ???I didn't have time to read through newer versions of this patch series
> > > > > > but I remember there were concerns about this functionality being pulled
> > > > > > into the page allocator previously both by me and Mel [1][2]. Have those been 
> > > > > > addressed? I do not see an ack from Mel or any other MM people. Is there
> > > > > > really a consensus that we want something like that living in the
> > > > > > allocator?
> > > > > 
> > > > > I don???t think there is. The discussion is still ongoing (although quiet,
> > > > > Nitesh is working on a new version AFAIK). I think we should not rush
> > > > > this.
> > > > 
> > > > How much time is needed to get a review? I waited 2 weeks since posting
> > > > v12 and the only comments I got on the code were from Andrew. Most of this
> > > > hasn't changed much since v10 and that was posted back in mid September. I
> > > > have been down to making small tweaks here and there and haven't had any
> > > > real critiques on the approach since Mel had the comments about conflicts
> > > > with compaction which I addressed by allowing compaction to punt the
> > > > reporter out so that it could split and splice the lists as it walked
> > > > through them.
> > > 
> > > Well, people are busy and MM community is not a large one. I cannot
> > > really help you much other than keep poking those people and give
> > > reasonable arguments so they decide to ack your patch.
> > 
> > I get that. But v10 was posted in mid September. Back then we had a
> > discussion about addressing what Mel had mentioned and I had mentioned
> > then that I had addressed it by allowing compaction to essentially reset
> > the reporter to get it out of the list so compaction could do this split
> > and splice tumbling logic.
> > 
> 
> At the time I said "While in theory that could be addressed by always
> going through an interface maintained by the page allocator, it would be
> tricky to test the virtio case in particular."
> 
> Now, I get that you added an interface for that *but* if compaction was
> ever updated or yet another approach was taken to deal with it, virtio
> could get broken. If the page allocator itself has a bug or compaction
> has a bug, the effect is very visible. While stuff does slip through, it
> tends to be an obscure corner case. However, when virtio gets broken,
> it'll be a long time before we get it.

Specifically all we are doing is walking through a list using a pointer as
an iterator.  I would think we would likely see the page reporting blow up
pretty quickly if we were to somehow mess up the list so bad that it still
had access to a page that was no longer on the list. Other than that if
the list is just shuffled without resetting the pointer then worst case
would be that we end up with the reporting being rearmed as soon as we
were complete due to a batch of unreported pages being shuffled past the
iterator.

> Then we get onto the locking, the API appears to require the zone lock
> be held which is not particularly obvious but it gets better. The zone
> lock is a *heavy* lock now. By rights, it should be split into a
> freelist and zone metadata lock as appropriate and then split the
> freelist lock. Compaction could be updated and checked trivially, but
> not so much virtio.

The zone metadata I added is all stuff I would have preferred to put in
the freelist anyway. However doing that would cause a performance
regression since it would cause our highest order zone to share a cache
line with the zone lock. Instead I am maintaining it via the zone flags
and a separately allocated block of reported pages statistics.

If you are talking about the ph_dev_info it already has a separate mutex
to protect it when we are adding/removing it.

> > > I definitely do not intent to nack this work, I just have maintainability
> > > concerns and considering there is an alternative approach that does not
> > > require to touch page allocator internals and which we need to compare
> > > against then I do not really think there is any need to push something
> > > in right away. Or is there any pressing reason to have this merged right
> > > now?
> > 
> > The alternative approach doesn't touch the page allocator, however it
> > still has essentially the same changes to __free_one_page. I suspect the
> > performance issue seen is mostly due to the fact that because it doesn't
> > touch the page allocator it is taking the zone lock and probing the page
> > for each set bit to see if the page is still free. As such the performance
> > regression seen gets worse the lower the order used for reporting.
> > 
> 
> What confused me quite a lot is that this is enabled at compile time
> and then incurs a performance hit whether there is a hypervisor that
> even cares is involved or not. So I don't think the performance angle
> justifying this approach is a good one because this implementation has
> issues of its own. Previously I said

We are adding code. There is always going to be some performance impact.
Even if we had it implemented in the arch alloc/free code there would be
an impact as there is code there that would have to be jumped over. One
reason why I have been focused on the possible regression is that patch 2
in the series, which only replaced free_area with zone and order was
supposedly having an impact due to code being shifted around.

> 	I worry that poking too much into the internal state of the
> 	allocator will be fragile long-term. There is the arch alloc/free
> 	hooks but they are typically about protections only and does not
> 	interfere with the internal state of the allocator. Compaction
> 	pokes in as well but once the page is off the free list, the page
> 	allocator no longer cares so again there is on interference with
> 	the internal state. If the state is interefered with externally,
> 	it becomes unclear what happens if things like page merging is
> 	deferred in a way the allocator cannot control as high-order
> 	allocation requests may fail for example.
> 
> Adding an API for compaction does not get away from the problem that
> it'll be fragile to depend on the internal state of the allocator for
> correctness. Existing users that poke into the state do so as an
> optimistic shortcut but if it fails, nothing actually breaks. The free
> list reporting stuff might and will not be routinely tested.

I view what I am doing as not being too different from that. I am only
maintaining state when I can. I understand that it makes things more
fragile as we have more routes where things could go wrong, but isn't that
the case with adding any interface. I have simplified this about as far as
it can go.

All I am tracking is basically a pointer into the freelists so that we can
remember where we left off, and adding a flag indicating that those
pointers are there. If the flag is cleared we stop using the pointers. We
can also just reset to the list_head when an individual freelist is
manipulated since the list_head itself will never actually go anywhere.

> Take compaction as an example, the initial implementation of it was dumb
> as rocks and only started maintaining additional state and later poking
> into the page allocator when there was empirical evidence it was necessary.

The issue is that we are trying to do this iteratively. As such I need to
store some sort of state so that once I go off to report the small bunch
of pages I have collected I have some way to quickly get back to where I
was. Without doing that I burn a large number of cycles having to rewalk
the list.

That is why I rewrote this so that we can have the list get completely
shuffled and we don't care as long as our iterator is reset to the
list_head, or the flag indicating that the iterators are active is
cleared.

> The initial implementation of page reporting should be the same, it
> should do no harm at all to users that don't care (hiding behind
> kconfig is not good enough, use static branches) and it should not
> depend on the internal state of the page allocator and ordering of free
> lists for correctness until it's shown it's absolutely necessary.

Perhaps I didn't make it clear, the "Patches applied" I called out in my
cover page was with the kconfig option enabled. The only thing disabled
was that the virtio-balloon did not have the feature enabled in the
hypervisor as I had turned it off in QEMU:

  Test                page_fault1 (THP)     page_fault2
  Baseline         1  1209281.00  +/-0.47%   411314.00  +/-0.42%
                  16  8804587.33  +/-1.80%  3419453.00  +/-1.80%

  Patches applied  1  1209369.67  +/-0.06%   412187.00  +/-0.10%
                  16  8812606.33  +/-0.06%  3435339.33  +/-1.82%

> You say that the zone lock has to be taken in the alternative
> implementation to check if it's still free and sure, that would cost
> but unless you are isolating that page immediately then you are racing
> once the lock is released. If you are isolating immediately, then isolate
> pages in batches to amortise the loock costs.  The details of this could
> be really hard but this approach is essentially saying "everything,
> everywhere should take a small hit so the overhead is not noticeable for
> virtio users" which is a curious choice for a new feature.

The point I am getting at is lock amortization. The alternate approach
cannot really pull it off because it takes the lock, checks for buddy, if
it is present it isolates it, releases the lock, and then goes onto the
next bit. I would think of it as more of a hunt and peck approach to page
reporting as it doesn't know if is operating on stale data or not since
there is logic to set the bit but nothing to clear it.

With my approach we take the lock and then immediately start reporting
pages in batches. The iterator allows us to jump back to where we were in
the list and resume so that we process the entire zone to completion.

> Regardless of the details of any implementation, the first one should be
> basic, do no harm and be relatively simple giving just a bare interface
> to virtio/qemu/etc. Then optimise it until such point as there is no
> chance but to poke into the core.

I tried doing this without the iterator/boundary pointers. Without them I
was running about 20% slower than I am now. That was still about 10%
faster then the alternative approach was with the same page order limit,
however I didn't really feel it was appropriate for us to be running that
much slower with this feature enabled. The code itself is pretty simple.
We are essentially quarantining the pages within the freelist and
reporting them in batches until they are all quarantined. With any other
approach doing it iteratively would be challenging as the last iteration
would require walking all of memory to find the straggler pages.

I was measuring the overhead for us just doing the reporting by disabling
the madvise but leaving the feature enabled int he hypervisor. The output
of that was:

  Test                page_fault1 (THP)     page_fault2
  Patches enabled  1  1209104.67  +/-0.11%   413067.67  +/-0.43%
   MADV disabled  16  8835481.67  +/-0.29%  3463485.67  +/-0.50%

So the reporting of pages itself has little effect. The part that had
noticeable effect was when we actually started performing the madvise.
That pulled us down in the THP case but still didn't generate enough
overhead to show up in the standard 4K pages case:

  Test                page_fault1 (THP)     page_fault2
  Patches enabled  1  1210367.67  +/-0.58%   416962.00  +/-0.14%
                  16  8433236.00  +/-0.58%  3437897.67  +/-0.34%

I do plan to iterate on this once it is accepted as I figure there will be
further things we can do to optimize it. I don't assume that this is the
penultimate solution. However in my mind this is a good starting point for
things as the reporting itself has little impact until we actually start
dropping the pages from the hypervisor and forcing them to be faulted back
into the guest.

Thanks.

- Alex