Re: + mm-introduce-reported-pages.patch added to -mm tree

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>,
	David Hildenbrand <david@redhat.com>,
	 akpm@linux-foundation.org, aarcange@redhat.com,
	dan.j.williams@intel.com,  dave.hansen@intel.com,
	konrad.wilk@oracle.com, lcapitulino@redhat.com,
	 mm-commits@vger.kernel.org, mst@redhat.com, osalvador@suse.de,
	pagupta@redhat.com,  pbonzini@redhat.com, riel@surriel.com,
	vbabka@suse.cz, wei.w.wang@intel.com,  willy@infradead.org,
	yang.zhang.wz@gmail.com, linux-mm@kvack.org
Subject: Re: + mm-introduce-reported-pages.patch added to -mm tree
Date: Fri, 08 Nov 2019 08:17:49 -0800	[thread overview]
Message-ID: <4c3e787b5db923de8599eafee46c1235eacd2432.camel@linux.intel.com> (raw)
In-Reply-To: <20191108094340.GT3016@techsingularity.net>

On Fri, 2019-11-08 at 09:43 +0000, Mel Gorman wrote:
> On Thu, Nov 07, 2019 at 08:07:02AM -0800, Alexander Duyck wrote:
> > > And what I'm saying is that the initial version should have focused
> > > exclusively on the mechanism to report free pages and kept the search as
> > > simple as possible with optimisations on top. By all means track pages
> > > that are already reported and skip them because that is a relatively basic
> > > operation with the caveat that even the page_reported checks should be
> > > behind a static branch if there is no virtio balloon driver loaded.
> > 
> > I can split out the list manipulation logic and put it behind a static
> > branch, however I cannot do that with page_reported. The issue is that
> > with keeping any state in the page we need to clear it when the page is
> > allocated. I could do the static branch approach if we were to clear the
> > bit for all pages, but I thought it better to just do a test and clear in
> > order to avoid an unnecessary read-modify-write.
> > 
> 
> A test and clear (the non-atomic variety) might be fine but I don't see
> why the page_reported or zone flag test test cannot be behind a static
> branch. If no virtio balloon driver with page reporting exists then it
> does not matter if we track the state. If it's loaded for the first time,
> no page has been updated but that's also ok, it now gets to start. Maybe
> you are thinking about what happens if the driver is unloaded? If that
> is your concern, do not disable the static branch once it is enabled. I
> don't think we need to worry about handling driver unloading efficiently.

Yeah, the static branch confusion was a bit of rigid thinking on my part.
Specifically I was thinking that if the virtio-device is disabled I needed
to disable the static key. Instead I have already updated the code so that
we just enable the branch the first time, and then rely on the ph_dev
being set to NULL disabling reporting when the interface is unregistered.

> > > The page reporting to a hypervisor is done out of band from any application
> > > so even if it takes a little longer to do that reporting, is there an
> > > impact really? If so, how much and what frequency is this impact incurred?
> > 
> > My main concern has been time with the zone lock held in order to extract
> > the page from the list. Any time while we are holding the zone lock is
> > time when another CPU cannot free or allocate a page from that zone.
> > Without the pointers I have to walk the free lists until I find a non-
> > reported page in the zone. That is the overhead I am avoiding by
> > maintaining the pointers since we have to walk the free lists for every
> > batch of pages we want to pull.
> > 
> 
> I'm not too worried about the zone lock hold time to be honest as long
> as it's not long enough to cause soft lockups. The very nature of the
> feature is that there is a performance penalty to favour packing more
> VMs into a physical host. Even if the zone lock hold wasn't a factor,
> the reallocation of pages after they have been freed by the hypervisor
> will be a slowdown. Anyone using this feature has already accepted a
> performance hit and this is why I'm not massively concerned about the
> performance of the reporting itself.

Okay.

<snip>

> > > > > What confused me quite a lot is that this is enabled at compile time
> > > > > and then incurs a performance hit whether there is a hypervisor that
> > > > > even cares is involved or not. So I don't think the performance angle
> > > > > justifying this approach is a good one because this implementation has
> > > > > issues of its own. Previously I said
> > > > 
> > > > We are adding code. There is always going to be some performance impact.
> > > 
> > > And that impact should be negligible in so far as possible, parts of
> > > it are protected by branches but as it stands, just building the virtio
> > > balloon driver enables all of this whether the running system is using
> > > virtual machines or not.
> > 
> > Agreed the code is present, however that would be the case with a static
> > branch as well. I tried to use static branches where I thought it made
> > sense and to just use the page_reported test for the cases where I
> > couldn't. I can probably add the static branch to a few more spots however
> > I am not sure it adds much value when it is already behind a page_reported
> > check.
> > 
> 
> It's simply an ideal from my perspective. Other features like cpusets
> have zero impact if they are not in use.

Got it.

> > > > > Adding an API for compaction does not get away from the problem that
> > > > > it'll be fragile to depend on the internal state of the allocator for
> > > > > correctness. Existing users that poke into the state do so as an
> > > > > optimistic shortcut but if it fails, nothing actually breaks. The free
> > > > > list reporting stuff might and will not be routinely tested.
> > > > 
> > > > I view what I am doing as not being too different from that. I am only
> > > > maintaining state when I can. I understand that it makes things more
> > > > fragile as we have more routes where things could go wrong, but isn't that
> > > > the case with adding any interface. I have simplified this about as far as
> > > > it can go.
> > > > 
> > > 
> > > The enhanced tracking of state is not necessary initially to simply report
> > > free pages to the hypervisor. 
> > 
> > Define enhanced tracking?
> 
> Primarily the list manipulations and boundary pointers.

Okay, those are going to be pulled out into a separate patch. What I may
do is merge patches 3 and 4, and then pull the list manipulation and
boundary logic back out into a 4th patch since pulling out the list
manipulation will shrink patch 3 by quite a bit.

> > If you are talking about the PageReported bit
> > then I would disagree. I need to have some way to indicate that the free
> > page has been reported or not if I am going to make forward progress while
> > performing the reporting in iterations.
> 
> I'm not concerned about the PageReported bit. I fully accept that we
> should not be isolating and re-reporting pages that have already been
> handled by the hypervisor.

Okay. Sounds good.

> > > And again, it's not clear that the additional complexity is required.
> > > Specifically, it'll be very hard to tell if the state tracking is
> > > actually helping because excessive list shuffling due to compaction may
> > > mean that a lot of state is being tracked while the full free list ends
> > > up having to be searched anyway.
> > > 
> > > As it stands, the optimisations are hard-wired into the providing of
> > > the feature itself making it an all or nothing approach and no option to
> > > "merge a bare-bones mechanism that is behind static branches and build
> > > optimisations on top". At least that way, any competing approach could
> > > be based on optimisations alone while the feature is still available.
> > 
> > If needed I can split another patch out of patches 3 and 4 that does the
> > list manipulation. It shouldn't take all that long to do as I have already
> > done it once for testing to get the 20% regression number.
> 
> From your perspective, I see it's a bit annoying because in the final
> result, the code should be identical. However, it'll be a lot clearer
> during review what is required, what level of complexity optimisations
> add and the performance of it. The changelog should include what metric
> you are using to evaluate the performance, the test case and the delta. It
> also will be easier from a debugging perspective as minimally a bisection
> could identify if a bug was due to the core mechanism itself or one of
> the optimisations.  Finally, it leaves open the possibility that someone
> can evaluate a completely different set of optimisations. Whatever the
> alternative approaches are, the actual interface to virtio ballon surely
> is the same (I don't actually know, I just can't see why the virtio ABI
> would be depend on how the pages are isolated, tracked and reported).

The virtio-balloon interface is the same at this point between my solution
and Nitesh's. So the only real disagreement in terms of the two solutions
is about keeping the bit in the page and the list manipulation versus the
external bitmap and the hunt and peck approach.