Re: + mm-introduce-reported-pages.patch added to -mm tree

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>,
	David Hildenbrand <david@redhat.com>,
	 akpm@linux-foundation.org, aarcange@redhat.com,
	dan.j.williams@intel.com,  dave.hansen@intel.com,
	konrad.wilk@oracle.com, lcapitulino@redhat.com,
	 mm-commits@vger.kernel.org, mst@redhat.com, osalvador@suse.de,
	pagupta@redhat.com,  pbonzini@redhat.com, riel@surriel.com,
	vbabka@suse.cz, wei.w.wang@intel.com,  willy@infradead.org,
	yang.zhang.wz@gmail.com, linux-mm@kvack.org
Subject: Re: + mm-introduce-reported-pages.patch added to -mm tree
Date: Thu, 07 Nov 2019 08:07:02 -0800	[thread overview]
Message-ID: <73c0477e6e1672d5d36d4aa673b64cda10590a9d.camel@linux.intel.com> (raw)
In-Reply-To: <20191107102045.GS3016@techsingularity.net>

On Thu, 2019-11-07 at 10:20 +0000, Mel Gorman wrote:
> On Wed, Nov 06, 2019 at 04:20:56PM -0800, Alexander Duyck wrote:
> > > > I get that. But v10 was posted in mid September. Back then we had a
> > > > discussion about addressing what Mel had mentioned and I had mentioned
> > > > then that I had addressed it by allowing compaction to essentially reset
> > > > the reporter to get it out of the list so compaction could do this split
> > > > and splice tumbling logic.
> > > > 
> > > 
> > > At the time I said "While in theory that could be addressed by always
> > > going through an interface maintained by the page allocator, it would be
> > > tricky to test the virtio case in particular."
> > > 
> > > Now, I get that you added an interface for that *but* if compaction was
> > > ever updated or yet another approach was taken to deal with it, virtio
> > > could get broken. If the page allocator itself has a bug or compaction
> > > has a bug, the effect is very visible. While stuff does slip through, it
> > > tends to be an obscure corner case. However, when virtio gets broken,
> > > it'll be a long time before we get it.
> > 
> > Specifically all we are doing is walking through a list using a pointer as
> > an iterator.  I would think we would likely see the page reporting blow up
> > pretty quickly if we were to somehow mess up the list so bad that it still
> > had access to a page that was no longer on the list. Other than that if
> > the list is just shuffled without resetting the pointer then worst case
> > would be that we end up with the reporting being rearmed as soon as we
> > were complete due to a batch of unreported pages being shuffled past the
> > iterator.
> > 
> 
> And what I'm saying is that the initial version should have focused
> exclusively on the mechanism to report free pages and kept the search as
> simple as possible with optimisations on top. By all means track pages
> that are already reported and skip them because that is a relatively basic
> operation with the caveat that even the page_reported checks should be
> behind a static branch if there is no virtio balloon driver loaded.

I can split out the list manipulation logic and put it behind a static
branch, however I cannot do that with page_reported. The issue is that
with keeping any state in the page we need to clear it when the page is
allocated. I could do the static branch approach if we were to clear the
bit for all pages, but I thought it better to just do a test and clear in
order to avoid an unnecessary read-modify-write.

> The page reporting to a hypervisor is done out of band from any application
> so even if it takes a little longer to do that reporting, is there an
> impact really? If so, how much and what frequency is this impact incurred?

My main concern has been time with the zone lock held in order to extract
the page from the list. Any time while we are holding the zone lock is
time when another CPU cannot free or allocate a page from that zone.
Without the pointers I have to walk the free lists until I find a non-
reported page in the zone. That is the overhead I am avoiding by
maintaining the pointers since we have to walk the free lists for every
batch of pages we want to pull.

> The primary impact I can see is that free pages are invisible while the
> notification takes place so the window for that may be wider if it takes
> longer to find enough pages to send in batch but there will always be a
> window where the free pages are invisible.

You are correct. There is a window where the pages will be invisible. The
hope is that the impact of that should be fairly small since we only pull
a few pages out at a time. In addition we will not pull past the watermark
since the pages are pulled via __isolate_free_page.

> > > <SNIP>
> > > 
> > > What confused me quite a lot is that this is enabled at compile time
> > > and then incurs a performance hit whether there is a hypervisor that
> > > even cares is involved or not. So I don't think the performance angle
> > > justifying this approach is a good one because this implementation has
> > > issues of its own. Previously I said
> > 
> > We are adding code. There is always going to be some performance impact.
> 
> And that impact should be negligible in so far as possible, parts of
> it are protected by branches but as it stands, just building the virtio
> balloon driver enables all of this whether the running system is using
> virtual machines or not.

Agreed the code is present, however that would be the case with a static
branch as well. I tried to use static branches where I thought it made
sense and to just use the page_reported test for the cases where I
couldn't. I can probably add the static branch to a few more spots however
I am not sure it adds much value when it is already behind a page_reported
check.

> > > Adding an API for compaction does not get away from the problem that
> > > it'll be fragile to depend on the internal state of the allocator for
> > > correctness. Existing users that poke into the state do so as an
> > > optimistic shortcut but if it fails, nothing actually breaks. The free
> > > list reporting stuff might and will not be routinely tested.
> > 
> > I view what I am doing as not being too different from that. I am only
> > maintaining state when I can. I understand that it makes things more
> > fragile as we have more routes where things could go wrong, but isn't that
> > the case with adding any interface. I have simplified this about as far as
> > it can go.
> > 
> 
> The enhanced tracking of state is not necessary initially to simply report
> free pages to the hypervisor. 

Define enhanced tracking? If you are talking about the PageReported bit
then I would disagree. I need to have some way to indicate that the free
page has been reported or not if I am going to make forward progress while
performing the reporting in iterations. If you are talking about the list
manipulation and boundary pointers then I agree. However there is a pretty
significant performance impact for not having those as we have to rewalk
the free lists for every batch of pages we pull.

> > All I am tracking is basically a pointer into the freelists so that we can
> > remember where we left off, and adding a flag indicating that those
> > pointers are there. If the flag is cleared we stop using the pointers. We
> > can also just reset to the list_head when an individual freelist is
> > manipulated since the list_head itself will never actually go anywhere.
> > 
> > > Take compaction as an example, the initial implementation of it was dumb
> > > as rocks and only started maintaining additional state and later poking
> > > into the page allocator when there was empirical evidence it was necessary.
> > 
> > The issue is that we are trying to do this iteratively. As such I need to
> > store some sort of state so that once I go off to report the small bunch
> > of pages I have collected I have some way to quickly get back to where I
> > was. Without doing that I burn a large number of cycles having to rewalk
> > the list.
> > 
> > That is why I rewrote this so that we can have the list get completely
> > shuffled and we don't care as long as our iterator is reset to the
> > list_head, or the flag indicating that the iterators are active is
> > cleared.
> > 
> 
> And again, it's not clear that the additional complexity is required.
> Specifically, it'll be very hard to tell if the state tracking is
> actually helping because excessive list shuffling due to compaction may
> mean that a lot of state is being tracked while the full free list ends
> up having to be searched anyway.
> 
> As it stands, the optimisations are hard-wired into the providing of
> the feature itself making it an all or nothing approach and no option to
> "merge a bare-bones mechanism that is behind static branches and build
> optimisations on top". At least that way, any competing approach could
> be based on optimisations alone while the feature is still available.

If needed I can split another patch out of patches 3 and 4 that does the
list manipulation. It shouldn't take all that long to do as I have already
done it once for testing to get the 20% regression number. Also as I
mentioned I can put the list manpiulation behind a static key, but cannot
do so for the actual testing and clearing of the page reporting bit.

> It also doesn't help that reviewing the series takes a lot of jumping
> around. This patch has a kfree of a structure that is not allocated
> until a later patch for example.

Sorry about that. Patches 3 and 4 were one patch until I had split them. I
can fix that so that the page_reporting_reset_zone is either moved to
patch 4 where we actually perform the allocation, or I will move the
allocation to patch 3.