Re: + mm-introduce-reported-pages.patch added to -mm tree

From: Nitesh Narayan Lal <nitesh@redhat.com>
To: Michal Hocko <mhocko@kernel.org>, akpm@linux-foundation.org
Cc: aarcange@redhat.com, alexander.h.duyck@linux.intel.com,
	dan.j.williams@intel.com, dave.hansen@intel.com,
	david@redhat.com, konrad.wilk@oracle.com, lcapitulino@redhat.com,
	mgorman@techsingularity.net, mm-commits@vger.kernel.org,
	mst@redhat.com, osalvador@suse.de, pagupta@redhat.com,
	pbonzini@redhat.com, riel@surriel.com, vbabka@suse.cz,
	wei.w.wang@intel.com, willy@infradead.org,
	yang.zhang.wz@gmail.com, linux-mm@kvack.org
Subject: Re: + mm-introduce-reported-pages.patch added to -mm tree
Date: Mon, 11 Nov 2019 13:52:11 -0500
Message-ID: <d8a81439-10bf-a0ff-ded3-88c0dca964bb@redhat.com>
In-Reply-To: <20191106121605.GH8314@dhcp22.suse.cz>

On 11/6/19 7:16 AM, Michal Hocko wrote:
> I didn't have time to read through newer versions of this patch series
> but I remember there were concerns about this functionality being pulled
> into the page allocator previously both by me and Mel [1][2]. Have those been 
> addressed? I do not see an ack from Mel or any other MM people. Is there
> really a consensus that we want something like that living in the
> allocator?
>
> There has also been a different approach discussed and from [3]
> (referenced by the cover letter) I can only see
>
> : Then Nitesh's solution had changed to the bitmap approach[7]. However it
> : has been pointed out that this solution doesn't deal with sparse memory,
> : hotplug, and various other issues.
>
> which looks more like something to be done than a fundamental
> roadblocks.
>
> [1] http://lkml.kernel.org/r/20190912163525.GV2739@techsingularity.net
> [2] http://lkml.kernel.org/r/20190912091925.GM4023@dhcp22.suse.cz
> [3] http://lkml.kernel.org/r/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com
>
[...]

Hi,

I ran some experiments to find the root cause of the performance
degradation Alexander reported with my v12 patch-set. [1]

Here is a brief background of the previous discussion under v12
(Alexander can correct me if I am missing something).
Alexander pointed out two issues with my v12 posting [2]
(excluding the sparse zone and memory hotplug/hotremove support):

- A crash which was caused because I was not using spin_lock_irqsave()
  (the fix suggestion came from Alexander).

- Performance degradation with Alexander's suggested setup, which uses a
  modified will-it-scale/page_fault with THP, CONFIG_SLAB_FREELIST_RANDOM &
  CONFIG_SHUFFLE_PAGE_ALLOCATOR. With (MAX_ORDER - 2) as the
  PAGE_REPORTING_MIN_ORDER, I also observed significant performance degradation
  (around 20% in the number of threads launched on the 16th vCPU). However, on
  switching the PAGE_REPORTING_MIN_ORDER to (MAX_ORDER - 1), I was able to get
  performance similar to what Alexander is reporting.

PAGE_REPORTING_MIN_ORDER is the minimum order a free page must have to be
captured in the bitmap and reported to the hypervisor.
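
As a rough illustration of what this threshold means, the C sketch below
encodes the capture rule and the block size it implies. The constants are
assumptions (MAX_ORDER of 11, pageblock_order of 9 and 4 KiB pages, as is
common on x86-64), and page_is_reportable() and order_to_bytes() are
hypothetical helper names, not functions from the patch-set.

```c
#include <stdbool.h>

/* Assumed values for illustration only; they are configuration dependent. */
#define MAX_ORDER        11
#define PAGEBLOCK_ORDER  9      /* stands in for pageblock_order */
#define PAGE_SIZE_BYTES  4096UL

/* Hypothetical helper: a free page is captured only at or above min_order. */
static bool page_is_reportable(unsigned int order, unsigned int min_order)
{
	return order >= min_order;
}

/* Size in bytes of one free block of the given order. */
static unsigned long order_to_bytes(unsigned int order)
{
	return PAGE_SIZE_BYTES << order;
}
```

Under these assumptions, (MAX_ORDER - 1) corresponds to 4 MiB blocks, while
pageblock_order and (MAX_ORDER - 2) both correspond to 2 MiB blocks.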

For the discussion comparing the two series, the performance aspect is the
more relevant and important one.
It turns out that with the current implementation the number of vmexits with
PAGE_REPORTING_MIN_ORDER set to pageblock_order or (MAX_ORDER - 2) is
significantly larger than with (MAX_ORDER - 1).

One reason could be that the lower order pages are not getting sufficient
time to merge with each other; as a result, they end up being reported
in 2 separate reporting requests, which generates more vmexits. With
(MAX_ORDER - 1) we don't have that kind of situation, as I never try
to report any page which has order < (MAX_ORDER - 1).
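
The suspected effect can be pictured with a toy model. This is only a
sketch under simplifying assumptions: it counts reported entries rather than
real vmexits, it ignores batching, and eager_reports() and deferred_reports()
are hypothetical names that do not exist in either series. Each array slot
stands for one block of order (MAX_ORDER - 2), and slots 2i and 2i+1 are
treated as buddies.

```c
#include <stdbool.h>
#include <stddef.h>

/* Eager policy: every freed block is reported on its own, right away. */
static size_t eager_reports(const bool freed[], size_t n)
{
	size_t reports = 0;

	for (size_t i = 0; i < n; i++)
		if (freed[i])
			reports++;
	return reports;
}

/*
 * Deferred policy: buddies get time to merge, and only a fully merged
 * pair (one block of the next higher order) is reported, as one entry.
 */
static size_t deferred_reports(const bool freed[], size_t n)
{
	size_t reports = 0;

	for (size_t i = 0; i + 1 < n; i += 2)
		if (freed[i] && freed[i + 1])
			reports++;
	return reports;
}
```

With seven of eight blocks free, the eager policy produces seven entries and
the deferred policy three, at the cost of leaving one unpaired free block
unreported.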

To fix this, I might have to further limit the reporting, which could allow
the lower order pages to merge further and hence reduce the vmexits. I will
try to do some experiments to see if I can fix this. In any case, if anyone
has a suggestion I would be more than happy to look in that direction.

Following are the numbers I gathered on a 30GB, single-NUMA-node, 16-vCPU
guest affined to a single host NUMA node:

On 16th vCPU:
With PAGE_REPORTING_MIN_ORDER as (MAX_ORDER - 1):
% Dip on the number of Processes = 1.3 %
% Dip on the number of  Threads  = 5.7 %

With PAGE_REPORTING_MIN_ORDER as (pageblock_order):
% Dip on the number of Processes = 5 %
% Dip on the number of  Threads  = 20 %
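
For reference, a dip here can be read as the relative drop against a
baseline run. A minimal sketch, assuming that definition; percent_dip() is a
hypothetical helper, and the sample values in the note below are made up for
illustration, not measurements:

```c
/* Hypothetical helper: relative drop of `measured` below `baseline`, in %. */
static double percent_dip(double baseline, double measured)
{
	return 100.0 * (baseline - measured) / baseline;
}
```

For example, percent_dip(1000.0, 800.0) evaluates to 20.0.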

Michal's suggestion:
I was able to get a prototype that uses the page-isolation API
(start_isolate_page_range()/undo_isolate_page_range()) to work.
But the issue mentioned above was also evident with it.

Hence, I think that before deciding whether to use __isolate_free_page(),
which isolates pages from the buddy, or start/undo_isolate_page_range(),
which just marks the pages as MIGRATE_ISOLATE, it is important for me to
resolve the above-mentioned issue.

Previous discussions:
More about how we ended up with these two approaches can be found in [3] &
[4], as explained by Alexander & David.

[1] https://lore.kernel.org/lkml/20190812131235.27244-1-nitesh@redhat.com/
[2] https://lkml.org/lkml/2019/10/2/425
[3] https://lkml.org/lkml/2019/10/23/1166
[4] https://lkml.org/lkml/2019/9/12/48

-- 
Thanks
Nitesh