Re: [RFC PATCH] mm/page_reporting: Adjust threshold according to MAX_ORDER

From: David Hildenbrand <david@redhat.com>
To: Gavin Shan <gshan@redhat.com>,
	Alexander Duyck <alexander.duyck@gmail.com>
Cc: linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	shan.gavin@gmail.com,
	Anshuman Khandual <anshuman.khandual@arm.com>
Subject: Re: [RFC PATCH] mm/page_reporting: Adjust threshold according to MAX_ORDER
Date: Wed, 16 Jun 2021 14:07:35 +0200
Message-ID: <249e5814-e644-3d82-9b38-232928af4dbd@redhat.com>
In-Reply-To: <3adbcad8-1016-cf48-4574-799de0bba6e4@redhat.com>

> Indeed. 512MB pageblocks are rare, especially on systems which have been
> up and running for a long time.
> 
> The free page reporting starts from guest. Taking an extreme case: guest has
> 512MB memory and it's backed by one THP on host. The free page reporting won't
> work at all.
> 
> Besides, it seems free page reporting isn't guaranteed to work all the time.
> For example, on a system where we have 4KB base page size. Freeing individual
> 4KB pages can't come up with a free 2MB pageblock due to fragmentation.
> In this case, the free'd page won't be reported immediately, but might be
> reported after swapping or compaction due to memory pressure. The free page
> isn't reported immediately at least.

Exactly, it's a pure optimization that won't always work, especially when
guest memory is heavily fragmented. There has to be a balance between
reclaiming free memory in the hypervisor, degrading VM performance, and
the overhead of the feature.

Further, there are no guarantees when a VM will reuse the memory again. 
In the worst case, all VMs that reported free pages reuse memory at the 
same time. In that case, one definitely needs sufficient backend memory 
in the hypervisor (-> swap) to not run out of memory, and performance 
will be degraded.

As MST once phrased it, if the feature has a higher overhead than 
swapping in the hypervisor, it's of little use.

> 
> David, how about taking your suggestion to have different threshold size only
> for arm64 (64KB base page size). The threshold will be smaller than pageblock_order
> for sure. There are two ways to do so and please let me know which is the preferred
> way to go if you (and Alex) agree to do it.
> 
> (a) Introduce CONFIG_PAGE_REPORTING_ORDER for individual archs to choose the
>       value. The threshold falls back to pageblock_order if it isn't configured.
> (b) Rename PAGE_REPORTING_MIN_ORDER to PAGE_REPORTING_ORDER. archs can decide
>       its value. If it's not provided by arch, it falls back to pageblock_order.
> 

I wonder if we could further define it as a (module/cmdline) parameter 
and make it configurable when booting. The default could then be set 
based on CONFIG_PAGE_REPORTING_ORDER. CONFIG_PAGE_REPORTING_ORDER would 
default to pageblock_order (if easily possible) and could be 
special-cased for arm64 with 64k.
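
A rough user-space sketch of that fallback chain, not kernel code: the
boot parameter wins if it is valid, otherwise CONFIG_PAGE_REPORTING_ORDER,
otherwise pageblock_order. PAGEBLOCK_ORDER, DEFAULT_REPORTING_ORDER and
resolve_reporting_order() are hypothetical names made up for illustration,
and the value 13 assumes arm64 with 64k base pages.

```c
#include <stdlib.h>

/* Hypothetical stand-in for pageblock_order (assumed: arm64, 64k pages). */
#define PAGEBLOCK_ORDER 13

/* An arch/Kconfig could define CONFIG_PAGE_REPORTING_ORDER; if it does
 * not, the build-time default falls back to the pageblock order. */
#ifdef CONFIG_PAGE_REPORTING_ORDER
#define DEFAULT_REPORTING_ORDER CONFIG_PAGE_REPORTING_ORDER
#else
#define DEFAULT_REPORTING_ORDER PAGEBLOCK_ORDER
#endif

/*
 * Pick the effective reporting order.
 * param:         boot/module parameter string, or NULL if not given
 * build_default: build-time default (e.g. DEFAULT_REPORTING_ORDER)
 * max_order:     exclusive upper bound for a valid order
 * A missing, malformed, or out-of-range parameter yields build_default.
 */
static unsigned int resolve_reporting_order(const char *param,
                                            unsigned int build_default,
                                            unsigned int max_order)
{
    char *end;
    unsigned long val;

    if (!param || !*param)
        return build_default;

    val = strtoul(param, &end, 10);
    if (*end != '\0' || val >= max_order)
        return build_default;

    return (unsigned int)val;
}
```

With no parameter given, resolve_reporting_order(NULL,
DEFAULT_REPORTING_ORDER, 14) returns the build-time default; passing "5"
would lower the threshold to order 5.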

> By the way, I recently had some performance testing on different page sizes.
> We get much more performance gain from 64KB (vs 4KB) page size in guest than
> 512MB (vs 2MB) THP on host. It means the performance won't be affected too
> much even if the 512MB THP is split on an arm64 host.

Yes, if one is even able to get 512MB THP populated in the hypervisor -- 
because once again, 512MB THP are just a bad fit for many workloads.
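
For reference, the 2MB and 512MB figures in this thread follow directly
from base page size and order. A small sketch of the arithmetic;
block_bytes() is a hypothetical helper, and the shift/order pairs in the
note below (4k pages with order 9, 64k pages with order 13) are my
assumptions about the usual PMD-sized configurations.

```c
#include <stdint.h>

/*
 * Size in bytes of a block of 2^order contiguous pages, where each base
 * page is 2^page_shift bytes (12 for 4k pages, 16 for 64k pages).
 */
static uint64_t block_bytes(unsigned int page_shift, unsigned int order)
{
    return (uint64_t)1 << (page_shift + order);
}
```

block_bytes(12, 9) gives 2MB, while block_bytes(16, 13) gives 512MB, so
with 64k pages the reporting threshold at pageblock_order is 256 times
larger than with 4k pages.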

-- 
Thanks,

David / dhildenb