Re: [RFC PATCH] mm/page_reporting: Adjust threshold according to MAX_ORDER

From: Gavin Shan <gshan@redhat.com>
To: David Hildenbrand <david@redhat.com>,
	Alexander Duyck <alexander.duyck@gmail.com>
Cc: linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	shan.gavin@gmail.com,
	Anshuman Khandual <anshuman.khandual@arm.com>
Subject: Re: [RFC PATCH] mm/page_reporting: Adjust threshold according to MAX_ORDER
Date: Mon, 21 Jun 2021 15:16:54 +1000
Message-ID: <5ee628f8-772c-b1ed-557c-68d6a4a83415@redhat.com>
In-Reply-To: <249e5814-e644-3d82-9b38-232928af4dbd@redhat.com>

On 6/16/21 10:07 PM, David Hildenbrand wrote:
>> Indeed. 512MB pageblocks are rare, especially on systems which have been
>> up and running for a long time.
>>
>> Free page reporting starts from the guest. Taking an extreme case: the guest
>> has 512MB memory and it's backed by one THP on the host. Free page reporting
>> won't work at all.
>>
>> Besides, it seems free page reporting isn't guaranteed to work all the time.
>> For example, on a system with 4KB base page size, freeing individual
>> 4KB pages can't produce a free 2MB pageblock due to fragmentation.
>> In this case, the freed pages won't be reported immediately, but might be
>> reported later, after swapping or compaction due to memory pressure.
> 
> Exactly, it's a pure optimization that won't work, especially when guest memory is heavily fragmented. There has to be a balance between reclaiming free memory in the hypervisor, degrading VM performance, and overhead of the feature.
> 
> Further, there are no guarantees when a VM will reuse the memory again. In the worst case, all VMs that reported free pages reuse memory at the same time. In that case, one definitely needs sufficient backend memory in the hypervisor (-> swap) to not run out of memory, and performance will be degraded.
> 
> As MST once phrased it, if the feature has a higher overhead than swapping in the hypervisor, it's of little use.
> 

Thanks for the explanation, and sorry again for the late response, David. I was
on holiday last week and didn't get much work done.

However, it's still nice to have unused pages returned to the host. These pages
can be used by other VMs or applications running on the host.
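
For reference, the block sizes discussed in this thread follow directly from
the base page size. The snippet below is only a rough user-space sketch of that
arithmetic, not kernel code. The helper names block_bytes() and pmd_order() are
made up for illustration, and it assumes 8-byte page table entries, so one
page-table page holds (base page size / 8) entries.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes covered by one free block of the given order: 2^order base pages. */
static uint64_t block_bytes(unsigned int order, uint64_t base_page_bytes)
{
	assert(order < 32);
	return base_page_bytes << order;
}

/*
 * Order of a PMD-sized block. Assumption: 8-byte page table entries, so a
 * PMD entry maps (base_page_bytes / 8) base pages. Hypothetical helper.
 */
static unsigned int pmd_order(uint64_t base_page_bytes)
{
	unsigned int order = 0;
	uint64_t entries = base_page_bytes / 8;

	while (entries > 1) {
		entries >>= 1;
		order++;
	}
	return order;
}
```

With 4KB base pages this gives order 9 and 2MB blocks. With 64KB base pages it
gives order 13 and 512MB blocks, which is why a pageblock-sized threshold is
so hard to reach in a 64KB guest.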

>>
>> David, how about taking your suggestion to have a different threshold size only
>> for arm64 (64KB base page size)? The threshold will be smaller than pageblock_order
>> for sure. There are two ways to do so; please let me know which is the preferred
>> way to go if you (and Alex) agree to do it.
>>
>> (a) Introduce CONFIG_PAGE_REPORTING_ORDER for individual archs to choose the
>>       value. The threshold falls back to pageblock_order if it isn't configured.
>> (b) Rename PAGE_REPORTING_MIN_ORDER to PAGE_REPORTING_ORDER. Each arch can decide
>>       its value. If it's not provided by the arch, it falls back to pageblock_order.
>>
> 
> I wonder if we could further define it as a (module/cmdline) parameter and make it configurable when booting. The default could then be set based on CONFIG_PAGE_REPORTING_ORDER. CONFIG_PAGE_REPORTING_ORDER would default to pageblock_order (if easily possible) and could be special-cased to arm64 with 64k.
> 

The formal patches have been posted for review. I used the macro PAGE_REPORTING_ORDER
instead of CONFIG_PAGE_REPORTING_ORDER. The page reporting order (threshold)
is also exported as a module parameter, as you suggested.
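
To illustrate the idea only (this is a user-space sketch, not the posted patch,
and the real code may differ): the threshold defaults to PAGE_REPORTING_ORDER,
which falls back to the pageblock order when the arch doesn't define it, and a
parameter can override it. PAGEBLOCK_ORDER, MAX_ORDER and their values, and the
function reporting_order(), are stand-ins assumed here for an arm64 guest with
64KB base pages.

```c
#include <assert.h>

/* Hypothetical stand-ins for kernel values (arm64, 64KB base pages). */
#define MAX_ORDER	14
#define PAGEBLOCK_ORDER	13

/* An arch may define PAGE_REPORTING_ORDER; otherwise use the pageblock order. */
#ifndef PAGE_REPORTING_ORDER
#define PAGE_REPORTING_ORDER	PAGEBLOCK_ORDER
#endif

/*
 * Pick the reporting threshold. A parameter value in [0, MAX_ORDER) wins;
 * anything else (e.g. -1 for "unset") selects the compile-time default.
 */
static unsigned int reporting_order(int param)
{
	assert(PAGE_REPORTING_ORDER < MAX_ORDER);

	if (param >= 0 && param < MAX_ORDER)
		return (unsigned int)param;
	return PAGE_REPORTING_ORDER;
}
```

So reporting_order(-1) yields the default, while reporting_order(5) would
report free blocks of 2^5 base pages (2MB with 64KB pages) and larger.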

>> By the way, I recently did some performance testing on different page sizes.
>> We get much more performance gain from 64KB (vs 4KB) page size in the guest than
>> from 512MB (vs 2MB) THP on the host. It means the performance won't be affected
>> too much even if the 512MB THP is split on the arm64 host.
> 
> Yes, if one is even able to get 512MB THP populated in the hypervisor -- because once again, 512MB THP are just a bad fit for many workloads.
> 

Yeah, indeed :)

Thanks,
Gavin