Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

From: John Hubbard <jhubbard@nvidia.com>
To: David Hildenbrand <david@redhat.com>, Yan Zhao <yan.y.zhao@intel.com>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	<kvm@vger.kernel.org>, <pbonzini@redhat.com>, <seanjc@google.com>,
	<mike.kravetz@oracle.com>, <apopple@nvidia.com>, <jgg@nvidia.com>,
	<rppt@kernel.org>, <akpm@linux-foundation.org>,
	<kevin.tian@intel.com>
Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
Date: Fri, 11 Aug 2023 11:20:46 -0700	[thread overview]
Message-ID: <1ad2c33d-95e1-49ec-acd2-ac02b506974e@nvidia.com> (raw)
In-Reply-To: <6b48a161-257b-a02b-c483-87c04b655635@redhat.com>

On 8/11/23 10:25, David Hildenbrand wrote:
...
> One issue is that folio_maybe_pinned...() ... is unreliable as soon as your page is mapped more than 1024 times.
> 
> One might argue that we also want to exclude pages that are mapped that often. That might possibly work.

Yes.

>>
>>> Staring at page #2, are we still missing something similar for THPs?
>> Yes.
>>
>>> Why is that MMU notifier thingy and touching KVM code required?
>> Because NUMA balancing code will firstly send .invalidate_range_start() with
>> event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range()
>> unconditionally, before it goes down into change_pte_range() and
>> change_huge_pmd() to check each page count and apply PROT_NONE.
> 
> Ah, okay I see, thanks. That's indeed unfortunate.

Sigh. All this difficulty reminds me that this mechanism was created in
the early days of NUMA. I wonder sometimes lately whether the cost, in
complexity and CPU time, is still worth it on today's hardware.

But of course I am deeply biased, so don't take that too seriously.
See below. :)

> 
>>
>> Then current KVM will unmap all notified pages from secondary MMU
>> in .invalidate_range_start(), which could include pages that finally not
>> set to PROT_NONE in primary MMU.
>>
>> For VMs with pass-through devices, though all guest pages are pinned,
>> KVM still periodically unmap pages in response to the
>> .invalidate_range_start() notification from auto NUMA balancing, which
>> is a waste.
> 
> Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are 
similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
> 

Yes, NUMA balancing is incredibly harmful to performance, for GPU and
accelerators that map memory...and VMs as well, it seems. Basically,
anything that has its own processors and page tables needs to be left
strictly alone by NUMA balancing. Because the kernel is (still, even
today) unaware of what those processors are doing, and so it has no way
to do productive NUMA balancing.

thanks,
-- 
John Hubbard
NVIDIA