From: David Hildenbrand <david@redhat.com>
To: John Hubbard <jhubbard@nvidia.com>, Yan Zhao <yan.y.zhao@intel.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
	mike.kravetz@oracle.com, apopple@nvidia.com, jgg@nvidia.com,
	rppt@kernel.org, akpm@linux-foundation.org, kevin.tian@intel.com,
	Mel Gorman <mgorman@techsingularity.net>
Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
Date: Fri, 11 Aug 2023 20:39:46 +0200
Message-ID: <846e9117-1f79-a5e0-1b14-3dba91ab8033@redhat.com>
In-Reply-To: <1ad2c33d-95e1-49ec-acd2-ac02b506974e@nvidia.com>

>> Ah, okay I see, thanks. That's indeed unfortunate.
> 
> Sigh. All this difficulty reminds me that this mechanism was created in
> the early days of NUMA. I wonder sometimes lately whether the cost, in
> complexity and CPU time, is still worth it on today's hardware.
> 
> But of course I am deeply biased, so don't take that too seriously.
> See below. :)

:)

>>
>>>
>>> Then current KVM will unmap all notified pages from the secondary MMU
>>> in .invalidate_range_start(), which could include pages that are
>>> ultimately not set to PROT_NONE in the primary MMU.
>>>
>>> For VMs with pass-through devices, though all guest pages are pinned,
>>> KVM still periodically unmaps pages in response to the
>>> .invalidate_range_start() notification from auto NUMA balancing, which
>>> is a waste.
>>
>> Should we want to disable NUMA hinting for such VMAs instead (for
>> example, by QEMU/the hypervisor, which knows that any NUMA hinting
>> activity on these ranges would be a complete waste of time)? I recall
>> that John H. once mentioned that there are similar issues with GPU
>> memory: NUMA hinting is actually counter-productive and they end up
>> disabling it.
>>
> 
> Yes, NUMA balancing is incredibly harmful to performance, for GPU and
> accelerators that map memory...and VMs as well, it seems. Basically,
> anything that has its own processors and page tables needs to be left
> strictly alone by NUMA balancing. Because the kernel is (still, even
> today) unaware of what those processors are doing, and so it has no way
> to do productive NUMA balancing.

Is there any existing way we could handle that better on a per-VMA 
level, or on the process level? Any magic toggles?

MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might
be better, but with things like io_uring it would eventually still be
too restrictive.

I recall that setting a mempolicy could prevent automatic NUMA balancing
from becoming active, but that might be undesired.
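
A minimal sketch of what the mempolicy approach could look like from
userspace (for example, in a hypervisor), assuming that an explicit
policy on a range really does keep automatic NUMA balancing away from
it, which is only recalled above and not verified here. The helper names
sketch_node_mask() and sketch_bind_range() are hypothetical; the only
real interface used is the mbind(2) system call, invoked directly so
that no libnuma is needed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors the value of MPOL_BIND in include/uapi/linux/mempolicy.h. */
#define SKETCH_MPOL_BIND 2

/* Hypothetical helper: one-word node mask with only `node` set. */
static unsigned long sketch_node_mask(int node)
{
	/* Kept conservatively below the word size of the mask. */
	assert(node >= 0 && node < (int)(sizeof(unsigned long) * 8) - 1);
	return 1UL << node;
}

/*
 * Hypothetical helper: give [addr, addr + len) an explicit MPOL_BIND
 * policy for `node`. addr must be page-aligned.
 * Returns 0 on success, -errno on failure.
 */
static int sketch_bind_range(void *addr, size_t len, int node)
{
	unsigned long mask = sketch_node_mask(node);
	unsigned long maxnode = sizeof(mask) * 8;

	if (syscall(SYS_mbind, addr, len, SKETCH_MPOL_BIND,
		    &mask, maxnode, 0U) != 0)
		return -errno;
	return 0;
}
```

A caller would invoke sketch_bind_range() on each page-aligned region it
wants left alone and check the return value. Note that MPOL_BIND also
restricts where the pages of that range can be allocated, which is one
reason this route might be undesired.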

CCing Mel.

-- 
Cheers,

David / dhildenb