Re: [PATCH] util: NUMA aware memory preallocation

From: Paolo Bonzini <pbonzini@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>,
	Michal Privoznik <mprivozn@redhat.com>,
	qemu-devel@nongnu.org
Subject: Re: [PATCH] util: NUMA aware memory preallocation
Date: Thu, 12 May 2022 09:41:29 +0200	[thread overview]
Message-ID: <04938ba0-7ff4-df3c-348d-b679eac4fbac@redhat.com> (raw)
In-Reply-To: <YnvqYO8p/Z/7/DrO@redhat.com>

On 5/11/22 18:54, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 01:07:47PM +0200, Paolo Bonzini wrote:
>> On 5/11/22 12:10, Daniel P. Berrangé wrote:
>>> I expect creating/deleting I/O threads is cheap in comparison to
>>> the work done for preallocation. If libvirt is using -preconfig
>>> and object-add to create the memory backend, then we could have
>>> option of creating the I/O threads dynamically in -preconfig mode,
>>> create the memory backend, and then delete the I/O threads again.
>>
>> I think this is very overengineered.  Michal's patch is doing the obvious
>> thing and if it doesn't work that's because Libvirt is trying to micromanage
>> QEMU.
> 
> Calling it micromanaging is putting a very negative connotation on
> this. What we're trying todo is enforce a host resource policy for
> QEMU, in a way that a compromised QEMU can't escape, which is a
> valuable protection.

I'm sorry if that was a bit exaggerated, but the negative connotation 
was intentional.

>> As mentioned on IRC, if the reason is to prevent moving around threads in
>> realtime (SCHED_FIFO, SCHED_RR) classes, that should be fixed at the kernel
>> level.
> 
> We use cgroups where it is available to us, but we don't always have
> the freedom that we'd like.

I understand.  I'm thinking of a new flag to sched_setscheduler that 
fixes the CPU affinity and policy of the thread and prevents changing it 
in case QEMU is compromised later.  The seccomp/SELinux sandboxes can 
prevent setting the SCHED_FIFO class without this flag.

In addition, my hunch is that this works only because the RT setup of 
QEMU is not safe against priority inversion.  IIRC the iothreads are set 
with a non-realtime priority, but actually they should have a _higher_ 
priority than the CPU threads, and the thread pool I/O bound workers 
should have an even higher priority; otherwise you have a priority 
inversion situation where an interrupt is pending that would wake up the 
CPU, but the iothreads cannot process it because they have a lower 
priority than the CPU.

So the iothread and the associated util/thread-pool.c thread pool are 
the wrong tools to solve Michal's issue; they are not meant for 
background CPU-bound work, even though they _might_ work due to their 
incorrect RT setup.

Paolo