From: David Hildenbrand <david@redhat.com>
To: qemu-devel@nongnu.org
Cc: "Michal Privoznik" <mprivozn@redhat.com>,
	"Igor Mammedov" <imammedo@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Daniel P. Berrangé" <berrange@redhat.com>,
	"Eduardo Habkost" <eduardo@habkost.net>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	"Eric Blake" <eblake@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>,
	"Richard Henderson" <richard.henderson@linaro.org>,
	"Stefan Weil" <sw@weilnetz.de>
Subject: Re: [PATCH v3 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
Date: Thu, 27 Oct 2022 11:02:15 +0200
Message-ID: <312f188d-9b0c-839f-d747-9f7c4ac95683@redhat.com>
In-Reply-To: <20221014134720.168738-1-david@redhat.com>

On 14.10.22 15:47, David Hildenbrand wrote:
> This is a follow-up to "util: NUMA aware memory preallocation" [1] by
> Michal.
> 
> Setting the CPU affinity of threads from inside QEMU usually isn't
> easily possible, because we don't want QEMU -- once started and running
> guest code -- to be able to mess up the system. QEMU disallows the
> relevant syscalls using seccomp, so any such invocation will fail.
> 
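As an aside, here is a minimal standalone sketch of what the sandbox means
for affinity changes. It uses libseccomp directly and is only an
illustration of the resourcecontrol=deny idea, not QEMU's actual filter
from qemu-seccomp.c:

    /* Illustration only: deny sched_setaffinity() the way a
     * resourcecontrol=deny-style seccomp filter would, then show that a
     * later affinity change fails with EPERM.
     * Build: gcc -D_GNU_SOURCE sandbox_sketch.c -lseccomp -o sandbox_sketch
     */
    #include <seccomp.h>
    #include <sched.h>
    #include <errno.h>
    #include <stdio.h>

    int main(void)
    {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);

        /* Make affinity changes fail instead of killing the process. */
        seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
                         SCMP_SYS(sched_setaffinity), 0);
        seccomp_load(ctx);
        seccomp_release(ctx);

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");  /* EPERM: the filter is active */
        }
        return 0;
    }

The real filter covers more resource-control syscalls, but the effect is
the same: once the sandbox is active, affinity can no longer be changed
from inside the process.
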
> Especially for memory preallocation in memory backends, NUMA effects can
> significantly increase guest startup time, for example, when running
> large VMs backed by huge/gigantic pages and the preallocation threads
> end up on CPUs remote to the memory getting preallocated. For NUMA-aware
> preallocation, we have to set the CPU affinity of those threads, however:
> 
> (1) Once preallocation threads are created during preallocation, management
>      tools can no longer intervene to change their affinity. These threads
>      are created automatically on demand.
> (2) QEMU cannot easily set the CPU affinity itself.
> (3) The CPU affinity derived from the NUMA bindings of the memory backend
>      might not necessarily be exactly the CPUs we actually want to use
>      (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).
> 
> There is an easy "workaround": if we have a thread with the right CPU
> affinity, we can simply create new threads on demand via that prepared
> context. So, all we have to do is set up such a context ahead of time
> and then configure preallocation to create new threads via that
> environment.
> 
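To make the mechanism concrete, here is a minimal standalone sketch (not
QEMU code; all names are made up): on Linux, a thread created via
pthread_create() inherits the CPU affinity mask of the thread that creates
it, so pinning one long-lived context thread up front is enough to get
correctly pinned workers later, without any affinity syscalls at
preallocation time.

    /* Sketch: pin a context thread once; threads it spawns later inherit
     * its affinity mask, so no sched_setaffinity() is needed at that point.
     * Build: gcc -D_GNU_SOURCE context_sketch.c -pthread -o context_sketch
     */
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        /* In the real use case this would preallocate (touch) memory. */
        cpu_set_t set;
        (void)arg;
        pthread_getaffinity_np(pthread_self(), sizeof(set), &set);
        printf("worker restricted to %d CPU(s)\n", CPU_COUNT(&set));
        return NULL;
    }

    static void *context_thread(void *arg)
    {
        /* Created and pinned ahead of time (e.g., before the seccomp
         * sandbox is enabled); here it simply pins itself. */
        cpu_set_t set;
        pthread_t workers[4];
        (void)arg;

        CPU_ZERO(&set);
        CPU_SET(0, &set);            /* e.g., the CPUs of NUMA node 0 */
        CPU_SET(1, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /* Much later, on demand: spawn preallocation workers. */
        for (int i = 0; i < 4; i++) {
            pthread_create(&workers[i], NULL, worker, NULL);
        }
        for (int i = 0; i < 4; i++) {
            pthread_join(workers[i], NULL);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t ctx;

        pthread_create(&ctx, NULL, context_thread, NULL);
        pthread_join(ctx, NULL);
        return 0;
    }

That is essentially what the thread-context object below packages up in a
user-creatable form.
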
> So, let's introduce a user-creatable "thread-context" object that
> essentially consists of a context thread used to create new threads.
> QEMU can either try setting the CPU affinity itself ("cpu-affinity" and
> "node-affinity" properties), or upper layers can extract the thread id
> ("thread-id" property) to configure the affinity externally.
> 
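For the "node-affinity" variant, the node has to be translated into a set
of host CPUs at some point. A hedged sketch of that translation using
libnuma follows; whether the series uses libnuma or sysfs for this is an
implementation detail not shown here:

    /* Sketch: derive the CPUs belonging to a host NUMA node with libnuma.
     * Build: gcc node_affinity_sketch.c -lnuma -o node_affinity_sketch
     */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        int node = 0;               /* host-nodes=0 in the example below */

        if (numa_available() < 0) {
            fprintf(stderr, "NUMA not available\n");
            return 1;
        }

        struct bitmask *cpus = numa_allocate_cpumask();
        if (numa_node_to_cpus(node, cpus) < 0) {
            perror("numa_node_to_cpus");
            return 1;
        }

        printf("CPUs of node %d:", node);
        for (int cpu = 0; cpu < numa_num_configured_cpus(); cpu++) {
            if (numa_bitmask_isbitset(cpus, cpu)) {
                printf(" %d", cpu);
            }
        }
        printf("\n");

        numa_free_cpumask(cpus);
        return 0;
    }

The resulting CPU set would then be applied to the context thread just
like an explicit cpu-affinity list.
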
> Make memory backends consume a thread-context object
> (via the "prealloc-context" property) and use it when preallocating to
> create new threads with the desired CPU affinity. Further, to make this
> easier to use, allow creating "thread-context" objects, including
> setting the CPU affinity directly from QEMU, before enabling the
> sandbox option.
> 
> 
> Quick test on a system with 2 NUMA nodes:
> 
> Without CPU affinity:
>      time qemu-system-x86_64 \
>          -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
>          -nographic -monitor stdio
> 
>      real    0m5.383s
>      real    0m3.499s
>      real    0m5.129s
>      real    0m4.232s
>      real    0m5.220s
>      real    0m4.288s
>      real    0m3.582s
>      real    0m4.305s
>      real    0m5.421s
>      real    0m4.502s
> 
>      -> Runtime heavily depends on which CPUs the scheduler selects
> 
> With CPU affinity:
>      time qemu-system-x86_64 \
>          -object thread-context,id=tc1,node-affinity=0 \
>          -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
>          -sandbox enable=on,resourcecontrol=deny \
>          -nographic -monitor stdio
> 
>      real    0m1.959s
>      real    0m1.942s
>      real    0m1.943s
>      real    0m1.941s
>      real    0m1.948s
>      real    0m1.964s
>      real    0m1.949s
>      real    0m1.948s
>      real    0m1.941s
>      real    0m1.937s
> 
> On reasonably large VMs, the speedup can be quite significant.
> 
> While this concept is currently only used for short-lived preallocation
> threads, nothing major prevents reusing it for other threads that are
> harder to identify/configure -- except that we would need additional
> (idle) context threads that are otherwise left unused.
> 
> This series does not yet tackle concurrent preallocation of memory
> backends. Memory backend objects are created and memory is preallocated one
> memory backend at a time -- and there is currently no way to do
> preallocation asynchronously.
> 
> [1] https://lkml.kernel.org/r/ffdcd118d59b379ede2b64745144165a40f6a813.1652165704.git.mprivozn@redhat.com
> 
> v2 -> v3:
> * "util: Introduce ThreadContext user-creatable object"
>   -> Further improve documentation and patch description and add ACK. [Markus]
> * "util: Add write-only "node-affinity" property for ThreadContext"
>   -> Further improve documentation and patch description and add ACK. [Markus]
> 
> v1 -> v2:
> * Fixed some minor style nits
> * "util: Introduce ThreadContext user-creatable object"
>   -> Improve documentation and patch description. [Markus]
> * "util: Add write-only "node-affinity" property for ThreadContext"
>   -> Improve documentation and patch description. [Markus]
> 
> RFC -> v1:
> * "vl: Allow ThreadContext objects to be created before the sandbox option"
>   -> Move parsing of the "name" property before object_create_pre_sandbox
> * Added RB's

I'm queuing this to

https://github.com/davidhildenbrand/qemu.git mem-next

and will most probably send an MR tomorrow, before soft freeze.

-- 
Thanks,

David / dhildenb



Thread overview: 12+ messages
2022-10-14 13:47 [PATCH v3 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext David Hildenbrand
2022-10-14 13:47 ` [PATCH v3 1/7] util: Cleanup and rename os_mem_prealloc() David Hildenbrand
2022-10-14 13:47 ` [PATCH v3 2/7] util: Introduce qemu_thread_set_affinity() and qemu_thread_get_affinity() David Hildenbrand
2022-10-14 13:47 ` [PATCH v3 3/7] util: Introduce ThreadContext user-creatable object David Hildenbrand
2022-10-14 13:47 ` [PATCH v3 4/7] util: Add write-only "node-affinity" property for ThreadContext David Hildenbrand
2022-10-17  8:56   ` Markus Armbruster
2022-10-17 11:29     ` David Hildenbrand
2022-10-14 13:47 ` [PATCH v3 5/7] util: Make qemu_prealloc_mem() optionally consume a ThreadContext David Hildenbrand
2022-10-14 13:47 ` [PATCH v3 6/7] hostmem: Allow for specifying a ThreadContext for preallocation David Hildenbrand
2022-10-14 13:47 ` [PATCH v3 7/7] vl: Allow ThreadContext objects to be created before the sandbox option David Hildenbrand
2022-10-19 12:36 ` [PATCH v3 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext David Hildenbrand
2022-10-27  9:02 ` David Hildenbrand [this message]
