From: Juergen Gross <jgross@suse.com>
To: Demi Marie Obenour <demi@invisiblethingslab.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>,
	Jan Beulich <JBeulich@suse.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Xen developer discussion <xen-devel@lists.xenproject.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	stable@vger.kernel.org
Subject: Re: [PATCH] xen: speed up grant-table reclaim
Date: Tue, 13 Jun 2023 08:45:31 +0200	[thread overview]
Message-ID: <1150fd9c-25c8-c91c-4ca2-2c587e3fbb43@suse.com> (raw)
In-Reply-To: <ZId7ixbqlCSygtvb@itl-email>


On 12.06.23 22:09, Demi Marie Obenour wrote:
> On Mon, Jun 12, 2023 at 08:27:59AM +0200, Juergen Gross wrote:
>> On 10.06.23 17:32, Demi Marie Obenour wrote:
>>> When a grant entry is still in use by the remote domain, Linux must put
>>> it on a deferred list.
>>
>> This lacks quite a bit of context.
>>
>> The main problem is that the grant has not been unmapped after the end
>> of a request, while the side granting the access assumes that it has
>> been.
> 
> The GUI agent has relied on deferred grant reclaim for as long as it has
> existed.  One could argue that doing so means that the agent is misusing
> gntalloc, but this is not documented anywhere.  A better fix would be to
> use IOCTL_GNTDEV_SET_UNMAP_NOTIFY in the GUI daemon.

I don't think this is a gntalloc-specific problem. You could create the same
problem with a kernel implementation.

And yes, using IOCTL_GNTDEV_SET_UNMAP_NOTIFY might be a good idea.
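
For reference, a rough sketch of what the registration on the gntdev
(i.e. daemon) side could look like. This is only meant as illustration:
struct and flag names are the ones from include/uapi/xen/gntdev.h as I
remember them, the header path and "notify_port" (an event channel port
agreed on between agent and daemon) are assumptions, and the exact
semantics of .index are worth double-checking against the header
comments:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <xen/gntdev.h>

/*
 * Ask gntdev to signal notify_port once the mapping identified by
 * map_index (the index IOCTL_GNTDEV_MAP_GRANT_REF returned, which was
 * also used for mmap()) goes away, e.g. because the daemon unmaps it
 * or exits.
 */
static int set_unmap_notify(int gntdev_fd, uint64_t map_index,
                            uint32_t notify_port)
{
        struct ioctl_gntdev_unmap_notify notify;

        memset(&notify, 0, sizeof(notify));
        notify.index = map_index;
        notify.action = UNMAP_NOTIFY_SEND_EVENT;
        notify.event_channel_port = notify_port;

        return ioctl(gntdev_fd, IOCTL_GNTDEV_SET_UNMAP_NOTIFY, &notify);
}

With UNMAP_NOTIFY_SEND_EVENT the granting side gets told as soon as the
daemon's mapping goes away, no matter whether that happens via an
orderly unmap or via process exit.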

>> In general this means that the two sides implementing the protocol don't
>> agree on how it should work, or that the protocol itself has a flaw.
> 
> What would a better solution be?  This is going to be particularly
> tricky with Wayland, as the wl_shm protocol makes absolutely no
> guarantees that compositors will promptly release the mapping and
> provides no way whatsoever for Wayland clients to know when this has
> happened.  Relying on an LD_PRELOAD hack is not sustainable.
> 
>>> Normally, this list is very short, because
>>> the PV network and block protocols expect the backend to unmap the grant
>>> first.
>>
>> Normally the list is just empty. Only in very rare cases, like premature
>> PV frontend module unloading, are deferred grant reclaims expected.
> 
> In the case of a system using only properly-designed PV protocols
> implemented in kernel mode I agree.  However, both libxenvchan and the
> Qubes GUI protocol are implemented in user mode and this means that if
> the frontend process (the one that uses gntalloc) crashes, deferred
> grant reclaims will occur.

This would be the user space variant of frontend driver unloading.

> Worse, it is possible for the domain to use
> the grant in a PV protocol.  If the PV backend driver then maps and
> unmaps the grant and then tells the frontend driver to reclaim it, but
> the backend userspace process (the one using gntdev) maps it before the
> frontend reclaims it, the frontend will think the backend is trying
> to exploit XSA-396 and freeze the connection.

Let me sum up how I understand the above statement:

1. Frontend F in DomA grants DomB access via grant X for PV-device Y.
2. DomB backend B for PV-device Y maps the page using grant X and uses it.
3. DomB backend B unmaps grant X and signals end of usage to DomA frontend F.
4. A DomB userspace process maps grant X.
5. DomA frontend F tries to reclaim grant X and fails due to the DomB
    userspace mapping.

So why would a DomB userspace process map grant X? This would imply a malicious
process in DomB. This might be possible, but I don't see how such an attack
would relate to your problem. It could happen with any malicious userspace
program with root access in a domain running a PV backend.

> 
>>> However, Qubes OS's GUI protocol is subject to the constraints
>>> of the X Window System, and as such winds up with the frontend unmapping
>>> the window first.  As a result, the list can grow very large, resulting
>>> in a massive memory leak and eventual VM freeze.
>>
>> I do understand that it is difficult to change the protocol and/or
>> behavior after the fact, or that performance considerations are in the
>> way of doing so.
> 
> Would the correct fix be to use IOCTL_GNTDEV_SET_UNMAP_NOTIFY?  That
> would require that the agent either create a new event channel for each
> window or maintain a pool of event channels, but that should be doable.

I think this sounds like a promising idea.
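
Just to illustrate what I have in mind for the agent side (libxenevtchn
calls from memory, error handling and the hand-over of the port number
to the daemon via your GUI protocol omitted):

#include <stdint.h>
#include <xenevtchn.h>

/* One unbound channel per window that only gui_domid may bind to. */
static evtchn_port_or_error_t alloc_window_port(xenevtchn_handle *xce,
                                                uint32_t gui_domid)
{
        return xenevtchn_bind_unbound_port(xce, gui_domid);
}

/*
 * Once the daemon has registered its bound end of the channel via
 * IOCTL_GNTDEV_SET_UNMAP_NOTIFY, an event on our end means "the
 * daemon's mapping is gone", so the grants of the corresponding
 * window can be released now.
 */
static void handle_unmap_events(xenevtchn_handle *xce)
{
        evtchn_port_or_error_t port = xenevtchn_pending(xce);

        if (port < 0)
                return;
        xenevtchn_unmask(xce, port);
        /* look up the window owning "port" and end foreign access */
}

Whether a port is allocated per window or handed out from a
pre-allocated pool is then purely a resource management question.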

> This still does not solve the problem of the frontend exiting
> unexpectedly, though.

Such cases need to be handled via deferred reclaim. You could add a flag
to struct deferred_entry when the entry is queued via gntalloc_vma_close(),
indicating that this is an expected deferred reclaim which should not lead
to error messages.
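
Purely as illustration (not compile-tested, the new member and the
message text are invented), something along these lines:

/* in drivers/xen/grant-table.c */
struct deferred_entry {
        struct list_head list;
        grant_ref_t ref;
        bool expected;          /* queued via gntalloc_vma_close(), */
                                /* so don't warn about it           */
        uint16_t warn_delay;
        struct page *page;
};

/* ... and in gnttab_handle_deferred(), when an entry keeps failing ... */
        if (!entry->expected)
                pr_warn("g.e. %#x still pending after retries\n", entry->ref);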

> 
>>> To partially solve this problem, make the number of entries that the VM
>>> will attempt to free at each iteration tunable.  The default is still
>>> 10, but it can be overridden at compile-time (via Kconfig), boot-time
>>> (via a kernel command-line option), or runtime (via sysfs).
>>
>> Is there really a need to have another Kconfig option for this? AFAICS
>> only QubesOS is affected by the problem you are trying to solve. I don't
>> see why you can't use the command-line option or sysfs node to set the
>> higher reclaim batch size.
> 
> Fair.  In practice, Qubes OS will need to use the sysfs node, since
> the other two do not work with in-VM kernels.
> 
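Setting the parameter from inside the VM is then just a write to the
sysfs file named in your help text; for illustration only (the value
100 is an arbitrary example):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* E.g. called from a Qubes startup helper inside the VM. */
static int set_free_per_iteration(const char *val)
{
        int fd = open("/sys/module/grant_table/parameters/free_per_iteration",
                      O_WRONLY);
        ssize_t ret;

        if (fd < 0)
                return -1;
        ret = write(fd, val, strlen(val));
        close(fd);
        return ret == (ssize_t)strlen(val) ? 0 : -1;
}

/* set_free_per_iteration("100"); */
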
>>> Fixes: 569ca5b3f94c ("xen/gnttab: add deferred freeing logic")
>>
>> I don't think this "Fixes:" tag is appropriate. The mentioned commit didn't
>> have a bug. You are adding new functionality on top of it.
> 
> I’ll drop the "Fixes:" tag, but I will keep the "Cc: stable@vger.kernel.org"
> as I believe this patch meets the following criterion for stable
> backport (from Documentation/process/stable-kernel-rules.rst):
> 
>      Serious issues as reported by a user of a distribution kernel may also
>      be considered if they fix a notable performance or interactivity issue.
> 

Sure, this seems appropriate.

>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
>>> ---
>>>    drivers/xen/Kconfig       | 12 ++++++++++++
>>>    drivers/xen/grant-table.c | 40 ++++++++++++++++++++++++++++-----------
>>>    2 files changed, 41 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
>>> index d5d7c402b65112b8592ba10bd3fd1732c26b771e..8f96e1359eb102d6420775b66e7805004a4ce9fe 100644
>>> --- a/drivers/xen/Kconfig
>>> +++ b/drivers/xen/Kconfig
>>> @@ -65,6 +65,18 @@ config XEN_MEMORY_HOTPLUG_LIMIT
>>>    	  This value is used to allocate enough space in internal
>>>    	  tables needed for physical memory administration.
>>> +config XEN_GRANTS_RECLAIM_PER_ITERATION
>>> +	int "Default number of grant entries to reclaim per iteration"
>>> +	default 10
>>> +	range 10 4294967295
>>> +	help
>>> +	  This sets the default value for the grant_table.free_per_iteration
>>> +	  kernel command line option, which sets the number of grants that
>>> +	  Linux will try to reclaim at once.  The default is 10, but
>>> +	  workloads that make heavy use of gntalloc will likely want to
>>> +	  increase this.  The current value can be accessed and/or modified
>>> +	  via /sys/module/grant_table/parameters/free_per_iteration.
>>> +
>>
>> Apart from the fact that I don't like adding a new Kconfig option, the help
>> text is not telling the true story. Heavy use of gntalloc isn't to blame, but
>> rather the fact that your PV-device implementation is based on the reclaim
>> functionality. TBH, someone not familiar with the grant reclaim will have no
>> chance to select a sensible value based on your help text.
> 
> That’s a good point.  Qubes OS will need to use the sysfs value anyway
> in order to support in-VM kernels, so the Kconfig option is not really
> useful.

Nice.

> 
>>>    config XEN_SCRUB_PAGES_DEFAULT
>>>    	bool "Scrub pages before returning them to system by default"
>>>    	depends on XEN_BALLOON
>>> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
>>> index e1ec725c2819d4d5dede063eb00d86a6d52944c0..fa666aa6abc3e786dddc94f895641505ec0b23d8 100644
>>> --- a/drivers/xen/grant-table.c
>>> +++ b/drivers/xen/grant-table.c
>>> @@ -498,14 +498,20 @@ static LIST_HEAD(deferred_list);
>>>    static void gnttab_handle_deferred(struct timer_list *);
>>>    static DEFINE_TIMER(deferred_timer, gnttab_handle_deferred);
>>> +static atomic64_t deferred_count;
>>> +static atomic64_t leaked_count;
>>> +static unsigned int free_per_iteration = CONFIG_XEN_GRANTS_RECLAIM_PER_ITERATION;
>>> +
>>>    static void gnttab_handle_deferred(struct timer_list *unused)
>>>    {
>>> -	unsigned int nr = 10;
>>> +	unsigned int nr = READ_ONCE(free_per_iteration);
>>> +	const bool ignore_limit = nr == 0;
>>>    	struct deferred_entry *first = NULL;
>>>    	unsigned long flags;
>>> +	size_t freed = 0;
>>>    	spin_lock_irqsave(&gnttab_list_lock, flags);
>>> -	while (nr--) {
>>> +	while ((ignore_limit || nr--) && !list_empty(&deferred_list)) {
>>>    		struct deferred_entry *entry
>>>    			= list_first_entry(&deferred_list,
>>>    					   struct deferred_entry, list);
>>> @@ -515,10 +521,13 @@ static void gnttab_handle_deferred(struct timer_list *unused)
>>>    		list_del(&entry->list);
>>>    		spin_unlock_irqrestore(&gnttab_list_lock, flags);
>>>    		if (_gnttab_end_foreign_access_ref(entry->ref)) {
>>> +			uint64_t ret = atomic64_sub_return(1, &deferred_count);
>>>    			put_free_entry(entry->ref);
>>> -			pr_debug("freeing g.e. %#x (pfn %#lx)\n",
>>> -				 entry->ref, page_to_pfn(entry->page));
>>> +			pr_debug("freeing g.e. %#x (pfn %#lx), %llu remaining\n",
>>> +				 entry->ref, page_to_pfn(entry->page),
>>> +				 (unsigned long long)ret);
>>
>> I'd prefer not to issue lots of prints (be it debug or info ones) if the
>> reclaim is expected to happen with a specific PV-device.
>>
>> My preferred solution would be a per-device flag, but at least you should
>> switch to pr_*_ratelimited(). Same applies further down.
> 
> What do you mean by “per-device flag”?  These grants are allocated by
> userspace using gntalloc, so there is no PV device on which the flag
> could be set.

In this case the flag should be associated with the file pointer passed to
gntalloc_open().
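
I.e. (sketch only, not compile-tested, and the new member name is of
course up for discussion) something like:

/* in drivers/xen/gntalloc.c */
struct gntalloc_file_private_data {
        struct list_head list;
        uint64_t index;
        bool expect_deferred;   /* deferred reclaims for grants of this */
                                /* file are expected, don't warn        */
};

which gntalloc_vma_close() / __del_gref() would then propagate into the
deferred entries along the lines sketched above.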


Juergen


