Re: [PATCH] xen: speed up grant-table reclaim

From: Juergen Gross <jgross@suse.com>
To: Demi Marie Obenour <demi@invisiblethingslab.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>,
	Jan Beulich <JBeulich@suse.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Xen developer discussion <xen-devel@lists.xenproject.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	stable@vger.kernel.org
Subject: Re: [PATCH] xen: speed up grant-table reclaim
Date: Mon, 12 Jun 2023 08:27:59 +0200	[thread overview]
Message-ID: <85275900-6b6a-5391-a2a0-16704df3f00f@suse.com> (raw)
In-Reply-To: <20230610153232.1859-1-demi@invisiblethingslab.com>

[-- Attachment #1.1.1: Type: text/plain, Size: 7914 bytes --]

On 10.06.23 17:32, Demi Marie Obenour wrote:
> When a grant entry is still in use by the remote domain, Linux must put
> it on a deferred list.

This lacks quite some context.

The main problem is related to the grant not having been unmapped after
the end of a request, but the side granting the access is assuming this
should be the case.

In general this means that the two sides implementing the protocol don't
agree how it should work, or that the protocol itself has a flaw.

 > Normally, this list is very short, because
> the PV network and block protocols expect the backend to unmap the grant
> first.

Normally the list is just empty. Only in very rare cases like premature
PV frontend module unloading it is expected to see cases of deferred
grant reclaims.

> However, Qubes OS's GUI protocol is subject to the constraints
> of the X Window System, and as such winds up with the frontend unmapping
> the window first.  As a result, the list can grow very large, resulting
> in a massive memory leak and eventual VM freeze.

I do understand that it is difficult to change the protocol and/or
behavior after the fact, or that performance considerations are in the
way of doing so.

> To partially solve this problem, make the number of entries that the VM
> will attempt to free at each iteration tunable.  The default is still
> 10, but it can be overridden at compile-time (via Kconfig), boot-time
> (via a kernel command-line option), or runtime (via sysfs).

Is there really a need to have another Kconfig option for this? AFAICS
only QubesOS is affected by the problem you are trying to solve. I don't
see why you can't use the command-line option or sysfs node to set the
higher reclaim batch size.

> 
> Fixes: 569ca5b3f94c ("xen/gnttab: add deferred freeing logic")

I don't think this "Fixes:" tag is appropriate. The mentioned commit didn't
have a bug. You are adding new functionality on top of it.

> Cc: stable@vger.kernel.org
> Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
> ---
>   drivers/xen/Kconfig       | 12 ++++++++++++
>   drivers/xen/grant-table.c | 40 ++++++++++++++++++++++++++++-----------
>   2 files changed, 41 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> index d5d7c402b65112b8592ba10bd3fd1732c26b771e..8f96e1359eb102d6420775b66e7805004a4ce9fe 100644
> --- a/drivers/xen/Kconfig
> +++ b/drivers/xen/Kconfig
> @@ -65,6 +65,18 @@ config XEN_MEMORY_HOTPLUG_LIMIT
>   	  This value is used to allocate enough space in internal
>   	  tables needed for physical memory administration.
>   
> +config XEN_GRANTS_RECLAIM_PER_ITERATION
> +	int "Default number of grant entries to reclaim per iteration"
> +	default 10
> +	range 10 4294967295
> +	help
> +	  This sets the default value for the grant_table.free_per_iteration
> +	  kernel command line option, which sets the number of grants that
> +	  Linux will try to reclaim at once.  The default is 10, but
> +	  workloads that make heavy use of gntalloc will likely want to
> +	  increase this.  The current value can be accessed and/or modified
> +	  via /sys/module/grant_table/parameters/free_per_iteration.
> +

Apart from the fact that I don't like adding a new Kconfig option, the help
text is not telling the true story. Heavy use of gntalloc isn't to blame, but
the fact that your PV-device implementation is based on the reclaim
functionality. TBH, someone not familiar with the grant reclaim will have no
chance to select a sensible value based on your help text.

>   config XEN_SCRUB_PAGES_DEFAULT
>   	bool "Scrub pages before returning them to system by default"
>   	depends on XEN_BALLOON
> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
> index e1ec725c2819d4d5dede063eb00d86a6d52944c0..fa666aa6abc3e786dddc94f895641505ec0b23d8 100644
> --- a/drivers/xen/grant-table.c
> +++ b/drivers/xen/grant-table.c
> @@ -498,14 +498,20 @@ static LIST_HEAD(deferred_list);
>   static void gnttab_handle_deferred(struct timer_list *);
>   static DEFINE_TIMER(deferred_timer, gnttab_handle_deferred);
>   
> +static atomic64_t deferred_count;
> +static atomic64_t leaked_count;
> +static unsigned int free_per_iteration = CONFIG_XEN_GRANTS_RECLAIM_PER_ITERATION;
> +
>   static void gnttab_handle_deferred(struct timer_list *unused)
>   {
> -	unsigned int nr = 10;
> +	unsigned int nr = READ_ONCE(free_per_iteration);
> +	const bool ignore_limit = nr == 0;
>   	struct deferred_entry *first = NULL;
>   	unsigned long flags;
> +	size_t freed = 0;
>   
>   	spin_lock_irqsave(&gnttab_list_lock, flags);
> -	while (nr--) {
> +	while ((ignore_limit || nr--) && !list_empty(&deferred_list)) {
>   		struct deferred_entry *entry
>   			= list_first_entry(&deferred_list,
>   					   struct deferred_entry, list);
> @@ -515,10 +521,13 @@ static void gnttab_handle_deferred(struct timer_list *unused)
>   		list_del(&entry->list);
>   		spin_unlock_irqrestore(&gnttab_list_lock, flags);
>   		if (_gnttab_end_foreign_access_ref(entry->ref)) {
> +			uint64_t ret = atomic64_sub_return(1, &deferred_count);
>   			put_free_entry(entry->ref);
> -			pr_debug("freeing g.e. %#x (pfn %#lx)\n",
> -				 entry->ref, page_to_pfn(entry->page));
> +			pr_debug("freeing g.e. %#x (pfn %#lx), %llu remaining\n",
> +				 entry->ref, page_to_pfn(entry->page),
> +				 (unsigned long long)ret);

I'd prefer not to issue lots of prints (being it debug or info ones) if the
reclaim is expected to happen with a specific PV-device.

My preferred solution would be a per-device flag, but at least you should
switch to pr_*_ratelimited(). Same applies further down.

>   			put_page(entry->page);
> +			freed++;
>   			kfree(entry);
>   			entry = NULL;
>   		} else {
> @@ -530,21 +539,22 @@ static void gnttab_handle_deferred(struct timer_list *unused)
>   		spin_lock_irqsave(&gnttab_list_lock, flags);
>   		if (entry)
>   			list_add_tail(&entry->list, &deferred_list);
> -		else if (list_empty(&deferred_list))
> -			break;
>   	}
> -	if (!list_empty(&deferred_list) && !timer_pending(&deferred_timer)) {
> +	if (list_empty(&deferred_list))
> +		WARN_ON(atomic64_read(&deferred_count));
> +	else if (!timer_pending(&deferred_timer)) {
>   		deferred_timer.expires = jiffies + HZ;
>   		add_timer(&deferred_timer);
>   	}
>   	spin_unlock_irqrestore(&gnttab_list_lock, flags);
> +	pr_debug("Freed %zu references", freed);
>   }
>   
>   static void gnttab_add_deferred(grant_ref_t ref, struct page *page)
>   {
>   	struct deferred_entry *entry;
>   	gfp_t gfp = (in_atomic() || irqs_disabled()) ? GFP_ATOMIC : GFP_KERNEL;
> -	const char *what = KERN_WARNING "leaking";
> +	uint64_t leaked, deferred;
>   
>   	entry = kmalloc(sizeof(*entry), gfp);
>   	if (!page) {
> @@ -567,12 +577,20 @@ static void gnttab_add_deferred(grant_ref_t ref, struct page *page)
>   			add_timer(&deferred_timer);
>   		}
>   		spin_unlock_irqrestore(&gnttab_list_lock, flags);
> -		what = KERN_DEBUG "deferring";
> +		deferred = atomic64_add_return(1, &deferred_count);
> +		leaked = atomic64_read(&leaked_count);
> +		pr_debug("deferring g.e. %#x (pfn %#lx) (total deferred %llu, total leaked %llu)\n",
> +			 ref, page ? page_to_pfn(page) : -1, deferred, leaked);
> +	} else {
> +		deferred = atomic64_read(&deferred_count);
> +		leaked = atomic64_add_return(1, &leaked_count);
> +		pr_warn("leaking g.e. %#x (pfn %#lx) (total deferred %llu, total leaked %llu)\n",
> +			ref, page ? page_to_pfn(page) : -1, deferred, leaked);
>   	}
> -	printk("%s g.e. %#x (pfn %#lx)\n",
> -	       what, ref, page ? page_to_pfn(page) : -1);
>   }
>   
> +module_param(free_per_iteration, uint, 0600);
> +

Could you please move the parameter definition closer to the related
variable definition?

Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3149 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]