Re: [PATCH 1/4] xen: Introduce non-broken hypercalls for the p2m pool size

From: Jan Beulich <jbeulich@suse.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>, "Wei Liu" <wl@xen.org>,
	"Stefano Stabellini" <sstabellini@kernel.org>,
	"Julien Grall" <julien@xen.org>,
	"Volodymyr Babchuk" <Volodymyr_Babchuk@epam.com>,
	"Bertrand Marquis" <bertrand.marquis@arm.com>,
	"Henry Wang" <Henry.Wang@arm.com>,
	"Anthony PERARD" <anthony.perard@citrix.com>,
	Xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [PATCH 1/4] xen: Introduce non-broken hypercalls for the p2m pool size
Date: Wed, 26 Oct 2022 15:42:15 +0200	[thread overview]
Message-ID: <ffb8bdb8-f54b-2107-ce1a-775337c172ac@suse.com> (raw)
In-Reply-To: <20221026102018.4144-2-andrew.cooper3@citrix.com>

On 26.10.2022 12:20, Andrew Cooper wrote:
> The existing XEN_DOMCTL_SHADOW_OP_{GET,SET}_ALLOCATION have problems:
> 
>  * All set_allocation() flavours have an overflow-before-widen bug when
>    calculating "sc->mb << (20 - PAGE_SHIFT)".
>  * All flavours have a granularity of of 1M.  This was tolerable when the size
>    of the pool could only be set at the same granularity, but is broken now
>    that ARM has a 16-page stopgap allocation in use.
>  * All get_allocation() flavours round up, and in particular turn 0 into 1,
>    meaning the get op returns junk before a successful set op.
>  * The x86 flavours reject the hypercalls before the VM has vCPUs allocated,
>    despite the pool size being a domain property.

I guess this is merely a remnant and could easily be dropped there.

>  * Even the hypercall names are long-obsolete.
> 
> Implement an interface that doesn't suck, which can be first used to unit test
> the behaviour, and subsequently correct a broken implementation.  The old
> interface will be retired in due course.
> 
> This is part of XSA-409 / CVE-2022-33747.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Xen Security Team <security@xen.org>
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
> CC: Stefano Stabellini <sstabellini@kernel.org>
> CC: Julien Grall <julien@xen.org>
> CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
> CC: Bertrand Marquis <bertrand.marquis@arm.com>
> CC: Henry Wang <Henry.Wang@arm.com>
> CC: Anthony PERARD <anthony.perard@citrix.com>
> 
> Name subject to improvement.

paging_{get,set}_mempool_size() for the arch helpers (in particular
fitting better with them living in paging.c as well its multi-purpose use
on x86) and XEN_DOMCTL_{get,set}_paging_mempool_size? Perhaps even the
"mem" could be dropped?

>  ABI not.

With the comment in the public header saying "Users of this interface are
required to identify the granularity by other means" I wonder why the
interface needs to be byte-granular. If the caller needs to know page size
by whatever means, it can as well pass in a page count.

> Future TODOs:
>  * x86 shadow still rounds up.  This is buggy as it's a simultaneous equation
>    with tot_pages which varies over time with ballooning.
>  * x86 PV is weird.  There are no toolstack interact with the shadow pool
>    size, but the "shadow" pool it does come into existence when logdirty (or
>    pv-l1tf) when first enabled.
>  * The shadow+hap logic is in desperate need of deduping.

I have a tiny step towards this queued as post-XSA-410 work, folding HAP's
and shadow's freelist, total_pages, free_pages, and p2m_pages. Here this
would mean {hap,shadow}_get_allocation_bytes() could be done away with,
having the logic exclusively in paging.c.

> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -100,6 +100,14 @@ unsigned int p2m_get_allocation(struct domain *d)
>      return ROUNDUP(nr_pages, 1 << (20 - PAGE_SHIFT)) >> (20 - PAGE_SHIFT);
>  }
>  
> +/* Return the size of the pool, in bytes. */
> +int arch_get_p2m_mempool_size(struct domain *d, uint64_t *size)
> +{
> +    *size = ACCESS_ONCE(d->arch.paging.p2m_total_pages) << PAGE_SHIFT;

This may overflow for Arm32.

> @@ -157,6 +165,25 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>      return 0;
>  }
>  
> +int arch_set_p2m_mempool_size(struct domain *d, uint64_t size)
> +{
> +    unsigned long pages = size >> PAGE_SHIFT;
> +    bool preempted = false;
> +    int rc;
> +
> +    if ( (size & ~PAGE_MASK) ||          /* Non page-sized request? */
> +         pages != (size >> PAGE_SHIFT) ) /* 32-bit overflow? */
> +        return -EINVAL;

Simply "(pages << PAGE_SHIFT) != size"? And then move the check into
common code?

> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -345,6 +345,16 @@ unsigned int hap_get_allocation(struct domain *d)
>              + ((pg & ((1 << (20 - PAGE_SHIFT)) - 1)) ? 1 : 0));
>  }
>  
> +int hap_get_allocation_bytes(struct domain *d, uint64_t *size)
> +{
> +    unsigned long pages = (d->arch.paging.hap.total_pages +
> +                           d->arch.paging.hap.p2m_pages);

Unlike for Arm no ACCESS_ONCE() here? Also the addition can in
principle overflow, because being done only in 32 bits.

> --- a/xen/arch/x86/mm/paging.c
> +++ b/xen/arch/x86/mm/paging.c
> @@ -977,6 +977,45 @@ int __init paging_set_allocation(struct domain *d, unsigned int pages,
>  }
>  #endif
>  
> +int arch_get_p2m_mempool_size(struct domain *d, uint64_t *size)
> +{
> +    int rc;
> +
> +    if ( is_pv_domain(d) )
> +        return -EOPNOTSUPP;
> +
> +    if ( hap_enabled(d) )
> +        rc = hap_get_allocation_bytes(d, size);
> +    else
> +        rc = shadow_get_allocation_bytes(d, size);
> +
> +    return rc;
> +}
> +
> +int arch_set_p2m_mempool_size(struct domain *d, uint64_t size)
> +{
> +    unsigned long pages = size >> PAGE_SHIFT;
> +    bool preempted = false;
> +    int rc;
> +
> +    if ( is_pv_domain(d) )
> +        return -EOPNOTSUPP;

Why? You do say "PV is weird" in a post-commit-message remark, but why
do you want to retain this weirdness? Even if today the tool stack
doesn't set the size when enabling log-dirty mode, I'd view this as a
bug which could be addressed purely in the tool stack if this check
wasn't there.

> +    if ( size & ~PAGE_MASK )             /* Non page-sized request? */
> +        return -EINVAL;
> +
> +    ASSERT(paging_mode_enabled(d));

Not only with the PV aspect in mind - why? It looks reasonable to me
to set the pool size before enabling any paging mode.

> +    paging_lock(d);
> +    if ( hap_enabled(d) )
> +        rc = hap_set_allocation(d, pages, &preempted);
> +    else
> +        rc = shadow_set_allocation(d, pages, &preempted);

Potential truncation from the "unsigned long" -> "unsigned int"
conversions.

Jan