All of lore.kernel.org
 help / color / mirror / Atom feed
From: Xen.org security team <security@xen.org>
To: xen-announce@lists.xen.org, xen-devel@lists.xen.org,
	xen-users@lists.xen.org, oss-security@lists.openwall.com
Cc: Xen.org security team <security-team-members@xen.org>
Subject: Xen Security Advisory 380 v3 (CVE-2021-28698) - long running loops in grant table handling
Date: Wed, 01 Sep 2021 09:30:50 +0000	[thread overview]
Message-ID: <E1mLMa2-0006VB-JU@xenbits.xenproject.org> (raw)

[-- Attachment #1: Type: text/plain, Size: 6043 bytes --]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

            Xen Security Advisory CVE-2021-28698 / XSA-380
                               version 3

              long running loops in grant table handling

UPDATES IN VERSION 3
====================

New bugfix patch on top of the prior set.

ISSUE DESCRIPTION
=================

In order to properly monitor resource use, Xen maintains information on
the grant mappings a domain may create to map grants offered by other
domains.  In the process of carrying out certain actions, Xen would
iterate over all such entries, including ones which aren't in use
anymore and some which may have been created but never used.  If the
number of entries for a given domain is large enough, this iterating of
the entire table may tie up a CPU for too long, starving other domains
or causing issues in the hypervisor itself.

Note that a domain may map its own grants, i.e. there is no need for
multiple domains to be involved here.  A pair of "cooperating" guests
may, however, cause the effects to be more severe.

IMPACT
======

Malicious or buggy guest kernels may be able to mount a Denial of
Service (DoS) attack affecting the entire system.

VULNERABLE SYSTEMS
==================

All Xen versions from at least 3.2 onwards are vulnerable in principle.

Systems running with default settings are generally believed to not be
vulnerable.

MITIGATION
==========

The problem can be avoided by reducing the number of grant mappings Xen
would allow guests to establish.  For example, setting
"gnttab_max_maptrack_frames=256" on the Xen command line (available
from Xen 4.5 onwards) or "max_maptrack_frames=256" in the xl domain
configurations (available from Xen 4.10 onwards) may be low enough for
all hardware Xen is able to run on.  Note that except for driver
domains, guests don't normally have a need to establish grant mappings,
i.e. they may be okay to run with "max_maptrack_frames=0" in their xl
domain configurations.

- From Xen 4.14 onwards it is also possible to alter the system wide upper
bound of the number of grant mappings Xen would allow guests to
establish by writing to the /params/gnttab_max_maptrack_frames
hypervisor file system node.  Note however that changing the value this
way will only affect guests yet to be created on the respective host.

CREDITS
=======

This issue was discovered by Andrew Cooper of Citrix.

RESOLUTION
==========

Applying the appropriate pair of attached patches resolves this issue.

Note that patches for released versions are generally prepared to
apply to the stable branches, and may not apply cleanly to the most
recent release tarball.  Downstreams are encouraged to update to the
tip of the stable branch before applying these patches.

To address an issue observed with some compiler versions, another patch
needs to be applied on top. This has been committed to the unstable
tree (commit hash b6da9d0414d69c2682214ee3ecf9816fcac500d0) only so far.
This patch is listed last below.

xsa380/xsa380-[12].patch        xen-unstable - Xen 4.15.x
xsa380/xsa380-4.14-?.patch      Xen 4.14.x
xsa380/xsa380-4.13-?.patch      Xen 4.13.x
xsa380/xsa380-4.12-?.patch      Xen 4.12.x
xsa380/xsa380-4.11-?.patch      Xen 4.11.x
xsa380/xsa380-3.patch           Xen 4.15.x - Xen 4.11.x

$ sha256sum xsa380* xsa380*/*
3b1938277665c195f6822a1170c50a853efa6bc3dcd6b2b0b9bbb0849a57bbf6  xsa380.meta
65411a0fd05d534ed2238b259aa2877331ac0e79f2dda80b424f34fffcce108a  xsa380/xsa380-1.patch
e6cd6d345abaad38e10d6f680fe881e874e35a7295199e5430bacd209f0b7304  xsa380/xsa380-2.patch
1ead53a28dedb0a502a700d9ea933e89e385c1582f4790f558a11d0d39e9d374  xsa380/xsa380-3.patch
f3b486aa99a75ab54f9e26e21a721a413f993b27dbc3e6f2fda976fe20ddbae3  xsa380/xsa380-4.11-1.patch
96a09c05ca87fe3590064ca6d269ca47b97c732401cb593ff1068ac91009f51a  xsa380/xsa380-4.11-2.patch
496063cd4641258e1854a77f626cdd86c866c3ed8603bdc2ff9ab709008c84a7  xsa380/xsa380-4.12-1.patch
d850e1263e89c7a718f2cddcfb639fe4a5095a1852fc35499ed16a4075d225e5  xsa380/xsa380-4.12-2.patch
c23d34527a2ec68015ad78cd90365e4d80bce842ce01eeaa8cd2246021a55693  xsa380/xsa380-4.13-1.patch
bd40ce749d02f343c79325488ac1348e1c9e88e698bbad351bdb0a0d3995f3e0  xsa380/xsa380-4.13-2.patch
9b06085f9a4b93c465563cd76fdc682ddb9dba968c214259425b234e7053a809  xsa380/xsa380-4.14-1.patch
bf6b53880abd53b56226080d1a32839c7c8c459867104697cd76eeafc2ea382a  xsa380/xsa380-4.14-2.patch
$

DEPLOYMENT DURING EMBARGO
=========================

Deployment of the patches and/or mitigations described above (or
others which are substantially similar) is permitted during the
embargo, even on public-facing systems with untrusted guest users and
administrators.  HOWEVER, care has to be taken to avoid restricting
guests too much, as them suddenly being unable to map grants they used
to be able to map may lead to re-discovery of the issue.

But: Distribution of updated software is prohibited (except to other
members of the predisclosure list).

Predisclosure list members who wish to deploy significantly different
patches and/or mitigations, please contact the Xen Project Security
Team.

(Note: this during-embargo deployment notice is retained in
post-embargo publicly released Xen Project advisories, even though it
is then no longer applicable.  This is to enable the community to have
oversight of the Xen Project Security Team's decisionmaking.)

For more information about permissible uses of embargoed information,
consult the Xen Project community's agreed Security Policy:
  http://www.xenproject.org/security-policy.html
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmEvSDUMHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZfboH/jtTLKbkb/K4SPliIDSRym1+wx17Ho30F9IRfd7E
ZdUBHfFLTEbZ/GvU+UkU914XeK49fCHZGylu2dcIRWgPmXFlqOtydFJny9a8sCjx
7/dLR1WXvhJ4eG8Hsma13uRzZuaF4JdXDp1M/QOr6aBHNTLBgMefbQNTAbFO0kzP
2xK3arGefi/eNa3GOEghrMlt2U/5r1h7PfLXBl69BETomOCOo5gyn6aQwqH/xKSY
W2cmqN0hYLyQSA+jx6QuTknsvSLS6NWWPEk/xfs31Op0E/P3tga0u6hCbRvWzOmE
AdISyyP8ogIBU4ekMn0H4yQoWKVgYctk3XAawBmogv+VJXI=
=7wyB
-----END PGP SIGNATURE-----

[-- Attachment #2: xsa380.meta --]
[-- Type: application/octet-stream, Size: 1880 bytes --]

{
  "XSA": 380,
  "SupportedVersions": [
    "master",
    "4.15",
    "4.14",
    "4.13",
    "4.12",
    "4.11"
  ],
  "Trees": [
    "xen"
  ],
  "Recipes": {
    "4.11": {
      "Recipes": {
        "xen": {
          "StableRef": "ef32c7afa2731b758226d6e10a1e489b1a15fc41",
          "Prereqs": [
            378,
            379
          ],
          "Patches": [
            "xsa380/xsa380-4.11-?.patch"
          ]
        }
      }
    },
    "4.12": {
      "Recipes": {
        "xen": {
          "StableRef": "ea20eee97e9e0861127a8070cc7b9ae3557b09fb",
          "Prereqs": [
            378,
            379
          ],
          "Patches": [
            "xsa380/xsa380-4.12-?.patch"
          ]
        }
      }
    },
    "4.13": {
      "Recipes": {
        "xen": {
          "StableRef": "32d580902b959000d79d51dff03a3560653c4fcb",
          "Prereqs": [
            378,
            379
          ],
          "Patches": [
            "xsa380/xsa380-4.13-?.patch"
          ]
        }
      }
    },
    "4.14": {
      "Recipes": {
        "xen": {
          "StableRef": "49299c4813b7847d29df07bf790f5489060f2a9c",
          "Prereqs": [
            378,
            379
          ],
          "Patches": [
            "xsa380/xsa380-4.14-?.patch"
          ]
        }
      }
    },
    "4.15": {
      "Recipes": {
        "xen": {
          "StableRef": "dba774896f7dd74773c14d537643b7d7477fefcd",
          "Prereqs": [
            378,
            379
          ],
          "Patches": [
            "xsa380/xsa380-?.patch"
          ]
        }
      }
    },
    "master": {
      "Recipes": {
        "xen": {
          "StableRef": "25da9455f1bb8a6d33039575a7b28bdfc4e3fcfe",
          "Prereqs": [
            378,
            379
          ],
          "Patches": [
            "xsa380/xsa380-?.patch"
          ]
        }
      }
    }
  }
}

[-- Attachment #3: xsa380/xsa380-1.patch --]
[-- Type: application/octet-stream, Size: 6390 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: gnttab: add preemption check to gnttab_release_mappings()

A guest may die with many grant mappings still in place, or simply with
a large maptrack table. Iterating through this may take more time than
is reasonable without intermediate preemption (to run softirqs and
perhaps the scheduler).

Move the invocation of the function to the section where other
restartable functions get invoked, and have the function itself check
for preemption every once in a while. Have it iterate the table
backwards, such that decreasing the maptrack limit is all it takes to
convey restart information.

In domain_teardown() introduce PROG_none such that inserting at the
front will be easier going forward.

This is part of CVE-2021-28698 / XSA-380.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
---
While I consider removal of the freeing of t->maptrack[i] from
grant_table_destroy() an integral part of this change, also freeing
t->maptrack right in gnttab_release_mappings() would seem like an
unrelated change to me, so I'm not moving that one for now. If others
think it would better be moved, I'd be happy to do so.

While in principle it would be nice to also eliminate the other loops
from grant_table_destroy() (which can all take long as well provided a
large enough max_grant_frames), ->maptrack[] really is special in that
it only gets accessed when processing requests by the domain itself. The
other arrays may all continue to be accessed as remote domains drop uses
of grants by the dying domain.
---
v3: Add comment.
v2: Move gnttab_release_mappings() invocation into domain_teardown().
    Don't crash when cleaning up domain without maptrack table. Extend
    comment next to maptrack_limit.

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -412,11 +412,18 @@ static int domain_teardown(struct domain
         v = d->teardown.vcpu
 
         enum {
-            PROG_vcpu_teardown = 1,
+            PROG_none,
+            PROG_gnttab_mappings,
+            PROG_vcpu_teardown,
             PROG_done,
         };
 
-    case 0:
+    case PROG_none:
+        rc = gnttab_release_mappings(d);
+        if ( rc )
+            return rc;
+
+    PROGRESS(gnttab_mappings):
         for_each_vcpu ( d, v )
         {
             PROGRESS_VCPU(teardown);
@@ -908,7 +915,6 @@ int domain_kill(struct domain *d)
             return domain_kill(d);
         d->is_dying = DOMDYING_dying;
         argo_destroy(d);
-        gnttab_release_mappings(d);
         vnuma_destroy(d->vnuma);
         domain_set_outstanding_pages(d, 0);
         /* fallthrough */
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -64,7 +64,13 @@ struct grant_table {
     unsigned int          nr_grant_frames;
     /* Number of grant status frames shared with guest (for version 2) */
     unsigned int          nr_status_frames;
-    /* Number of available maptrack entries. */
+    /*
+     * Number of available maptrack entries.  For cleanup purposes it is
+     * important to realize that this field and @maptrack further down will
+     * only ever be accessed by the local domain.  Thus it is okay to clean
+     * up early, and to shrink the limit for the purpose of tracking cleanup
+     * progress.
+     */
     unsigned int          maptrack_limit;
     /* Shared grant table (see include/public/grant_table.h). */
     union {
@@ -3679,9 +3685,7 @@ do_grant_table_op(
 #include "compat/grant_table.c"
 #endif
 
-void
-gnttab_release_mappings(
-    struct domain *d)
+int gnttab_release_mappings(struct domain *d)
 {
     struct grant_table   *gt = d->grant_table, *rgt;
     struct grant_mapping *map;
@@ -3695,8 +3699,32 @@ gnttab_release_mappings(
 
     BUG_ON(!d->is_dying);
 
-    for ( handle = 0; handle < gt->maptrack_limit; handle++ )
+    if ( !gt || !gt->maptrack )
+        return 0;
+
+    for ( handle = gt->maptrack_limit; handle; )
     {
+        /*
+         * Deal with full pages such that their freeing (in the body of the
+         * if()) remains simple.
+         */
+        if ( handle < gt->maptrack_limit && !(handle % MAPTRACK_PER_PAGE) )
+        {
+            /*
+             * Changing maptrack_limit alters nr_maptrack_frames()'es return
+             * value. Free the then excess trailing page right here, rather
+             * than leaving it to grant_table_destroy() (and in turn requiring
+             * to leave gt->maptrack_limit unaltered).
+             */
+            gt->maptrack_limit = handle;
+            FREE_XENHEAP_PAGE(gt->maptrack[nr_maptrack_frames(gt)]);
+
+            if ( hypercall_preempt_check() )
+                return -ERESTART;
+        }
+
+        --handle;
+
         map = &maptrack_entry(gt, handle);
         if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) )
             continue;
@@ -3780,6 +3808,11 @@ gnttab_release_mappings(
 
         map->flags = 0;
     }
+
+    gt->maptrack_limit = 0;
+    FREE_XENHEAP_PAGE(gt->maptrack[0]);
+
+    return 0;
 }
 
 void grant_table_warn_active_grants(struct domain *d)
@@ -3843,8 +3876,7 @@ grant_table_destroy(
         free_xenheap_page(t->shared_raw[i]);
     xfree(t->shared_raw);
 
-    for ( i = 0; i < nr_maptrack_frames(t); i++ )
-        free_xenheap_page(t->maptrack[i]);
+    ASSERT(!t->maptrack_limit);
     vfree(t->maptrack);
 
     for ( i = 0; i < nr_active_grant_frames(t); i++ )
--- a/xen/include/xen/grant_table.h
+++ b/xen/include/xen/grant_table.h
@@ -47,9 +47,7 @@ void grant_table_init_vcpu(struct vcpu *
 void grant_table_warn_active_grants(struct domain *d);
 
 /* Domain death release of granted mappings of other domains' memory. */
-void
-gnttab_release_mappings(
-    struct domain *d);
+int gnttab_release_mappings(struct domain *d);
 
 int mem_sharing_gref_to_gfn(struct grant_table *gt, grant_ref_t ref,
                             gfn_t *gfn, uint16_t *status);
@@ -80,7 +78,7 @@ static inline void grant_table_init_vcpu
 
 static inline void grant_table_warn_active_grants(struct domain *d) {}
 
-static inline void gnttab_release_mappings(struct domain *d) {}
+static inline int gnttab_release_mappings(struct domain *d) { return 0; }
 
 static inline int mem_sharing_gref_to_gfn(struct grant_table *gt,
                                           grant_ref_t ref,

[-- Attachment #4: xsa380/xsa380-2.patch --]
[-- Type: application/octet-stream, Size: 13903 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: gnttab: replace mapkind()

mapkind() doesn't scale very well with larger maptrack entry counts,
using a brute force linear search through all entries, with the only
option of an early loop exit if a matching writable entry was found.
Introduce a radix tree alongside the main maptrack table, thus
allowing much faster MFN-based lookup. To avoid the need to actually
allocate space for the individual nodes, encode the two counters in the
node pointers themselves, thus limiting the number of permitted
simultaneous r/o and r/w mappings of the same MFN to 2³¹-1 (64-bit) /
2¹⁵-1 (32-bit) each.

To avoid enforcing an unnecessarily low bound on the number of
simultaneous mappings of a single MFN, introduce
radix_tree_{ulong_to_ptr,ptr_to_ulong} paralleling
radix_tree_{int_to_ptr,ptr_to_int}.

As a consequence locking changes are also applicable: With there no
longer being any inspection of the remote domain's active entries,
there's also no need anymore to hold the remote domain's grant table
lock. And since we're no longer iterating over the local domain's map
track table, the lock in map_grant_ref() can also be dropped before the
new maptrack entry actually gets populated.

As a nice side effect this also reduces the number of IOMMU operations
in unmap_common(): Previously we would have "established" a readable
mapping whenever we didn't find a writable entry anymore (yet, of
course, at least one readable one). But we only need to do this if we
actually dropped the last writable entry, not if there were none already
before.

This is part of CVE-2021-28698 / XSA-380.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
---
I hope that limiting the map count to 32k on Arm32 is good enough. I
also hope it is out of question that 2G of mappings are enough on 64-bit
architectures.

I'm using the grant table lock for synchronization to limit differences
in behavior to prior code. I think in principle the maptrack lock could
be used equally well.

Shouldn't IOMMU insertions be limited anyway to GNTMAP_device_map
requests? This would further save on the number of radix tree nodes in
need of maintaining.

I'm hesitant to introduce GNTST_* in a security patch, but being able to
tell allocation failure or counter overflow from other errors might be
worthwhile.

I don't think adding anything to gnttab_usage_print() is useful:
radix_tree_gang_lookup() requires nodes to record their own indexes into
the tree, which we don't do to save space. Yet without indexes printing
node contents isn't very useful. Plus there's also no printing of the
main maptrack table contents.
---
v3: Check for radix_tree_lookup_slot() returning NULL. Convert -EEXIST
    to -EBUSY. Add comments. Re-base over comment addition in patch 1.
v2: New.

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -37,6 +37,7 @@
 #include <xen/iommu.h>
 #include <xen/paging.h>
 #include <xen/keyhandler.h>
+#include <xen/radix-tree.h>
 #include <xen/vmap.h>
 #include <xen/nospec.h>
 #include <xsm/xsm.h>
@@ -82,8 +83,13 @@ struct grant_table {
     grant_status_t       **status;
     /* Active grant table. */
     struct active_grant_entry **active;
-    /* Mapping tracking table per vcpu. */
+    /* Handle-indexed tracking table of mappings. */
     struct grant_mapping **maptrack;
+    /*
+     * MFN-indexed tracking tree of mappings, if needed.  Note that this is
+     * protected by @lock, not @maptrack_lock.
+     */
+    struct radix_tree_root maptrack_tree;
 
     /* Domain to which this struct grant_table belongs. */
     const struct domain *domain;
@@ -516,34 +522,6 @@ static int get_paged_frame(unsigned long
     return GNTST_okay;
 }
 
-static inline void
-double_gt_lock(struct grant_table *lgt, struct grant_table *rgt)
-{
-    /*
-     * See mapkind() for why the write lock is also required for the
-     * remote domain.
-     */
-    if ( lgt < rgt )
-    {
-        grant_write_lock(lgt);
-        grant_write_lock(rgt);
-    }
-    else
-    {
-        if ( lgt != rgt )
-            grant_write_lock(rgt);
-        grant_write_lock(lgt);
-    }
-}
-
-static inline void
-double_gt_unlock(struct grant_table *lgt, struct grant_table *rgt)
-{
-    grant_write_unlock(lgt);
-    if ( lgt != rgt )
-        grant_write_unlock(rgt);
-}
-
 #define INVALID_MAPTRACK_HANDLE UINT_MAX
 
 static inline grant_handle_t
@@ -970,41 +948,17 @@ static struct active_grant_entry *grant_
     return ERR_PTR(-EINVAL);
 }
 
-#define MAPKIND_READ 1
-#define MAPKIND_WRITE 2
-static unsigned int mapkind(
-    struct grant_table *lgt, const struct domain *rd, mfn_t mfn)
-{
-    struct grant_mapping *map;
-    grant_handle_t handle, limit = lgt->maptrack_limit;
-    unsigned int kind = 0;
-
-    /*
-     * Must have the local domain's grant table write lock when
-     * iterating over its maptrack entries.
-     */
-    ASSERT(percpu_rw_is_write_locked(&lgt->lock));
-    /*
-     * Must have the remote domain's grant table write lock while
-     * counting its active entries.
-     */
-    ASSERT(percpu_rw_is_write_locked(&rd->grant_table->lock));
-
-    smp_rmb();
-
-    for ( handle = 0; !(kind & MAPKIND_WRITE) && handle < limit; handle++ )
-    {
-        map = &maptrack_entry(lgt, handle);
-        if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) ||
-             map->domid != rd->domain_id )
-            continue;
-        if ( mfn_eq(_active_entry(rd->grant_table, map->ref).mfn, mfn) )
-            kind |= map->flags & GNTMAP_readonly ?
-                    MAPKIND_READ : MAPKIND_WRITE;
-    }
-
-    return kind;
-}
+union maptrack_node {
+    struct {
+        /* Radix tree slot pointers use two of the bits. */
+#ifdef __BIG_ENDIAN_BITFIELD
+        unsigned long    : 2;
+#endif
+        unsigned long rd : BITS_PER_LONG / 2 - 1;
+        unsigned long wr : BITS_PER_LONG / 2 - 1;
+    } cnt;
+    unsigned long raw;
+};
 
 static void
 map_grant_ref(
@@ -1023,7 +977,6 @@ map_grant_ref(
     struct grant_mapping *mt;
     grant_entry_header_t *shah;
     uint16_t *status;
-    bool_t need_iommu;
 
     ld = current->domain;
 
@@ -1244,31 +1197,75 @@ map_grant_ref(
      * as mem-sharing and IOMMU use are incompatible). The dom_io case would
      * need checking separately if we compared against owner here.
      */
-    need_iommu = ld != rd && gnttab_need_iommu_mapping(ld);
-    if ( need_iommu )
+    if ( ld != rd && gnttab_need_iommu_mapping(ld) )
     {
+        union maptrack_node node = {
+            .cnt.rd = !!(op->flags & GNTMAP_readonly),
+            .cnt.wr = !(op->flags & GNTMAP_readonly),
+        };
+        int err;
+        void **slot = NULL;
         unsigned int kind;
 
-        double_gt_lock(lgt, rgt);
+        grant_write_lock(lgt);
+
+        err = radix_tree_insert(&lgt->maptrack_tree, mfn_x(mfn),
+                                radix_tree_ulong_to_ptr(node.raw));
+        if ( err == -EEXIST )
+        {
+            slot = radix_tree_lookup_slot(&lgt->maptrack_tree, mfn_x(mfn));
+            if ( likely(slot) )
+            {
+                node.raw = radix_tree_ptr_to_ulong(*slot);
+                err = -EBUSY;
+
+                /* Update node only when refcount doesn't overflow. */
+                if ( op->flags & GNTMAP_readonly ? ++node.cnt.rd
+                                                 : ++node.cnt.wr )
+                {
+                    radix_tree_replace_slot(slot,
+                                            radix_tree_ulong_to_ptr(node.raw));
+                    err = 0;
+                }
+            }
+            else
+                ASSERT_UNREACHABLE();
+        }
 
         /*
          * We're not translated, so we know that dfns and mfns are
          * the same things, so the IOMMU entry is always 1-to-1.
          */
-        kind = mapkind(lgt, rd, mfn);
-        if ( !(op->flags & GNTMAP_readonly) &&
-             !(kind & MAPKIND_WRITE) )
+        if ( !(op->flags & GNTMAP_readonly) && node.cnt.wr == 1 )
             kind = IOMMUF_readable | IOMMUF_writable;
-        else if ( !kind )
+        else if ( (op->flags & GNTMAP_readonly) &&
+                  node.cnt.rd == 1 && !node.cnt.wr )
             kind = IOMMUF_readable;
         else
             kind = 0;
-        if ( kind && iommu_legacy_map(ld, _dfn(mfn_x(mfn)), mfn, 1, kind) )
+        if ( err ||
+             (kind && iommu_legacy_map(ld, _dfn(mfn_x(mfn)), mfn, 1, kind)) )
         {
-            double_gt_unlock(lgt, rgt);
+            if ( !err )
+            {
+                if ( slot )
+                {
+                    op->flags & GNTMAP_readonly ? node.cnt.rd--
+                                                : node.cnt.wr--;
+                    radix_tree_replace_slot(slot,
+                                            radix_tree_ulong_to_ptr(node.raw));
+                }
+                else
+                    radix_tree_delete(&lgt->maptrack_tree, mfn_x(mfn));
+            }
+
             rc = GNTST_general_error;
-            goto undo_out;
         }
+
+        grant_write_unlock(lgt);
+
+        if ( rc != GNTST_okay )
+            goto undo_out;
     }
 
     TRACE_1D(TRC_MEM_PAGE_GRANT_MAP, op->dom);
@@ -1276,10 +1273,6 @@ map_grant_ref(
     /*
      * All maptrack entry users check mt->flags first before using the
      * other fields so just ensure the flags field is stored last.
-     *
-     * However, if gnttab_need_iommu_mapping() then this would race
-     * with a concurrent mapkind() call (on an unmap, for example)
-     * and a lock is required.
      */
     mt = &maptrack_entry(lgt, handle);
     mt->domid = op->dom;
@@ -1287,9 +1280,6 @@ map_grant_ref(
     smp_wmb();
     write_atomic(&mt->flags, op->flags);
 
-    if ( need_iommu )
-        double_gt_unlock(lgt, rgt);
-
     op->dev_bus_addr = mfn_to_maddr(mfn);
     op->handle       = handle;
     op->status       = GNTST_okay;
@@ -1497,19 +1487,34 @@ unmap_common(
     /* See the respective comment in map_grant_ref(). */
     if ( rc == GNTST_okay && ld != rd && gnttab_need_iommu_mapping(ld) )
     {
-        unsigned int kind;
+        void **slot;
+        union maptrack_node node;
         int err = 0;
 
-        double_gt_lock(lgt, rgt);
+        grant_write_lock(lgt);
+        slot = radix_tree_lookup_slot(&lgt->maptrack_tree, mfn_x(op->mfn));
+        node.raw = likely(slot) ? radix_tree_ptr_to_ulong(*slot) : 0;
+
+        /* Refcount must not underflow. */
+        if ( !(flags & GNTMAP_readonly ? node.cnt.rd--
+                                       : node.cnt.wr--) )
+            BUG();
 
-        kind = mapkind(lgt, rd, op->mfn);
-        if ( !kind )
+        if ( !node.raw )
             err = iommu_legacy_unmap(ld, _dfn(mfn_x(op->mfn)), 1);
-        else if ( !(kind & MAPKIND_WRITE) )
+        else if ( !(flags & GNTMAP_readonly) && !node.cnt.wr )
             err = iommu_legacy_map(ld, _dfn(mfn_x(op->mfn)), op->mfn, 1,
                                    IOMMUF_readable);
 
-        double_gt_unlock(lgt, rgt);
+        if ( err )
+            ;
+        else if ( !node.raw )
+            radix_tree_delete(&lgt->maptrack_tree, mfn_x(op->mfn));
+        else
+            radix_tree_replace_slot(slot,
+                                    radix_tree_ulong_to_ptr(node.raw));
+
+        grant_write_unlock(lgt);
 
         if ( err )
             rc = GNTST_general_error;
@@ -1956,6 +1961,8 @@ int grant_table_init(struct domain *d, i
         gt->maptrack = vzalloc(gt->max_maptrack_frames * sizeof(*gt->maptrack));
         if ( gt->maptrack == NULL )
             goto out;
+
+        radix_tree_init(&gt->maptrack_tree);
     }
 
     /* Shared grant table. */
@@ -3704,6 +3711,8 @@ int gnttab_release_mappings(struct domai
 
     for ( handle = gt->maptrack_limit; handle; )
     {
+        mfn_t mfn;
+
         /*
          * Deal with full pages such that their freeing (in the body of the
          * if()) remains simple.
@@ -3801,17 +3810,31 @@ int gnttab_release_mappings(struct domai
 
         reduce_status_for_pin(rd, act, status, map->flags & GNTMAP_readonly);
 
+        mfn = act->mfn;
+
         active_entry_release(act);
         grant_read_unlock(rgt);
 
         rcu_unlock_domain(rd);
 
         map->flags = 0;
+
+        /*
+         * This is excessive in that a single such call would suffice per
+         * mapped MFN (or none at all, if no entry was ever inserted). But it
+         * should be the common case for an MFN to be mapped just once, and
+         * this way we don't need to further maintain the counters. We also
+         * don't want to leave cleaning up of the tree as a whole to the end
+         * of the function, as this could take quite some time.
+         */
+        radix_tree_delete(&gt->maptrack_tree, mfn_x(mfn));
     }
 
     gt->maptrack_limit = 0;
     FREE_XENHEAP_PAGE(gt->maptrack[0]);
 
+    radix_tree_destroy(&gt->maptrack_tree, NULL);
+
     return 0;
 }
 
--- a/xen/include/xen/radix-tree.h
+++ b/xen/include/xen/radix-tree.h
@@ -190,6 +190,25 @@ static inline int radix_tree_ptr_to_int(
     return (int)((long)ptr >> 2);
 }
 
+/**
+ * radix_tree_{ulong_to_ptr,ptr_to_ulong}:
+ *
+ * Same for unsigned long values. Beware though that only BITS_PER_LONG-2
+ * bits are actually usable for the value.
+ */
+static inline void *radix_tree_ulong_to_ptr(unsigned long val)
+{
+    unsigned long ptr = (val << 2) | 0x2;
+    ASSERT((ptr >> 2) == val);
+    return (void *)ptr;
+}
+
+static inline unsigned long radix_tree_ptr_to_ulong(void *ptr)
+{
+    ASSERT(((unsigned long)ptr & 0x3) == 0x2);
+    return (unsigned long)ptr >> 2;
+}
+
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);

[-- Attachment #5: xsa380/xsa380-3.patch --]
[-- Type: application/octet-stream, Size: 3000 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: gnttab: avoid triggering assertion in radix_tree_ulong_to_ptr()

Relevant quotes from the C11 standard:

"Except where explicitly stated otherwise, for the purposes of this
 subclause unnamed members of objects of structure and union type do not
 participate in initialization. Unnamed members of structure objects
 have indeterminate value even after initialization."

"If there are fewer initializers in a brace-enclosed list than there are
 elements or members of an aggregate, [...], the remainder of the
 aggregate shall be initialized implicitly the same as objects that have
 static storage duration."

"If an object that has static or thread storage duration is not
 initialized explicitly, then:
 [...]
 — if it is an aggregate, every member is initialized (recursively)
   according to these rules, and any padding is initialized to zero
   bits;
 [...]"

"A bit-field declaration with no declarator, but only a colon and a
 width, indicates an unnamed bit-field." Footnote: "An unnamed bit-field
 structure member is useful for padding to conform to externally imposed
 layouts."

"There may be unnamed padding within a structure object, but not at its
 beginning."

Which makes me conclude:
- Whether an unnamed bit-field member is an unnamed member or padding is
  unclear, and hence also whether the last quote above would render the
  big endian case of the structure declaration invalid.
- Whether the number of members of an aggregate includes unnamed ones is
  also not really clear.
- The initializer in map_grant_ref() initializes all fields of the "cnt"
  sub-structure of the union, so assuming the second quote above applies
  here (indirectly), the compiler isn't required to implicitly
  initialize the rest (i.e. in particular any padding) like would happen
  for static storage duration objects.

Gcc 7.4.1 can be observed (apparently in debug builds only) to translate
aforementioned initializer to a read-modify-write operation of a stack
variable, leaving unchanged the top two bits of whatever was previously
in that stack slot. Clearly if either of the two bits were set,
radix_tree_ulong_to_ptr()'s assertion would trigger.

Therefore, to be on the safe side, add an explicit padding field for the
non-big-endian-bitfields case and give a dummy name to both padding
fields.

Fixes: 9781b51efde2 ("gnttab: replace mapkind()")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -952,10 +952,13 @@ union maptrack_node {
     struct {
         /* Radix tree slot pointers use two of the bits. */
 #ifdef __BIG_ENDIAN_BITFIELD
-        unsigned long    : 2;
+        unsigned long _0 : 2;
 #endif
         unsigned long rd : BITS_PER_LONG / 2 - 1;
         unsigned long wr : BITS_PER_LONG / 2 - 1;
+#ifndef __BIG_ENDIAN_BITFIELD
+        unsigned long _0 : 2;
+#endif
     } cnt;
     unsigned long raw;
 };

[-- Attachment #6: xsa380/xsa380-4.11-1.patch --]
[-- Type: application/octet-stream, Size: 4812 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: gnttab: add preemption check to gnttab_release_mappings()

A guest may die with many grant mappings still in place, or simply with
a large maptrack table. Iterating through this may take more time than
is reasonable without intermediate preemption (to run softirqs and
perhaps the scheduler).

Move the invocation of the function to the section where other
restartable functions get invoked, and have the function itself check
for preemption every once in a while. Have it iterate the table
backwards, such that decreasing the maptrack limit is all it takes to
convey restart information.

In domain_teardown() introduce PROG_none such that inserting at the
front will be easier going forward.

This is part of CVE-2021-28698 / XSA-380.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -646,13 +646,15 @@ int domain_kill(struct domain *d)
         if ( d->is_dying != DOMDYING_alive )
             return domain_kill(d);
         d->is_dying = DOMDYING_dying;
-        gnttab_release_mappings(d);
         tmem_destroy(d->tmem_client);
         vnuma_destroy(d->vnuma);
         domain_set_outstanding_pages(d, 0);
         d->tmem_client = NULL;
         /* fallthrough */
     case DOMDYING_dying:
+        rc = gnttab_release_mappings(d);
+        if ( rc )
+            break;
         rc = evtchn_destroy(d);
         if ( rc )
             break;
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -62,7 +62,13 @@ struct grant_table {
     unsigned int          nr_grant_frames;
     /* Number of grant status frames shared with guest (for version 2) */
     unsigned int          nr_status_frames;
-    /* Number of available maptrack entries. */
+    /*
+     * Number of available maptrack entries.  For cleanup purposes it is
+     * important to realize that this field and @maptrack further down will
+     * only ever be accessed by the local domain.  Thus it is okay to clean
+     * up early, and to shrink the limit for the purpose of tracking cleanup
+     * progress.
+     */
     unsigned int          maptrack_limit;
     /* Shared grant table (see include/public/grant_table.h). */
     union {
@@ -3618,9 +3624,7 @@ grant_table_create(
     return ret;
 }
 
-void
-gnttab_release_mappings(
-    struct domain *d)
+int gnttab_release_mappings(struct domain *d)
 {
     struct grant_table   *gt = d->grant_table, *rgt;
     struct grant_mapping *map;
@@ -3634,8 +3638,32 @@ gnttab_release_mappings(
 
     BUG_ON(!d->is_dying);
 
-    for ( handle = 0; handle < gt->maptrack_limit; handle++ )
+    if ( !gt || !gt->maptrack )
+        return 0;
+
+    for ( handle = gt->maptrack_limit; handle; )
     {
+        /*
+         * Deal with full pages such that their freeing (in the body of the
+         * if()) remains simple.
+         */
+        if ( handle < gt->maptrack_limit && !(handle % MAPTRACK_PER_PAGE) )
+        {
+            /*
+             * Changing maptrack_limit alters nr_maptrack_frames()'es return
+             * value. Free the then excess trailing page right here, rather
+             * than leaving it to grant_table_destroy() (and in turn requiring
+             * to leave gt->maptrack_limit unaltered).
+             */
+            gt->maptrack_limit = handle;
+            FREE_XENHEAP_PAGE(gt->maptrack[nr_maptrack_frames(gt)]);
+
+            if ( hypercall_preempt_check() )
+                return -ERESTART;
+        }
+
+        --handle;
+
         map = &maptrack_entry(gt, handle);
         if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) )
             continue;
@@ -3723,6 +3751,11 @@ gnttab_release_mappings(
 
         map->flags = 0;
     }
+
+    gt->maptrack_limit = 0;
+    FREE_XENHEAP_PAGE(gt->maptrack[0]);
+
+    return 0;
 }
 
 void grant_table_warn_active_grants(struct domain *d)
@@ -3785,8 +3818,7 @@ grant_table_destroy(
         free_xenheap_page(t->shared_raw[i]);
     xfree(t->shared_raw);
 
-    for ( i = 0; i < nr_maptrack_frames(t); i++ )
-        free_xenheap_page(t->maptrack[i]);
+    ASSERT(!t->maptrack_limit);
     vfree(t->maptrack);
 
     for ( i = 0; i < nr_active_grant_frames(t); i++ )
--- a/xen/include/xen/grant_table.h
+++ b/xen/include/xen/grant_table.h
@@ -46,9 +46,7 @@ int grant_table_set_limits(struct domain
 void grant_table_warn_active_grants(struct domain *d);
 
 /* Domain death release of granted mappings of other domains' memory. */
-void
-gnttab_release_mappings(
-    struct domain *d);
+int gnttab_release_mappings(struct domain *d);
 
 int mem_sharing_gref_to_gfn(struct grant_table *gt, grant_ref_t ref,
                             gfn_t *gfn, uint16_t *status);

[-- Attachment #7: xsa380/xsa380-4.11-2.patch --]
[-- Type: application/octet-stream, Size: 12637 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: gnttab: replace mapkind()

mapkind() doesn't scale very well with larger maptrack entry counts,
using a brute force linear search through all entries, with the only
option of an early loop exit if a matching writable entry was found.
Introduce a radix tree alongside the main maptrack table, thus
allowing much faster MFN-based lookup. To avoid the need to actually
allocate space for the individual nodes, encode the two counters in the
node pointers themselves, thus limiting the number of permitted
simultaneous r/o and r/w mappings of the same MFN to 2³¹-1 (64-bit) /
2¹⁵-1 (32-bit) each.

To avoid enforcing an unnecessarily low bound on the number of
simultaneous mappings of a single MFN, introduce
radix_tree_{ulong_to_ptr,ptr_to_ulong} paralleling
radix_tree_{int_to_ptr,ptr_to_int}.

As a consequence locking changes are also applicable: With there no
longer being any inspection of the remote domain's active entries,
there's also no need anymore to hold the remote domain's grant table
lock. And since we're no longer iterating over the local domain's map
track table, the lock in map_grant_ref() can also be dropped before the
new maptrack entry actually gets populated.

As a nice side effect this also reduces the number of IOMMU operations
in unmap_common(): Previously we would have "established" a readable
mapping whenever we didn't find a writable entry anymore (yet, of
course, at least one readable one). But we only need to do this if we
actually dropped the last writable entry, not if there were none already
before.

This is part of CVE-2021-28698 / XSA-380.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -36,6 +36,7 @@
 #include <xen/iommu.h>
 #include <xen/paging.h>
 #include <xen/keyhandler.h>
+#include <xen/radix-tree.h>
 #include <xen/vmap.h>
 #include <xsm/xsm.h>
 #include <asm/flushtlb.h>
@@ -80,8 +81,13 @@ struct grant_table {
     grant_status_t       **status;
     /* Active grant table. */
     struct active_grant_entry **active;
-    /* Mapping tracking table per vcpu. */
+    /* Handle-indexed tracking table of mappings. */
     struct grant_mapping **maptrack;
+    /*
+     * MFN-indexed tracking tree of mappings, if needed.  Note that this is
+     * protected by @lock, not @maptrack_lock.
+     */
+    struct radix_tree_root maptrack_tree;
 
     /* Domain to which this struct grant_table belongs. */
     const struct domain *domain;
@@ -421,34 +427,6 @@ static int get_paged_frame(unsigned long
     return rc;
 }
 
-static inline void
-double_gt_lock(struct grant_table *lgt, struct grant_table *rgt)
-{
-    /*
-     * See mapkind() for why the write lock is also required for the
-     * remote domain.
-     */
-    if ( lgt < rgt )
-    {
-        grant_write_lock(lgt);
-        grant_write_lock(rgt);
-    }
-    else
-    {
-        if ( lgt != rgt )
-            grant_write_lock(rgt);
-        grant_write_lock(lgt);
-    }
-}
-
-static inline void
-double_gt_unlock(struct grant_table *lgt, struct grant_table *rgt)
-{
-    grant_write_unlock(lgt);
-    if ( lgt != rgt )
-        grant_write_unlock(rgt);
-}
-
 #define INVALID_MAPTRACK_HANDLE UINT_MAX
 
 static inline grant_handle_t
@@ -871,41 +849,17 @@ static struct active_grant_entry *grant_
     return ERR_PTR(-EINVAL);
 }
 
-#define MAPKIND_READ 1
-#define MAPKIND_WRITE 2
-static unsigned int mapkind(
-    struct grant_table *lgt, const struct domain *rd, mfn_t mfn)
-{
-    struct grant_mapping *map;
-    grant_handle_t handle, limit = lgt->maptrack_limit;
-    unsigned int kind = 0;
-
-    /*
-     * Must have the local domain's grant table write lock when
-     * iterating over its maptrack entries.
-     */
-    ASSERT(percpu_rw_is_write_locked(&lgt->lock));
-    /*
-     * Must have the remote domain's grant table write lock while
-     * counting its active entries.
-     */
-    ASSERT(percpu_rw_is_write_locked(&rd->grant_table->lock));
-
-    smp_rmb();
-
-    for ( handle = 0; !(kind & MAPKIND_WRITE) && handle < limit; handle++ )
-    {
-        map = &maptrack_entry(lgt, handle);
-        if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) ||
-             map->domid != rd->domain_id )
-            continue;
-        if ( mfn_eq(_active_entry(rd->grant_table, map->ref).frame, mfn) )
-            kind |= map->flags & GNTMAP_readonly ?
-                    MAPKIND_READ : MAPKIND_WRITE;
-    }
-
-    return kind;
-}
+union maptrack_node {
+    struct {
+        /* Radix tree slot pointers use two of the bits. */
+#ifdef __BIG_ENDIAN_BITFIELD
+        unsigned long    : 2;
+#endif
+        unsigned long rd : BITS_PER_LONG / 2 - 1;
+        unsigned long wr : BITS_PER_LONG / 2 - 1;
+    } cnt;
+    unsigned long raw;
+};
 
 /*
  * Returns 0 if TLB flush / invalidate required by caller.
@@ -931,7 +885,6 @@ map_grant_ref(
     struct grant_mapping *mt;
     grant_entry_header_t *shah;
     uint16_t *status;
-    bool_t need_iommu;
 
     led = current;
     ld = led->domain;
@@ -1139,31 +1092,75 @@ map_grant_ref(
         goto undo_out;
     }
 
-    need_iommu = gnttab_need_iommu_mapping(ld);
-    if ( need_iommu )
+    if ( gnttab_need_iommu_mapping(ld) )
     {
+        union maptrack_node node = {
+            .cnt.rd = !!(op->flags & GNTMAP_readonly),
+            .cnt.wr = !(op->flags & GNTMAP_readonly),
+        };
+        int err;
+        void **slot = NULL;
         unsigned int kind;
 
-        double_gt_lock(lgt, rgt);
+        grant_write_lock(lgt);
+
+        err = radix_tree_insert(&lgt->maptrack_tree, mfn_x(frame),
+                                radix_tree_ulong_to_ptr(node.raw));
+        if ( err == -EEXIST )
+        {
+            slot = radix_tree_lookup_slot(&lgt->maptrack_tree, mfn_x(frame));
+            if ( likely(slot) )
+            {
+                node.raw = radix_tree_ptr_to_ulong(*slot);
+                err = -EBUSY;
+
+                /* Update node only when refcount doesn't overflow. */
+                if ( op->flags & GNTMAP_readonly ? ++node.cnt.rd
+                                                 : ++node.cnt.wr )
+                {
+                    radix_tree_replace_slot(slot,
+                                            radix_tree_ulong_to_ptr(node.raw));
+                    err = 0;
+                }
+            }
+            else
+                ASSERT_UNREACHABLE();
+        }
 
         /*
          * We're not translated, so we know that dfns and mfns are
          * the same things, so the IOMMU entry is always 1-to-1.
          */
-        kind = mapkind(lgt, rd, frame);
-        if ( !(op->flags & GNTMAP_readonly) &&
-             !(kind & MAPKIND_WRITE) )
+        if ( !(op->flags & GNTMAP_readonly) && node.cnt.wr == 1 )
             kind = IOMMUF_readable | IOMMUF_writable;
-        else if ( !kind )
+        else if ( (op->flags & GNTMAP_readonly) &&
+                  node.cnt.rd == 1 && !node.cnt.wr )
             kind = IOMMUF_readable;
         else
             kind = 0;
-        if ( kind && iommu_map_page(ld, mfn_x(frame), mfn_x(frame), kind) )
+        if ( err ||
+             (kind && iommu_map_page(ld, mfn_x(frame), mfn_x(frame), kind)) )
         {
-            double_gt_unlock(lgt, rgt);
+            if ( !err )
+            {
+                if ( slot )
+                {
+                    op->flags & GNTMAP_readonly ? node.cnt.rd--
+                                                : node.cnt.wr--;
+                    radix_tree_replace_slot(slot,
+                                            radix_tree_ulong_to_ptr(node.raw));
+                }
+                else
+                    radix_tree_delete(&lgt->maptrack_tree, mfn_x(frame));
+            }
+
             rc = GNTST_general_error;
-            goto undo_out;
         }
+
+        grant_write_unlock(lgt);
+
+        if ( rc != GNTST_okay )
+            goto undo_out;
     }
 
     TRACE_1D(TRC_MEM_PAGE_GRANT_MAP, op->dom);
@@ -1171,10 +1168,6 @@ map_grant_ref(
     /*
      * All maptrack entry users check mt->flags first before using the
      * other fields so just ensure the flags field is stored last.
-     *
-     * However, if gnttab_need_iommu_mapping() then this would race
-     * with a concurrent mapkind() call (on an unmap, for example)
-     * and a lock is required.
      */
     mt = &maptrack_entry(lgt, handle);
     mt->domid = op->dom;
@@ -1182,9 +1175,6 @@ map_grant_ref(
     smp_wmb();
     write_atomic(&mt->flags, op->flags);
 
-    if ( need_iommu )
-        double_gt_unlock(lgt, rgt);
-
     op->dev_bus_addr = mfn_to_maddr(frame);
     op->handle       = handle;
     op->status       = GNTST_okay;
@@ -1411,19 +1401,34 @@ unmap_common(
 
     if ( rc == GNTST_okay && gnttab_need_iommu_mapping(ld) )
     {
-        unsigned int kind;
+        void **slot;
+        union maptrack_node node;
         int err = 0;
 
-        double_gt_lock(lgt, rgt);
+        grant_write_lock(lgt);
+        slot = radix_tree_lookup_slot(&lgt->maptrack_tree, mfn_x(op->frame));
+        node.raw = likely(slot) ? radix_tree_ptr_to_ulong(*slot) : 0;
+
+        /* Refcount must not underflow. */
+        if ( !(flags & GNTMAP_readonly ? node.cnt.rd--
+                                       : node.cnt.wr--) )
+            BUG();
 
-        kind = mapkind(lgt, rd, op->frame);
-        if ( !kind )
+        if ( !node.raw )
             err = iommu_unmap_page(ld, mfn_x(op->frame));
-        else if ( !(kind & MAPKIND_WRITE) )
+        else if ( !(flags & GNTMAP_readonly) && !node.cnt.wr )
             err = iommu_map_page(ld, mfn_x(op->frame),
                                  mfn_x(op->frame), IOMMUF_readable);
 
-        double_gt_unlock(lgt, rgt);
+        if ( err )
+            ;
+        else if ( !node.raw )
+            radix_tree_delete(&lgt->maptrack_tree, mfn_x(op->frame));
+        else
+            radix_tree_replace_slot(slot,
+                                    radix_tree_ulong_to_ptr(node.raw));
+
+        grant_write_unlock(lgt);
 
         if ( err )
             rc = GNTST_general_error;
@@ -1854,6 +1859,8 @@ grant_table_init(struct domain *d, struc
         gt->maptrack = vzalloc(gt->max_maptrack_frames * sizeof(*gt->maptrack));
         if ( gt->maptrack == NULL )
             goto out;
+
+        radix_tree_init(&gt->maptrack_tree);
     }
 
     /* Shared grant table. */
@@ -3643,6 +3650,8 @@ int gnttab_release_mappings(struct domai
 
     for ( handle = gt->maptrack_limit; handle; )
     {
+        mfn_t mfn;
+
         /*
          * Deal with full pages such that their freeing (in the body of the
          * if()) remains simple.
@@ -3744,17 +3753,31 @@ int gnttab_release_mappings(struct domai
         if ( act->pin == 0 )
             gnttab_clear_flag(rd, _GTF_reading, status);
 
+        mfn = act->frame;
+
         active_entry_release(act);
         grant_read_unlock(rgt);
 
         rcu_unlock_domain(rd);
 
         map->flags = 0;
+
+        /*
+         * This is excessive in that a single such call would suffice per
+         * mapped MFN (or none at all, if no entry was ever inserted). But it
+         * should be the common case for an MFN to be mapped just once, and
+         * this way we don't need to further maintain the counters. We also
+         * don't want to leave cleaning up of the tree as a whole to the end
+         * of the function, as this could take quite some time.
+         */
+        radix_tree_delete(&gt->maptrack_tree, mfn_x(mfn));
     }
 
     gt->maptrack_limit = 0;
     FREE_XENHEAP_PAGE(gt->maptrack[0]);
 
+    radix_tree_destroy(&gt->maptrack_tree, NULL);
+
     return 0;
 }
 
--- a/xen/include/xen/radix-tree.h
+++ b/xen/include/xen/radix-tree.h
@@ -190,6 +190,25 @@ static inline int radix_tree_ptr_to_int(
     return (int)((long)ptr >> 2);
 }
 
+/**
+ * radix_tree_{ulong_to_ptr,ptr_to_ulong}:
+ *
+ * Same for unsigned long values. Beware though that only BITS_PER_LONG-2
+ * bits are actually usable for the value.
+ */
+static inline void *radix_tree_ulong_to_ptr(unsigned long val)
+{
+    unsigned long ptr = (val << 2) | 0x2;
+    ASSERT((ptr >> 2) == val);
+    return (void *)ptr;
+}
+
+static inline unsigned long radix_tree_ptr_to_ulong(void *ptr)
+{
+    ASSERT(((unsigned long)ptr & 0x3) == 0x2);
+    return (unsigned long)ptr >> 2;
+}
+
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);

[-- Attachment #8: xsa380/xsa380-4.12-1.patch --]
[-- Type: application/octet-stream, Size: 4812 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: gnttab: add preemption check to gnttab_release_mappings()

A guest may die with many grant mappings still in place, or simply with
a large maptrack table. Iterating through this may take more time than
is reasonable without intermediate preemption (to run softirqs and
perhaps the scheduler).

Move the invocation of the function to the section where other
restartable functions get invoked, and have the function itself check
for preemption every once in a while. Have it iterate the table
backwards, such that decreasing the maptrack limit is all it takes to
convey restart information.

In domain_teardown() introduce PROG_none such that inserting at the
front will be easier going forward.

This is part of CVE-2021-28698 / XSA-380.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -724,13 +724,15 @@ int domain_kill(struct domain *d)
             return domain_kill(d);
         d->is_dying = DOMDYING_dying;
         argo_destroy(d);
-        gnttab_release_mappings(d);
         tmem_destroy(d->tmem_client);
         vnuma_destroy(d->vnuma);
         domain_set_outstanding_pages(d, 0);
         d->tmem_client = NULL;
         /* fallthrough */
     case DOMDYING_dying:
+        rc = gnttab_release_mappings(d);
+        if ( rc )
+            break;
         rc = evtchn_destroy(d);
         if ( rc )
             break;
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -62,7 +62,13 @@ struct grant_table {
     unsigned int          nr_grant_frames;
     /* Number of grant status frames shared with guest (for version 2) */
     unsigned int          nr_status_frames;
-    /* Number of available maptrack entries. */
+    /*
+     * Number of available maptrack entries.  For cleanup purposes it is
+     * important to realize that this field and @maptrack further down will
+     * only ever be accessed by the local domain.  Thus it is okay to clean
+     * up early, and to shrink the limit for the purpose of tracking cleanup
+     * progress.
+     */
     unsigned int          maptrack_limit;
     /* Shared grant table (see include/public/grant_table.h). */
     union {
@@ -3604,9 +3610,7 @@ do_grant_table_op(
 #include "compat/grant_table.c"
 #endif
 
-void
-gnttab_release_mappings(
-    struct domain *d)
+int gnttab_release_mappings(struct domain *d)
 {
     struct grant_table   *gt = d->grant_table, *rgt;
     struct grant_mapping *map;
@@ -3620,8 +3624,32 @@ gnttab_release_mappings(
 
     BUG_ON(!d->is_dying);
 
-    for ( handle = 0; handle < gt->maptrack_limit; handle++ )
+    if ( !gt || !gt->maptrack )
+        return 0;
+
+    for ( handle = gt->maptrack_limit; handle; )
     {
+        /*
+         * Deal with full pages such that their freeing (in the body of the
+         * if()) remains simple.
+         */
+        if ( handle < gt->maptrack_limit && !(handle % MAPTRACK_PER_PAGE) )
+        {
+            /*
+             * Changing maptrack_limit alters nr_maptrack_frames()'es return
+             * value. Free the then excess trailing page right here, rather
+             * than leaving it to grant_table_destroy() (and in turn requiring
+             * to leave gt->maptrack_limit unaltered).
+             */
+            gt->maptrack_limit = handle;
+            FREE_XENHEAP_PAGE(gt->maptrack[nr_maptrack_frames(gt)]);
+
+            if ( hypercall_preempt_check() )
+                return -ERESTART;
+        }
+
+        --handle;
+
         map = &maptrack_entry(gt, handle);
         if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) )
             continue;
@@ -3709,6 +3737,11 @@ gnttab_release_mappings(
 
         map->flags = 0;
     }
+
+    gt->maptrack_limit = 0;
+    FREE_XENHEAP_PAGE(gt->maptrack[0]);
+
+    return 0;
 }
 
 void grant_table_warn_active_grants(struct domain *d)
@@ -3771,8 +3804,7 @@ grant_table_destroy(
         free_xenheap_page(t->shared_raw[i]);
     xfree(t->shared_raw);
 
-    for ( i = 0; i < nr_maptrack_frames(t); i++ )
-        free_xenheap_page(t->maptrack[i]);
+    ASSERT(!t->maptrack_limit);
     vfree(t->maptrack);
 
     for ( i = 0; i < nr_active_grant_frames(t); i++ )
--- a/xen/include/xen/grant_table.h
+++ b/xen/include/xen/grant_table.h
@@ -46,9 +46,7 @@ void grant_table_init_vcpu(struct vcpu *
 void grant_table_warn_active_grants(struct domain *d);
 
 /* Domain death release of granted mappings of other domains' memory. */
-void
-gnttab_release_mappings(
-    struct domain *d);
+int gnttab_release_mappings(struct domain *d);
 
 int mem_sharing_gref_to_gfn(struct grant_table *gt, grant_ref_t ref,
                             gfn_t *gfn, uint16_t *status);

[-- Attachment #9: xsa380/xsa380-4.12-2.patch --]
[-- Type: application/octet-stream, Size: 12601 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: gnttab: replace mapkind()

mapkind() doesn't scale very well with larger maptrack entry counts,
using a brute force linear search through all entries, with the only
option of an early loop exit if a matching writable entry was found.
Introduce a radix tree alongside the main maptrack table, thus
allowing much faster MFN-based lookup. To avoid the need to actually
allocate space for the individual nodes, encode the two counters in the
node pointers themselves, thus limiting the number of permitted
simultaneous r/o and r/w mappings of the same MFN to 2³¹-1 (64-bit) /
2¹⁵-1 (32-bit) each.

To avoid enforcing an unnecessarily low bound on the number of
simultaneous mappings of a single MFN, introduce
radix_tree_{ulong_to_ptr,ptr_to_ulong} paralleling
radix_tree_{int_to_ptr,ptr_to_int}.

As a consequence locking changes are also applicable: With there no
longer being any inspection of the remote domain's active entries,
there's also no need anymore to hold the remote domain's grant table
lock. And since we're no longer iterating over the local domain's map
track table, the lock in map_grant_ref() can also be dropped before the
new maptrack entry actually gets populated.

As a nice side effect this also reduces the number of IOMMU operations
in unmap_common(): Previously we would have "established" a readable
mapping whenever we didn't find a writable entry anymore (yet, of
course, at least one readable one). But we only need to do this if we
actually dropped the last writable entry, not if there were none already
before.

This is part of CVE-2021-28698 / XSA-380.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -36,6 +36,7 @@
 #include <xen/iommu.h>
 #include <xen/paging.h>
 #include <xen/keyhandler.h>
+#include <xen/radix-tree.h>
 #include <xen/vmap.h>
 #include <xsm/xsm.h>
 #include <asm/flushtlb.h>
@@ -80,8 +81,13 @@ struct grant_table {
     grant_status_t       **status;
     /* Active grant table. */
     struct active_grant_entry **active;
-    /* Mapping tracking table per vcpu. */
+    /* Handle-indexed tracking table of mappings. */
     struct grant_mapping **maptrack;
+    /*
+     * MFN-indexed tracking tree of mappings, if needed.  Note that this is
+     * protected by @lock, not @maptrack_lock.
+     */
+    struct radix_tree_root maptrack_tree;
 
     /* Domain to which this struct grant_table belongs. */
     const struct domain *domain;
@@ -445,34 +451,6 @@ static int get_paged_frame(unsigned long
     return GNTST_okay;
 }
 
-static inline void
-double_gt_lock(struct grant_table *lgt, struct grant_table *rgt)
-{
-    /*
-     * See mapkind() for why the write lock is also required for the
-     * remote domain.
-     */
-    if ( lgt < rgt )
-    {
-        grant_write_lock(lgt);
-        grant_write_lock(rgt);
-    }
-    else
-    {
-        if ( lgt != rgt )
-            grant_write_lock(rgt);
-        grant_write_lock(lgt);
-    }
-}
-
-static inline void
-double_gt_unlock(struct grant_table *lgt, struct grant_table *rgt)
-{
-    grant_write_unlock(lgt);
-    if ( lgt != rgt )
-        grant_write_unlock(rgt);
-}
-
 #define INVALID_MAPTRACK_HANDLE UINT_MAX
 
 static inline grant_handle_t
@@ -895,41 +873,17 @@ static struct active_grant_entry *grant_
     return ERR_PTR(-EINVAL);
 }
 
-#define MAPKIND_READ 1
-#define MAPKIND_WRITE 2
-static unsigned int mapkind(
-    struct grant_table *lgt, const struct domain *rd, mfn_t mfn)
-{
-    struct grant_mapping *map;
-    grant_handle_t handle, limit = lgt->maptrack_limit;
-    unsigned int kind = 0;
-
-    /*
-     * Must have the local domain's grant table write lock when
-     * iterating over its maptrack entries.
-     */
-    ASSERT(percpu_rw_is_write_locked(&lgt->lock));
-    /*
-     * Must have the remote domain's grant table write lock while
-     * counting its active entries.
-     */
-    ASSERT(percpu_rw_is_write_locked(&rd->grant_table->lock));
-
-    smp_rmb();
-
-    for ( handle = 0; !(kind & MAPKIND_WRITE) && handle < limit; handle++ )
-    {
-        map = &maptrack_entry(lgt, handle);
-        if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) ||
-             map->domid != rd->domain_id )
-            continue;
-        if ( mfn_eq(_active_entry(rd->grant_table, map->ref).mfn, mfn) )
-            kind |= map->flags & GNTMAP_readonly ?
-                    MAPKIND_READ : MAPKIND_WRITE;
-    }
-
-    return kind;
-}
+union maptrack_node {
+    struct {
+        /* Radix tree slot pointers use two of the bits. */
+#ifdef __BIG_ENDIAN_BITFIELD
+        unsigned long    : 2;
+#endif
+        unsigned long rd : BITS_PER_LONG / 2 - 1;
+        unsigned long wr : BITS_PER_LONG / 2 - 1;
+    } cnt;
+    unsigned long raw;
+};
 
 static void
 map_grant_ref(
@@ -948,7 +902,6 @@ map_grant_ref(
     struct grant_mapping *mt;
     grant_entry_header_t *shah;
     uint16_t *status;
-    bool_t need_iommu;
 
     led = current;
     ld = led->domain;
@@ -1156,31 +1109,75 @@ map_grant_ref(
         goto undo_out;
     }
 
-    need_iommu = gnttab_need_iommu_mapping(ld);
-    if ( need_iommu )
+    if ( gnttab_need_iommu_mapping(ld) )
     {
+        union maptrack_node node = {
+            .cnt.rd = !!(op->flags & GNTMAP_readonly),
+            .cnt.wr = !(op->flags & GNTMAP_readonly),
+        };
+        int err;
+        void **slot = NULL;
         unsigned int kind;
 
-        double_gt_lock(lgt, rgt);
+        grant_write_lock(lgt);
+
+        err = radix_tree_insert(&lgt->maptrack_tree, mfn_x(mfn),
+                                radix_tree_ulong_to_ptr(node.raw));
+        if ( err == -EEXIST )
+        {
+            slot = radix_tree_lookup_slot(&lgt->maptrack_tree, mfn_x(mfn));
+            if ( likely(slot) )
+            {
+                node.raw = radix_tree_ptr_to_ulong(*slot);
+                err = -EBUSY;
+
+                /* Update node only when refcount doesn't overflow. */
+                if ( op->flags & GNTMAP_readonly ? ++node.cnt.rd
+                                                 : ++node.cnt.wr )
+                {
+                    radix_tree_replace_slot(slot,
+                                            radix_tree_ulong_to_ptr(node.raw));
+                    err = 0;
+                }
+            }
+            else
+                ASSERT_UNREACHABLE();
+        }
 
         /*
          * We're not translated, so we know that dfns and mfns are
          * the same things, so the IOMMU entry is always 1-to-1.
          */
-        kind = mapkind(lgt, rd, mfn);
-        if ( !(op->flags & GNTMAP_readonly) &&
-             !(kind & MAPKIND_WRITE) )
+        if ( !(op->flags & GNTMAP_readonly) && node.cnt.wr == 1 )
             kind = IOMMUF_readable | IOMMUF_writable;
-        else if ( !kind )
+        else if ( (op->flags & GNTMAP_readonly) &&
+                  node.cnt.rd == 1 && !node.cnt.wr )
             kind = IOMMUF_readable;
         else
             kind = 0;
-        if ( kind && iommu_legacy_map(ld, _dfn(mfn_x(mfn)), mfn, 0, kind) )
+        if ( err ||
+             (kind && iommu_legacy_map(ld, _dfn(mfn_x(mfn)), mfn, 0, kind)) )
         {
-            double_gt_unlock(lgt, rgt);
+            if ( !err )
+            {
+                if ( slot )
+                {
+                    op->flags & GNTMAP_readonly ? node.cnt.rd--
+                                                : node.cnt.wr--;
+                    radix_tree_replace_slot(slot,
+                                            radix_tree_ulong_to_ptr(node.raw));
+                }
+                else
+                    radix_tree_delete(&lgt->maptrack_tree, mfn_x(mfn));
+            }
+
             rc = GNTST_general_error;
-            goto undo_out;
         }
+
+        grant_write_unlock(lgt);
+
+        if ( rc != GNTST_okay )
+            goto undo_out;
     }
 
     TRACE_1D(TRC_MEM_PAGE_GRANT_MAP, op->dom);
@@ -1188,10 +1185,6 @@ map_grant_ref(
     /*
      * All maptrack entry users check mt->flags first before using the
      * other fields so just ensure the flags field is stored last.
-     *
-     * However, if gnttab_need_iommu_mapping() then this would race
-     * with a concurrent mapkind() call (on an unmap, for example)
-     * and a lock is required.
      */
     mt = &maptrack_entry(lgt, handle);
     mt->domid = op->dom;
@@ -1199,9 +1192,6 @@ map_grant_ref(
     smp_wmb();
     write_atomic(&mt->flags, op->flags);
 
-    if ( need_iommu )
-        double_gt_unlock(lgt, rgt);
-
     op->dev_bus_addr = mfn_to_maddr(mfn);
     op->handle       = handle;
     op->status       = GNTST_okay;
@@ -1414,19 +1404,34 @@ unmap_common(
 
     if ( rc == GNTST_okay && gnttab_need_iommu_mapping(ld) )
     {
-        unsigned int kind;
+        void **slot;
+        union maptrack_node node;
         int err = 0;
 
-        double_gt_lock(lgt, rgt);
+        grant_write_lock(lgt);
+        slot = radix_tree_lookup_slot(&lgt->maptrack_tree, mfn_x(op->mfn));
+        node.raw = likely(slot) ? radix_tree_ptr_to_ulong(*slot) : 0;
+
+        /* Refcount must not underflow. */
+        if ( !(flags & GNTMAP_readonly ? node.cnt.rd--
+                                       : node.cnt.wr--) )
+            BUG();
 
-        kind = mapkind(lgt, rd, op->mfn);
-        if ( !kind )
+        if ( !node.raw )
             err = iommu_legacy_unmap(ld, _dfn(mfn_x(op->mfn)), 0);
-        else if ( !(kind & MAPKIND_WRITE) )
+        else if ( !(flags & GNTMAP_readonly) && !node.cnt.wr )
             err = iommu_legacy_map(ld, _dfn(mfn_x(op->mfn)), op->mfn, 0,
                                    IOMMUF_readable);
 
-        double_gt_unlock(lgt, rgt);
+        if ( err )
+            ;
+        else if ( !node.raw )
+            radix_tree_delete(&lgt->maptrack_tree, mfn_x(op->mfn));
+        else
+            radix_tree_replace_slot(slot,
+                                    radix_tree_ulong_to_ptr(node.raw));
+
+        grant_write_unlock(lgt);
 
         if ( err )
             rc = GNTST_general_error;
@@ -1874,6 +1879,8 @@ int grant_table_init(struct domain *d, i
         gt->maptrack = vzalloc(gt->max_maptrack_frames * sizeof(*gt->maptrack));
         if ( gt->maptrack == NULL )
             goto out;
+
+        radix_tree_init(&gt->maptrack_tree);
     }
 
     /* Shared grant table. */
@@ -3629,6 +3636,8 @@ int gnttab_release_mappings(struct domai
 
     for ( handle = gt->maptrack_limit; handle; )
     {
+        mfn_t mfn;
+
         /*
          * Deal with full pages such that their freeing (in the body of the
          * if()) remains simple.
@@ -3730,17 +3739,31 @@ int gnttab_release_mappings(struct domai
         if ( act->pin == 0 )
             gnttab_clear_flag(rd, _GTF_reading, status);
 
+        mfn = act->mfn;
+
         active_entry_release(act);
         grant_read_unlock(rgt);
 
         rcu_unlock_domain(rd);
 
         map->flags = 0;
+
+        /*
+         * This is excessive in that a single such call would suffice per
+         * mapped MFN (or none at all, if no entry was ever inserted). But it
+         * should be the common case for an MFN to be mapped just once, and
+         * this way we don't need to further maintain the counters. We also
+         * don't want to leave cleaning up of the tree as a whole to the end
+         * of the function, as this could take quite some time.
+         */
+        radix_tree_delete(&gt->maptrack_tree, mfn_x(mfn));
     }
 
     gt->maptrack_limit = 0;
     FREE_XENHEAP_PAGE(gt->maptrack[0]);
 
+    radix_tree_destroy(&gt->maptrack_tree, NULL);
+
     return 0;
 }
 
--- a/xen/include/xen/radix-tree.h
+++ b/xen/include/xen/radix-tree.h
@@ -190,6 +190,25 @@ static inline int radix_tree_ptr_to_int(
     return (int)((long)ptr >> 2);
 }
 
+/**
+ * radix_tree_{ulong_to_ptr,ptr_to_ulong}:
+ *
+ * Same for unsigned long values. Beware though that only BITS_PER_LONG-2
+ * bits are actually usable for the value.
+ */
+static inline void *radix_tree_ulong_to_ptr(unsigned long val)
+{
+    unsigned long ptr = (val << 2) | 0x2;
+    ASSERT((ptr >> 2) == val);
+    return (void *)ptr;
+}
+
+static inline unsigned long radix_tree_ptr_to_ulong(void *ptr)
+{
+    ASSERT(((unsigned long)ptr & 0x3) == 0x2);
+    return (unsigned long)ptr >> 2;
+}
+
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);

[-- Attachment #10: xsa380/xsa380-4.13-1.patch --]
[-- Type: application/octet-stream, Size: 5187 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: gnttab: add preemption check to gnttab_release_mappings()

A guest may die with many grant mappings still in place, or simply with
a large maptrack table. Iterating through this may take more time than
is reasonable without intermediate preemption (to run softirqs and
perhaps the scheduler).

Move the invocation of the function to the section where other
restartable functions get invoked, and have the function itself check
for preemption every once in a while. Have it iterate the table
backwards, such that decreasing the maptrack limit is all it takes to
convey restart information.

In domain_teardown() introduce PROG_none such that inserting at the
front will be easier going forward.

This is part of CVE-2021-28698 / XSA-380.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -776,11 +776,13 @@ int domain_kill(struct domain *d)
             return domain_kill(d);
         d->is_dying = DOMDYING_dying;
         argo_destroy(d);
-        gnttab_release_mappings(d);
         vnuma_destroy(d->vnuma);
         domain_set_outstanding_pages(d, 0);
         /* fallthrough */
     case DOMDYING_dying:
+        rc = gnttab_release_mappings(d);
+        if ( rc )
+            break;
         rc = evtchn_destroy(d);
         if ( rc )
             break;
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -63,7 +63,13 @@ struct grant_table {
     unsigned int          nr_grant_frames;
     /* Number of grant status frames shared with guest (for version 2) */
     unsigned int          nr_status_frames;
-    /* Number of available maptrack entries. */
+    /*
+     * Number of available maptrack entries.  For cleanup purposes it is
+     * important to realize that this field and @maptrack further down will
+     * only ever be accessed by the local domain.  Thus it is okay to clean
+     * up early, and to shrink the limit for the purpose of tracking cleanup
+     * progress.
+     */
     unsigned int          maptrack_limit;
     /* Shared grant table (see include/public/grant_table.h). */
     union {
@@ -3675,9 +3681,7 @@ do_grant_table_op(
 #include "compat/grant_table.c"
 #endif
 
-void
-gnttab_release_mappings(
-    struct domain *d)
+int gnttab_release_mappings(struct domain *d)
 {
     struct grant_table   *gt = d->grant_table, *rgt;
     struct grant_mapping *map;
@@ -3691,10 +3695,34 @@ gnttab_release_mappings(
 
     BUG_ON(!d->is_dying);
 
-    for ( handle = 0; handle < gt->maptrack_limit; handle++ )
+    if ( !gt || !gt->maptrack )
+        return 0;
+
+    for ( handle = gt->maptrack_limit; handle; )
     {
         unsigned int clear_flags = 0;
 
+        /*
+         * Deal with full pages such that their freeing (in the body of the
+         * if()) remains simple.
+         */
+        if ( handle < gt->maptrack_limit && !(handle % MAPTRACK_PER_PAGE) )
+        {
+            /*
+             * Changing maptrack_limit alters nr_maptrack_frames()'es return
+             * value. Free the then excess trailing page right here, rather
+             * than leaving it to grant_table_destroy() (and in turn requiring
+             * to leave gt->maptrack_limit unaltered).
+             */
+            gt->maptrack_limit = handle;
+            FREE_XENHEAP_PAGE(gt->maptrack[nr_maptrack_frames(gt)]);
+
+            if ( hypercall_preempt_check() )
+                return -ERESTART;
+        }
+
+        --handle;
+
         map = &maptrack_entry(gt, handle);
         if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) )
             continue;
@@ -3785,6 +3813,11 @@ gnttab_release_mappings(
 
         map->flags = 0;
     }
+
+    gt->maptrack_limit = 0;
+    FREE_XENHEAP_PAGE(gt->maptrack[0]);
+
+    return 0;
 }
 
 void grant_table_warn_active_grants(struct domain *d)
@@ -3848,8 +3881,7 @@ grant_table_destroy(
         free_xenheap_page(t->shared_raw[i]);
     xfree(t->shared_raw);
 
-    for ( i = 0; i < nr_maptrack_frames(t); i++ )
-        free_xenheap_page(t->maptrack[i]);
+    ASSERT(!t->maptrack_limit);
     vfree(t->maptrack);
 
     for ( i = 0; i < nr_active_grant_frames(t); i++ )
--- a/xen/include/xen/grant_table.h
+++ b/xen/include/xen/grant_table.h
@@ -47,9 +47,7 @@ void grant_table_init_vcpu(struct vcpu *
 void grant_table_warn_active_grants(struct domain *d);
 
 /* Domain death release of granted mappings of other domains' memory. */
-void
-gnttab_release_mappings(
-    struct domain *d);
+int gnttab_release_mappings(struct domain *d);
 
 int mem_sharing_gref_to_gfn(struct grant_table *gt, grant_ref_t ref,
                             gfn_t *gfn, uint16_t *status);
@@ -78,7 +76,7 @@ static inline void grant_table_init_vcpu
 
 static inline void grant_table_warn_active_grants(struct domain *d) {}
 
-static inline void gnttab_release_mappings(struct domain *d) {}
+static inline int gnttab_release_mappings(struct domain *d) { return 0; }
 
 static inline int mem_sharing_gref_to_gfn(struct grant_table *gt,
                                           grant_ref_t ref,

[-- Attachment #11: xsa380/xsa380-4.13-2.patch --]
[-- Type: application/octet-stream, Size: 12814 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: gnttab: replace mapkind()

mapkind() doesn't scale very well with larger maptrack entry counts,
using a brute force linear search through all entries, with the only
option of an early loop exit if a matching writable entry was found.
Introduce a radix tree alongside the main maptrack table, thus
allowing much faster MFN-based lookup. To avoid the need to actually
allocate space for the individual nodes, encode the two counters in the
node pointers themselves, thus limiting the number of permitted
simultaneous r/o and r/w mappings of the same MFN to 2³¹-1 (64-bit) /
2¹⁵-1 (32-bit) each.

To avoid enforcing an unnecessarily low bound on the number of
simultaneous mappings of a single MFN, introduce
radix_tree_{ulong_to_ptr,ptr_to_ulong} paralleling
radix_tree_{int_to_ptr,ptr_to_int}.

As a consequence locking changes are also applicable: With there no
longer being any inspection of the remote domain's active entries,
there's also no need anymore to hold the remote domain's grant table
lock. And since we're no longer iterating over the local domain's map
track table, the lock in map_grant_ref() can also be dropped before the
new maptrack entry actually gets populated.

As a nice side effect this also reduces the number of IOMMU operations
in unmap_common(): Previously we would have "established" a readable
mapping whenever we didn't find a writable entry anymore (yet, of
course, at least one readable one). But we only need to do this if we
actually dropped the last writable entry, not if there were none already
before.

This is part of CVE-2021-28698 / XSA-380.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -36,6 +36,7 @@
 #include <xen/iommu.h>
 #include <xen/paging.h>
 #include <xen/keyhandler.h>
+#include <xen/radix-tree.h>
 #include <xen/vmap.h>
 #include <xen/nospec.h>
 #include <xsm/xsm.h>
@@ -81,8 +82,13 @@ struct grant_table {
     grant_status_t       **status;
     /* Active grant table. */
     struct active_grant_entry **active;
-    /* Mapping tracking table per vcpu. */
+    /* Handle-indexed tracking table of mappings. */
     struct grant_mapping **maptrack;
+    /*
+     * MFN-indexed tracking tree of mappings, if needed.  Note that this is
+     * protected by @lock, not @maptrack_lock.
+     */
+    struct radix_tree_root maptrack_tree;
 
     /* Domain to which this struct grant_table belongs. */
     const struct domain *domain;
@@ -460,34 +466,6 @@ static int get_paged_frame(unsigned long
     return GNTST_okay;
 }
 
-static inline void
-double_gt_lock(struct grant_table *lgt, struct grant_table *rgt)
-{
-    /*
-     * See mapkind() for why the write lock is also required for the
-     * remote domain.
-     */
-    if ( lgt < rgt )
-    {
-        grant_write_lock(lgt);
-        grant_write_lock(rgt);
-    }
-    else
-    {
-        if ( lgt != rgt )
-            grant_write_lock(rgt);
-        grant_write_lock(lgt);
-    }
-}
-
-static inline void
-double_gt_unlock(struct grant_table *lgt, struct grant_table *rgt)
-{
-    grant_write_unlock(lgt);
-    if ( lgt != rgt )
-        grant_write_unlock(rgt);
-}
-
 #define INVALID_MAPTRACK_HANDLE UINT_MAX
 
 static inline grant_handle_t
@@ -907,41 +885,17 @@ static struct active_grant_entry *grant_
     return ERR_PTR(-EINVAL);
 }
 
-#define MAPKIND_READ 1
-#define MAPKIND_WRITE 2
-static unsigned int mapkind(
-    struct grant_table *lgt, const struct domain *rd, mfn_t mfn)
-{
-    struct grant_mapping *map;
-    grant_handle_t handle, limit = lgt->maptrack_limit;
-    unsigned int kind = 0;
-
-    /*
-     * Must have the local domain's grant table write lock when
-     * iterating over its maptrack entries.
-     */
-    ASSERT(percpu_rw_is_write_locked(&lgt->lock));
-    /*
-     * Must have the remote domain's grant table write lock while
-     * counting its active entries.
-     */
-    ASSERT(percpu_rw_is_write_locked(&rd->grant_table->lock));
-
-    smp_rmb();
-
-    for ( handle = 0; !(kind & MAPKIND_WRITE) && handle < limit; handle++ )
-    {
-        map = &maptrack_entry(lgt, handle);
-        if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) ||
-             map->domid != rd->domain_id )
-            continue;
-        if ( mfn_eq(_active_entry(rd->grant_table, map->ref).mfn, mfn) )
-            kind |= map->flags & GNTMAP_readonly ?
-                    MAPKIND_READ : MAPKIND_WRITE;
-    }
-
-    return kind;
-}
+union maptrack_node {
+    struct {
+        /* Radix tree slot pointers use two of the bits. */
+#ifdef __BIG_ENDIAN_BITFIELD
+        unsigned long    : 2;
+#endif
+        unsigned long rd : BITS_PER_LONG / 2 - 1;
+        unsigned long wr : BITS_PER_LONG / 2 - 1;
+    } cnt;
+    unsigned long raw;
+};
 
 static void
 map_grant_ref(
@@ -961,7 +915,6 @@ map_grant_ref(
     struct grant_mapping *mt;
     grant_entry_header_t *shah;
     uint16_t *status;
-    bool_t need_iommu;
 
     led = current;
     ld = led->domain;
@@ -1181,31 +1134,75 @@ map_grant_ref(
      * as mem-sharing and IOMMU use are incompatible). The dom_io case would
      * need checking separately if we compared against owner here.
      */
-    need_iommu = ld != rd && gnttab_need_iommu_mapping(ld);
-    if ( need_iommu )
+    if ( ld != rd && gnttab_need_iommu_mapping(ld) )
     {
+        union maptrack_node node = {
+            .cnt.rd = !!(op->flags & GNTMAP_readonly),
+            .cnt.wr = !(op->flags & GNTMAP_readonly),
+        };
+        int err;
+        void **slot = NULL;
         unsigned int kind;
 
-        double_gt_lock(lgt, rgt);
+        grant_write_lock(lgt);
+
+        err = radix_tree_insert(&lgt->maptrack_tree, mfn_x(mfn),
+                                radix_tree_ulong_to_ptr(node.raw));
+        if ( err == -EEXIST )
+        {
+            slot = radix_tree_lookup_slot(&lgt->maptrack_tree, mfn_x(mfn));
+            if ( likely(slot) )
+            {
+                node.raw = radix_tree_ptr_to_ulong(*slot);
+                err = -EBUSY;
+
+                /* Update node only when refcount doesn't overflow. */
+                if ( op->flags & GNTMAP_readonly ? ++node.cnt.rd
+                                                 : ++node.cnt.wr )
+                {
+                    radix_tree_replace_slot(slot,
+                                            radix_tree_ulong_to_ptr(node.raw));
+                    err = 0;
+                }
+            }
+            else
+                ASSERT_UNREACHABLE();
+        }
 
         /*
          * We're not translated, so we know that dfns and mfns are
          * the same things, so the IOMMU entry is always 1-to-1.
          */
-        kind = mapkind(lgt, rd, mfn);
-        if ( !(op->flags & GNTMAP_readonly) &&
-             !(kind & MAPKIND_WRITE) )
+        if ( !(op->flags & GNTMAP_readonly) && node.cnt.wr == 1 )
             kind = IOMMUF_readable | IOMMUF_writable;
-        else if ( !kind )
+        else if ( (op->flags & GNTMAP_readonly) &&
+                  node.cnt.rd == 1 && !node.cnt.wr )
             kind = IOMMUF_readable;
         else
             kind = 0;
-        if ( kind && iommu_legacy_map(ld, _dfn(mfn_x(mfn)), mfn, 0, kind) )
+        if ( err ||
+             (kind && iommu_legacy_map(ld, _dfn(mfn_x(mfn)), mfn, 0, kind)) )
         {
-            double_gt_unlock(lgt, rgt);
+            if ( !err )
+            {
+                if ( slot )
+                {
+                    op->flags & GNTMAP_readonly ? node.cnt.rd--
+                                                : node.cnt.wr--;
+                    radix_tree_replace_slot(slot,
+                                            radix_tree_ulong_to_ptr(node.raw));
+                }
+                else
+                    radix_tree_delete(&lgt->maptrack_tree, mfn_x(mfn));
+            }
+
             rc = GNTST_general_error;
-            goto undo_out;
         }
+
+        grant_write_unlock(lgt);
+
+        if ( rc != GNTST_okay )
+            goto undo_out;
     }
 
     TRACE_1D(TRC_MEM_PAGE_GRANT_MAP, op->dom);
@@ -1213,10 +1210,6 @@ map_grant_ref(
     /*
      * All maptrack entry users check mt->flags first before using the
      * other fields so just ensure the flags field is stored last.
-     *
-     * However, if gnttab_need_iommu_mapping() then this would race
-     * with a concurrent mapkind() call (on an unmap, for example)
-     * and a lock is required.
      */
     mt = &maptrack_entry(lgt, handle);
     mt->domid = op->dom;
@@ -1224,9 +1217,6 @@ map_grant_ref(
     smp_wmb();
     write_atomic(&mt->flags, op->flags);
 
-    if ( need_iommu )
-        double_gt_unlock(lgt, rgt);
-
     op->dev_bus_addr = mfn_to_maddr(mfn);
     op->handle       = handle;
     op->status       = GNTST_okay;
@@ -1448,19 +1438,34 @@ unmap_common(
     /* See the respective comment in map_grant_ref(). */
     if ( rc == GNTST_okay && ld != rd && gnttab_need_iommu_mapping(ld) )
     {
-        unsigned int kind;
+        void **slot;
+        union maptrack_node node;
         int err = 0;
 
-        double_gt_lock(lgt, rgt);
+        grant_write_lock(lgt);
+        slot = radix_tree_lookup_slot(&lgt->maptrack_tree, mfn_x(op->mfn));
+        node.raw = likely(slot) ? radix_tree_ptr_to_ulong(*slot) : 0;
+
+        /* Refcount must not underflow. */
+        if ( !(flags & GNTMAP_readonly ? node.cnt.rd--
+                                       : node.cnt.wr--) )
+            BUG();
 
-        kind = mapkind(lgt, rd, op->mfn);
-        if ( !kind )
+        if ( !node.raw )
             err = iommu_legacy_unmap(ld, _dfn(mfn_x(op->mfn)), 0);
-        else if ( !(kind & MAPKIND_WRITE) )
+        else if ( !(flags & GNTMAP_readonly) && !node.cnt.wr )
             err = iommu_legacy_map(ld, _dfn(mfn_x(op->mfn)), op->mfn, 0,
                                    IOMMUF_readable);
 
-        double_gt_unlock(lgt, rgt);
+        if ( err )
+            ;
+        else if ( !node.raw )
+            radix_tree_delete(&lgt->maptrack_tree, mfn_x(op->mfn));
+        else
+            radix_tree_replace_slot(slot,
+                                    radix_tree_ulong_to_ptr(node.raw));
+
+        grant_write_unlock(lgt);
 
         if ( err )
             rc = GNTST_general_error;
@@ -1918,6 +1923,8 @@ int grant_table_init(struct domain *d, i
         gt->maptrack = vzalloc(gt->max_maptrack_frames * sizeof(*gt->maptrack));
         if ( gt->maptrack == NULL )
             goto out;
+
+        radix_tree_init(&gt->maptrack_tree);
     }
 
     /* Shared grant table. */
@@ -3701,6 +3708,7 @@ int gnttab_release_mappings(struct domai
     for ( handle = gt->maptrack_limit; handle; )
     {
         unsigned int clear_flags = 0;
+        mfn_t mfn;
 
         /*
          * Deal with full pages such that their freeing (in the body of the
@@ -3806,17 +3814,31 @@ int gnttab_release_mappings(struct domai
         if ( clear_flags )
             gnttab_clear_flags(rd, clear_flags, status);
 
+        mfn = act->mfn;
+
         active_entry_release(act);
         grant_read_unlock(rgt);
 
         rcu_unlock_domain(rd);
 
         map->flags = 0;
+
+        /*
+         * This is excessive in that a single such call would suffice per
+         * mapped MFN (or none at all, if no entry was ever inserted). But it
+         * should be the common case for an MFN to be mapped just once, and
+         * this way we don't need to further maintain the counters. We also
+         * don't want to leave cleaning up of the tree as a whole to the end
+         * of the function, as this could take quite some time.
+         */
+        radix_tree_delete(&gt->maptrack_tree, mfn_x(mfn));
     }
 
     gt->maptrack_limit = 0;
     FREE_XENHEAP_PAGE(gt->maptrack[0]);
 
+    radix_tree_destroy(&gt->maptrack_tree, NULL);
+
     return 0;
 }
 
--- a/xen/include/xen/radix-tree.h
+++ b/xen/include/xen/radix-tree.h
@@ -190,6 +190,25 @@ static inline int radix_tree_ptr_to_int(
     return (int)((long)ptr >> 2);
 }
 
+/**
+ * radix_tree_{ulong_to_ptr,ptr_to_ulong}:
+ *
+ * Same for unsigned long values. Beware though that only BITS_PER_LONG-2
+ * bits are actually usable for the value.
+ */
+static inline void *radix_tree_ulong_to_ptr(unsigned long val)
+{
+    unsigned long ptr = (val << 2) | 0x2;
+    ASSERT((ptr >> 2) == val);
+    return (void *)ptr;
+}
+
+static inline unsigned long radix_tree_ptr_to_ulong(void *ptr)
+{
+    ASSERT(((unsigned long)ptr & 0x3) == 0x2);
+    return (unsigned long)ptr >> 2;
+}
+
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);

[-- Attachment #12: xsa380/xsa380-4.14-1.patch --]
[-- Type: application/octet-stream, Size: 5187 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: gnttab: add preemption check to gnttab_release_mappings()

A guest may die with many grant mappings still in place, or simply with
a large maptrack table. Iterating through this may take more time than
is reasonable without intermediate preemption (to run softirqs and
perhaps the scheduler).

Move the invocation of the function to the section where other
restartable functions get invoked, and have the function itself check
for preemption every once in a while. Have it iterate the table
backwards, such that decreasing the maptrack limit is all it takes to
convey restart information.

In domain_teardown() introduce PROG_none such that inserting at the
front will be easier going forward.

This is part of CVE-2021-28698 / XSA-380.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -721,11 +721,13 @@ int domain_kill(struct domain *d)
             return domain_kill(d);
         d->is_dying = DOMDYING_dying;
         argo_destroy(d);
-        gnttab_release_mappings(d);
         vnuma_destroy(d->vnuma);
         domain_set_outstanding_pages(d, 0);
         /* fallthrough */
     case DOMDYING_dying:
+        rc = gnttab_release_mappings(d);
+        if ( rc )
+            break;
         rc = evtchn_destroy(d);
         if ( rc )
             break;
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -64,7 +64,13 @@ struct grant_table {
     unsigned int          nr_grant_frames;
     /* Number of grant status frames shared with guest (for version 2) */
     unsigned int          nr_status_frames;
-    /* Number of available maptrack entries. */
+    /*
+     * Number of available maptrack entries.  For cleanup purposes it is
+     * important to realize that this field and @maptrack further down will
+     * only ever be accessed by the local domain.  Thus it is okay to clean
+     * up early, and to shrink the limit for the purpose of tracking cleanup
+     * progress.
+     */
     unsigned int          maptrack_limit;
     /* Shared grant table (see include/public/grant_table.h). */
     union {
@@ -3708,9 +3714,7 @@ do_grant_table_op(
 #include "compat/grant_table.c"
 #endif
 
-void
-gnttab_release_mappings(
-    struct domain *d)
+int gnttab_release_mappings(struct domain *d)
 {
     struct grant_table   *gt = d->grant_table, *rgt;
     struct grant_mapping *map;
@@ -3724,10 +3728,34 @@ gnttab_release_mappings(
 
     BUG_ON(!d->is_dying);
 
-    for ( handle = 0; handle < gt->maptrack_limit; handle++ )
+    if ( !gt || !gt->maptrack )
+        return 0;
+
+    for ( handle = gt->maptrack_limit; handle; )
     {
         unsigned int clear_flags = 0;
 
+        /*
+         * Deal with full pages such that their freeing (in the body of the
+         * if()) remains simple.
+         */
+        if ( handle < gt->maptrack_limit && !(handle % MAPTRACK_PER_PAGE) )
+        {
+            /*
+             * Changing maptrack_limit alters nr_maptrack_frames()'es return
+             * value. Free the then excess trailing page right here, rather
+             * than leaving it to grant_table_destroy() (and in turn requiring
+             * to leave gt->maptrack_limit unaltered).
+             */
+            gt->maptrack_limit = handle;
+            FREE_XENHEAP_PAGE(gt->maptrack[nr_maptrack_frames(gt)]);
+
+            if ( hypercall_preempt_check() )
+                return -ERESTART;
+        }
+
+        --handle;
+
         map = &maptrack_entry(gt, handle);
         if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) )
             continue;
@@ -3818,6 +3846,11 @@ gnttab_release_mappings(
 
         map->flags = 0;
     }
+
+    gt->maptrack_limit = 0;
+    FREE_XENHEAP_PAGE(gt->maptrack[0]);
+
+    return 0;
 }
 
 void grant_table_warn_active_grants(struct domain *d)
@@ -3881,8 +3914,7 @@ grant_table_destroy(
         free_xenheap_page(t->shared_raw[i]);
     xfree(t->shared_raw);
 
-    for ( i = 0; i < nr_maptrack_frames(t); i++ )
-        free_xenheap_page(t->maptrack[i]);
+    ASSERT(!t->maptrack_limit);
     vfree(t->maptrack);
 
     for ( i = 0; i < nr_active_grant_frames(t); i++ )
--- a/xen/include/xen/grant_table.h
+++ b/xen/include/xen/grant_table.h
@@ -47,9 +47,7 @@ void grant_table_init_vcpu(struct vcpu *
 void grant_table_warn_active_grants(struct domain *d);
 
 /* Domain death release of granted mappings of other domains' memory. */
-void
-gnttab_release_mappings(
-    struct domain *d);
+int gnttab_release_mappings(struct domain *d);
 
 int mem_sharing_gref_to_gfn(struct grant_table *gt, grant_ref_t ref,
                             gfn_t *gfn, uint16_t *status);
@@ -78,7 +76,7 @@ static inline void grant_table_init_vcpu
 
 static inline void grant_table_warn_active_grants(struct domain *d) {}
 
-static inline void gnttab_release_mappings(struct domain *d) {}
+static inline int gnttab_release_mappings(struct domain *d) { return 0; }
 
 static inline int mem_sharing_gref_to_gfn(struct grant_table *gt,
                                           grant_ref_t ref,

[-- Attachment #13: xsa380/xsa380-4.14-2.patch --]
[-- Type: application/octet-stream, Size: 12801 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: gnttab: replace mapkind()

mapkind() doesn't scale very well with larger maptrack entry counts,
using a brute force linear search through all entries, with the only
option of an early loop exit if a matching writable entry was found.
Introduce a radix tree alongside the main maptrack table, thus
allowing much faster MFN-based lookup. To avoid the need to actually
allocate space for the individual nodes, encode the two counters in the
node pointers themselves, thus limiting the number of permitted
simultaneous r/o and r/w mappings of the same MFN to 2³¹-1 (64-bit) /
2¹⁵-1 (32-bit) each.

To avoid enforcing an unnecessarily low bound on the number of
simultaneous mappings of a single MFN, introduce
radix_tree_{ulong_to_ptr,ptr_to_ulong} paralleling
radix_tree_{int_to_ptr,ptr_to_int}.

As a consequence locking changes are also applicable: With there no
longer being any inspection of the remote domain's active entries,
there's also no need anymore to hold the remote domain's grant table
lock. And since we're no longer iterating over the local domain's map
track table, the lock in map_grant_ref() can also be dropped before the
new maptrack entry actually gets populated.

As a nice side effect this also reduces the number of IOMMU operations
in unmap_common(): Previously we would have "established" a readable
mapping whenever we didn't find a writable entry anymore (yet, of
course, at least one readable one). But we only need to do this if we
actually dropped the last writable entry, not if there were none already
before.

This is part of CVE-2021-28698 / XSA-380.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -37,6 +37,7 @@
 #include <xen/iommu.h>
 #include <xen/paging.h>
 #include <xen/keyhandler.h>
+#include <xen/radix-tree.h>
 #include <xen/vmap.h>
 #include <xen/nospec.h>
 #include <xsm/xsm.h>
@@ -82,8 +83,13 @@ struct grant_table {
     grant_status_t       **status;
     /* Active grant table. */
     struct active_grant_entry **active;
-    /* Mapping tracking table per vcpu. */
+    /* Handle-indexed tracking table of mappings. */
     struct grant_mapping **maptrack;
+    /*
+     * MFN-indexed tracking tree of mappings, if needed.  Note that this is
+     * protected by @lock, not @maptrack_lock.
+     */
+    struct radix_tree_root maptrack_tree;
 
     /* Domain to which this struct grant_table belongs. */
     const struct domain *domain;
@@ -501,34 +507,6 @@ static int get_paged_frame(unsigned long
     return GNTST_okay;
 }
 
-static inline void
-double_gt_lock(struct grant_table *lgt, struct grant_table *rgt)
-{
-    /*
-     * See mapkind() for why the write lock is also required for the
-     * remote domain.
-     */
-    if ( lgt < rgt )
-    {
-        grant_write_lock(lgt);
-        grant_write_lock(rgt);
-    }
-    else
-    {
-        if ( lgt != rgt )
-            grant_write_lock(rgt);
-        grant_write_lock(lgt);
-    }
-}
-
-static inline void
-double_gt_unlock(struct grant_table *lgt, struct grant_table *rgt)
-{
-    grant_write_unlock(lgt);
-    if ( lgt != rgt )
-        grant_write_unlock(rgt);
-}
-
 #define INVALID_MAPTRACK_HANDLE UINT_MAX
 
 static inline grant_handle_t
@@ -948,41 +926,17 @@ static struct active_grant_entry *grant_
     return ERR_PTR(-EINVAL);
 }
 
-#define MAPKIND_READ 1
-#define MAPKIND_WRITE 2
-static unsigned int mapkind(
-    struct grant_table *lgt, const struct domain *rd, mfn_t mfn)
-{
-    struct grant_mapping *map;
-    grant_handle_t handle, limit = lgt->maptrack_limit;
-    unsigned int kind = 0;
-
-    /*
-     * Must have the local domain's grant table write lock when
-     * iterating over its maptrack entries.
-     */
-    ASSERT(percpu_rw_is_write_locked(&lgt->lock));
-    /*
-     * Must have the remote domain's grant table write lock while
-     * counting its active entries.
-     */
-    ASSERT(percpu_rw_is_write_locked(&rd->grant_table->lock));
-
-    smp_rmb();
-
-    for ( handle = 0; !(kind & MAPKIND_WRITE) && handle < limit; handle++ )
-    {
-        map = &maptrack_entry(lgt, handle);
-        if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) ||
-             map->domid != rd->domain_id )
-            continue;
-        if ( mfn_eq(_active_entry(rd->grant_table, map->ref).mfn, mfn) )
-            kind |= map->flags & GNTMAP_readonly ?
-                    MAPKIND_READ : MAPKIND_WRITE;
-    }
-
-    return kind;
-}
+union maptrack_node {
+    struct {
+        /* Radix tree slot pointers use two of the bits. */
+#ifdef __BIG_ENDIAN_BITFIELD
+        unsigned long    : 2;
+#endif
+        unsigned long rd : BITS_PER_LONG / 2 - 1;
+        unsigned long wr : BITS_PER_LONG / 2 - 1;
+    } cnt;
+    unsigned long raw;
+};
 
 static void
 map_grant_ref(
@@ -1001,7 +955,6 @@ map_grant_ref(
     struct grant_mapping *mt;
     grant_entry_header_t *shah;
     uint16_t *status;
-    bool_t need_iommu;
 
     ld = current->domain;
 
@@ -1220,31 +1173,75 @@ map_grant_ref(
      * as mem-sharing and IOMMU use are incompatible). The dom_io case would
      * need checking separately if we compared against owner here.
      */
-    need_iommu = ld != rd && gnttab_need_iommu_mapping(ld);
-    if ( need_iommu )
+    if ( ld != rd && gnttab_need_iommu_mapping(ld) )
     {
+        union maptrack_node node = {
+            .cnt.rd = !!(op->flags & GNTMAP_readonly),
+            .cnt.wr = !(op->flags & GNTMAP_readonly),
+        };
+        int err;
+        void **slot = NULL;
         unsigned int kind;
 
-        double_gt_lock(lgt, rgt);
+        grant_write_lock(lgt);
+
+        err = radix_tree_insert(&lgt->maptrack_tree, mfn_x(mfn),
+                                radix_tree_ulong_to_ptr(node.raw));
+        if ( err == -EEXIST )
+        {
+            slot = radix_tree_lookup_slot(&lgt->maptrack_tree, mfn_x(mfn));
+            if ( likely(slot) )
+            {
+                node.raw = radix_tree_ptr_to_ulong(*slot);
+                err = -EBUSY;
+
+                /* Update node only when refcount doesn't overflow. */
+                if ( op->flags & GNTMAP_readonly ? ++node.cnt.rd
+                                                 : ++node.cnt.wr )
+                {
+                    radix_tree_replace_slot(slot,
+                                            radix_tree_ulong_to_ptr(node.raw));
+                    err = 0;
+                }
+            }
+            else
+                ASSERT_UNREACHABLE();
+        }
 
         /*
          * We're not translated, so we know that dfns and mfns are
          * the same things, so the IOMMU entry is always 1-to-1.
          */
-        kind = mapkind(lgt, rd, mfn);
-        if ( !(op->flags & GNTMAP_readonly) &&
-             !(kind & MAPKIND_WRITE) )
+        if ( !(op->flags & GNTMAP_readonly) && node.cnt.wr == 1 )
             kind = IOMMUF_readable | IOMMUF_writable;
-        else if ( !kind )
+        else if ( (op->flags & GNTMAP_readonly) &&
+                  node.cnt.rd == 1 && !node.cnt.wr )
             kind = IOMMUF_readable;
         else
             kind = 0;
-        if ( kind && iommu_legacy_map(ld, _dfn(mfn_x(mfn)), mfn, 0, kind) )
+        if ( err ||
+             (kind && iommu_legacy_map(ld, _dfn(mfn_x(mfn)), mfn, 0, kind)) )
         {
-            double_gt_unlock(lgt, rgt);
+            if ( !err )
+            {
+                if ( slot )
+                {
+                    op->flags & GNTMAP_readonly ? node.cnt.rd--
+                                                : node.cnt.wr--;
+                    radix_tree_replace_slot(slot,
+                                            radix_tree_ulong_to_ptr(node.raw));
+                }
+                else
+                    radix_tree_delete(&lgt->maptrack_tree, mfn_x(mfn));
+            }
+
             rc = GNTST_general_error;
-            goto undo_out;
         }
+
+        grant_write_unlock(lgt);
+
+        if ( rc != GNTST_okay )
+            goto undo_out;
     }
 
     TRACE_1D(TRC_MEM_PAGE_GRANT_MAP, op->dom);
@@ -1252,10 +1249,6 @@ map_grant_ref(
     /*
      * All maptrack entry users check mt->flags first before using the
      * other fields so just ensure the flags field is stored last.
-     *
-     * However, if gnttab_need_iommu_mapping() then this would race
-     * with a concurrent mapkind() call (on an unmap, for example)
-     * and a lock is required.
      */
     mt = &maptrack_entry(lgt, handle);
     mt->domid = op->dom;
@@ -1263,9 +1256,6 @@ map_grant_ref(
     smp_wmb();
     write_atomic(&mt->flags, op->flags);
 
-    if ( need_iommu )
-        double_gt_unlock(lgt, rgt);
-
     op->dev_bus_addr = mfn_to_maddr(mfn);
     op->handle       = handle;
     op->status       = GNTST_okay;
@@ -1487,19 +1477,34 @@ unmap_common(
     /* See the respective comment in map_grant_ref(). */
     if ( rc == GNTST_okay && ld != rd && gnttab_need_iommu_mapping(ld) )
     {
-        unsigned int kind;
+        void **slot;
+        union maptrack_node node;
         int err = 0;
 
-        double_gt_lock(lgt, rgt);
+        grant_write_lock(lgt);
+        slot = radix_tree_lookup_slot(&lgt->maptrack_tree, mfn_x(op->mfn));
+        node.raw = likely(slot) ? radix_tree_ptr_to_ulong(*slot) : 0;
+
+        /* Refcount must not underflow. */
+        if ( !(flags & GNTMAP_readonly ? node.cnt.rd--
+                                       : node.cnt.wr--) )
+            BUG();
 
-        kind = mapkind(lgt, rd, op->mfn);
-        if ( !kind )
+        if ( !node.raw )
             err = iommu_legacy_unmap(ld, _dfn(mfn_x(op->mfn)), 0);
-        else if ( !(kind & MAPKIND_WRITE) )
+        else if ( !(flags & GNTMAP_readonly) && !node.cnt.wr )
             err = iommu_legacy_map(ld, _dfn(mfn_x(op->mfn)), op->mfn, 0,
                                    IOMMUF_readable);
 
-        double_gt_unlock(lgt, rgt);
+        if ( err )
+            ;
+        else if ( !node.raw )
+            radix_tree_delete(&lgt->maptrack_tree, mfn_x(op->mfn));
+        else
+            radix_tree_replace_slot(slot,
+                                    radix_tree_ulong_to_ptr(node.raw));
+
+        grant_write_unlock(lgt);
 
         if ( err )
             rc = GNTST_general_error;
@@ -1951,6 +1956,8 @@ int grant_table_init(struct domain *d, i
         gt->maptrack = vzalloc(gt->max_maptrack_frames * sizeof(*gt->maptrack));
         if ( gt->maptrack == NULL )
             goto out;
+
+        radix_tree_init(&gt->maptrack_tree);
     }
 
     /* Shared grant table. */
@@ -3734,6 +3741,7 @@ int gnttab_release_mappings(struct domai
     for ( handle = gt->maptrack_limit; handle; )
     {
         unsigned int clear_flags = 0;
+        mfn_t mfn;
 
         /*
          * Deal with full pages such that their freeing (in the body of the
@@ -3839,17 +3847,31 @@ int gnttab_release_mappings(struct domai
         if ( clear_flags )
             gnttab_clear_flags(rd, clear_flags, status);
 
+        mfn = act->mfn;
+
         active_entry_release(act);
         grant_read_unlock(rgt);
 
         rcu_unlock_domain(rd);
 
         map->flags = 0;
+
+        /*
+         * This is excessive in that a single such call would suffice per
+         * mapped MFN (or none at all, if no entry was ever inserted). But it
+         * should be the common case for an MFN to be mapped just once, and
+         * this way we don't need to further maintain the counters. We also
+         * don't want to leave cleaning up of the tree as a whole to the end
+         * of the function, as this could take quite some time.
+         */
+        radix_tree_delete(&gt->maptrack_tree, mfn_x(mfn));
     }
 
     gt->maptrack_limit = 0;
     FREE_XENHEAP_PAGE(gt->maptrack[0]);
 
+    radix_tree_destroy(&gt->maptrack_tree, NULL);
+
     return 0;
 }
 
--- a/xen/include/xen/radix-tree.h
+++ b/xen/include/xen/radix-tree.h
@@ -190,6 +190,25 @@ static inline int radix_tree_ptr_to_int(
     return (int)((long)ptr >> 2);
 }
 
+/**
+ * radix_tree_{ulong_to_ptr,ptr_to_ulong}:
+ *
+ * Same for unsigned long values. Beware though that only BITS_PER_LONG-2
+ * bits are actually usable for the value.
+ */
+static inline void *radix_tree_ulong_to_ptr(unsigned long val)
+{
+    unsigned long ptr = (val << 2) | 0x2;
+    ASSERT((ptr >> 2) == val);
+    return (void *)ptr;
+}
+
+static inline unsigned long radix_tree_ptr_to_ulong(void *ptr)
+{
+    ASSERT(((unsigned long)ptr & 0x3) == 0x2);
+    return (unsigned long)ptr >> 2;
+}
+
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);

                 reply	other threads:[~2021-09-01  9:31 UTC|newest]

Thread overview: [no followups] expand[flat|nested]  mbox.gz  Atom feed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=E1mLMa2-0006VB-JU@xenbits.xenproject.org \
    --to=security@xen.org \
    --cc=oss-security@lists.openwall.com \
    --cc=security-team-members@xen.org \
    --cc=xen-announce@lists.xen.org \
    --cc=xen-devel@lists.xen.org \
    --cc=xen-users@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.