* Xen Security Advisory 378 v3 (CVE-2021-28694,CVE-2021-28695,CVE-2021-28696) - IOMMU page mapping issues on x86
@ 2021-09-01  9:30 Xen.org security team
From: Xen.org security team @ 2021-09-01  9:30 UTC (permalink / raw)
  To: xen-announce, xen-devel, xen-users, oss-security; +Cc: Xen.org security team

[-- Attachment #1: Type: text/plain, Size: 10685 bytes --]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

 Xen Security Advisory CVE-2021-28694,CVE-2021-28695,CVE-2021-28696 / XSA-378
                                   version 3

                   IOMMU page mapping issues on x86

UPDATES IN VERSION 3
====================

Warn about dom0=pvh breakage in Resolution section.

ISSUE DESCRIPTION
=================

Both AMD and Intel allow ACPI tables to specify regions of memory
which should be left untranslated, which typically means these
addresses should pass the translation phase unaltered.  While such
regions are usually device-specific ACPI properties, they can also be
specified to apply to a range of devices, or even all devices.

On all systems with such regions, Xen failed to prevent guests from
undoing/replacing such mappings (CVE-2021-28694).

On AMD systems, where firmware specifies a discontiguous set of such
ranges, the supposedly-excluded region in between will also be
identity-mapped (CVE-2021-28695).

Further, on AMD systems, upon de-assignment of a physical device from
a guest, the identity mappings would be left in place, allowing a
guest continued access to ranges of memory which it shouldn't have
access to anymore (CVE-2021-28696).
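
As a concrete illustration of the CVE-2021-28695 aspect, the
following standalone C sketch (illustrative only, not Xen code; all
values are made up) shows how folding two firmware-specified ranges
into a single covering range silently includes the gap between them:

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* A half-open [base, end) range as firmware might specify. */
  struct range { uint64_t base, end; };

  static bool contains(struct range r, uint64_t addr)
  {
      return addr >= r.base && addr < r.end;
  }

  /* The flawed pre-fix behaviour, reduced to its essence: two
   * distinct ranges get folded into one covering range. */
  static struct range merge_blindly(struct range a, struct range b)
  {
      struct range r = {
          a.base < b.base ? a.base : b.base,
          a.end  > b.end  ? a.end  : b.end,
      };
      return r;
  }

  int main(void)
  {
      struct range lo = { 0x1000, 0x2000 }, hi = { 0x8000, 0x9000 };
      struct range merged = merge_blindly(lo, hi);

      /* 0x5000 lies in neither firmware range, yet the merged range
       * (and hence the resulting identity mapping) covers it. */
      assert(!contains(lo, 0x5000) && !contains(hi, 0x5000));
      assert(contains(merged, 0x5000));
      return 0;
  }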

IMPACT
======

The precise impact is system specific, but can - on affected systems -
be any or all of privilege escalation, denial of service, or information
leaks.

VULNERABLE SYSTEMS
==================

The vulnerability is only exploitable by guests granted access to
physical devices (i.e., via PCI passthrough).

All versions of Xen are affected.

Only x86 systems with IOMMUs and with firmware specifying memory regions
to be identity mapped are affected.  Other x86 systems are not affected.

Whether a particular system whose ACPI tables declare such memory
region(s) is actually affected cannot be determined without knowing
when and/or how these regions are used.  For example, if these
regions were used only during system boot, there would not be any
vulnerability.
The necessary knowledge can only be obtained from, collectively, the
hardware and firmware manufacturers.

On Arm hardware IOMMU use is not security supported.  Accordingly, we
have not undertaken an analysis of these issues for Arm systems.

MITIGATION
==========

Not permitting untrusted guests access to physical devices will avoid
the vulnerability.

Likewise, limiting untrusted guest access to physical devices whose
firmware-provided ACPI tables declare identity mappings will avoid
the vulnerability.  (Provided that there are no identity-mapped
regions which are specified by the ACPI tables to apply globally.)

Note that a system remains vulnerable if a guest was trusted while it
had such a device assigned, and the device is then removed in
anticipation of the guest becoming untrusted (because of, for
example, the insertion of an untrusted kernel module).
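
As a hedged sketch of how to gather the relevant information (tool
names and flags vary by distribution; treat these commands as
illustrative), the devices currently exposed for passthrough can be
listed and the firmware tables inspected for such declarations:

  # Devices currently made assignable for PCI passthrough:
  xl pci-assignable-list

  # AMD: dump and disassemble the IVRS table, then look for IVMD
  # entries (unity map / exclusion range definitions):
  acpidump -n IVRS -o ivrs.hex
  acpixtract -a ivrs.hex     # extracts a binary ivrs.dat
  iasl -d ivrs.dat           # produces human-readable ivrs.dsl

  # Intel: likewise for the DMAR table; look for RMRR entries.
  acpidump -n DMAR -o dmar.hex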

CREDITS
=======

This issue was discovered by Jan Beulich of SUSE.

RESOLUTION
==========

Applying the appropriate set of attached patches resolves this issue.

However, these patches are known to badly break PVH dom0 support.
Work to resolve the problem is ongoing; fixes will be committed to
the appropriate Xen trees (including backports) when ready.  PVH dom0
has "Technical Preview" status, so we do not intend to distribute the
fixes via an update to this advisory.

Note that patches for released versions are generally prepared to
apply to the stable branches, and may not apply cleanly to the most
recent release tarball.  Downstreams are encouraged to update to the
tip of the stable branch before applying these patches.

xsa378/xsa378-?.patch           xen-unstable
xsa378/xsa378-4.15-?.patch      Xen 4.15.x
xsa378/xsa378-4.14-?.patch      Xen 4.14.x
xsa378/xsa378-4.13-*.patch      Xen 4.13.x
xsa378/xsa378-4.12-*.patch      Xen 4.12.x
xsa378/xsa378-4.11-*.patch      Xen 4.11.x
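
As a sketch of the intended workflow (assuming the usual xenbits
repository layout and branch naming), fetching, verifying and
applying the Xen 4.15 series might look like:

  $ sha256sum xsa378* xsa378*/*    # compare against the list below
  $ git clone -b stable-4.15 https://xenbits.xen.org/git-http/xen.git
  $ cd xen
  $ git am ../xsa378/xsa378-4.15-*.patch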

$ sha256sum xsa378* xsa378*/*
b3b49681468bd2be4b95fc1d3861493bbea63bac1c115a12ece76d0313b6a81c  xsa378.meta
2560d4418e6b6d022b4e7fe1f84906ca13b3180537e746ee81cd3f97f0e86c35  xsa378/xsa378-1.patch
134b207c38f8d76bbc218220f1e61f82780e8166818ef51217716023cc27ce3c  xsa378/xsa378-2.patch
c166ef60e08a80440f79d92abe8488b123c8017e7633dbddacb95b20899f9b2b  xsa378/xsa378-3.patch
a2677e157724f57f65165f361a309c417d123ff009168f16d3a0e342b7a601eb  xsa378/xsa378-4.11-0a.patch
832f52f420bb47784bdefd75edd1ca521658327aafb94b4f3586da522d0d19c1  xsa378/xsa378-4.11-0b.patch
9eb7c53892f01424eec6b788f879b36dd75a6f89b1c8b9c973237603525d6546  xsa378/xsa378-4.11-0c.patch
48021fb9a2c52d7939fd11bbba80c67a5a5b6cae2039b4b2a2c6203eb5841260  xsa378/xsa378-4.11-1.patch
4f3fd39aa328c1e397aa2a10b92e98cc341908a8834302a4fc0f230876dd3570  xsa378/xsa378-4.11-2.patch
6e2eeecd2ccc4029f80888263f44b80acf078d69aeffda4d6accd8b2d0899182  xsa378/xsa378-4.11-3.patch
194b27510c8765f1cf8abee57624a075ab26f4f889aa8c201e126643f570f55a  xsa378/xsa378-4.11-4.patch
e8f4eab0984a7db2ca0c6c9ac0fb9107e45ddd572a36454db4a1bdb8b9a8b0c6  xsa378/xsa378-4.11-5.patch
943c7556be47b1a0f2273cf5691fe70dc65e6cb50ff0477d1d61eb3f0ba87a97  xsa378/xsa378-4.11-6.patch
2eb572e1a55caa4aca31f74a3844135a204c4e023a6776d7adc4f4043663fb99  xsa378/xsa378-4.11-7.patch
7c24ba33461d3adcb195a6611df3e9c0501e9dcacf9b8811456450b291d23edf  xsa378/xsa378-4.11-8.patch
fe14ea6699df673b787f93ff821838f2280407674df0c8c06d40efe8320e8748  xsa378/xsa378-4.12-0a.patch
1ba10aef9ea99c4455a34e61792ca65ea6b2ece56f05cd0b4adc14600a7ae346  xsa378/xsa378-4.12-0b.patch
2c2dbc0c18b695c1d2c93e4137228f82b6d21d895c3165a9e29ddea3db78e36a  xsa378/xsa378-4.12-0c.patch
48021fb9a2c52d7939fd11bbba80c67a5a5b6cae2039b4b2a2c6203eb5841260  xsa378/xsa378-4.12-1.patch
f0d62506ccdf081d0efbb553e2d33f17a084272d21f43bf95d11fda2dcc8a3fa  xsa378/xsa378-4.12-2.patch
e266512b18fc30e5bd4884cd720aa81644d3ca3323b38fe1f77d06fa98dd515d  xsa378/xsa378-4.12-3.patch
076f9955593a8ebdea5f24ed302b8a1004bbf50da4c8becf1f93764066b8981c  xsa378/xsa378-4.12-4.patch
9639bc35636ffae4e2b6dd026387347960c1fb986cb2924a18314a72e6b6ec0a  xsa378/xsa378-4.12-5.patch
32dd659aa365d9f8197d99b9334c22b33e9ce805e30376457df6e507f92282c8  xsa378/xsa378-4.12-6.patch
7eaae2fa968eb11e27277141456cc1bd657025aee738221c368e153535c7c0f4  xsa378/xsa378-4.12-7.patch
05982f43f35b580ff41f74b1280e469e0dd20176f184ec04a4874303c2aa3ad1  xsa378/xsa378-4.12-8.patch
c6f551ef9903a343b47692b34a63c70165a2dbf74878a6be6511cdffd55a7e8c  xsa378/xsa378-4.13-0a.patch
223cd63f7e1c39d862b8654da698455ded65ecb5abae0f57c330921522b7fdb4  xsa378/xsa378-4.13-0b.patch
5c33aa24f14e779dfe914e809cff11260083169a3adfb07a31ce11243d80b3ef  xsa378/xsa378-4.13-0c.patch
1d55426ff6a41f0ef4cbd2c943edafff394157703bb0b6ae751564abf93b5ee7  xsa378/xsa378-4.13-1.patch
4d5e7d5e65cd28d6bc7d1a9f2ab24f09dfaef295c4199d5f2db00915dcaa174f  xsa378/xsa378-4.13-2.patch
78df0bcb347f8bb45827f74b191aad36b6e907eb38c6d535035f2b2739645551  xsa378/xsa378-4.13-3.patch
237c33e0ae01a23db01721afb8e6a39101bfe081f8b75dbcff6b9fa9c9aaceda  xsa378/xsa378-4.13-4.patch
7d2f3ae3881d28073be54a6dfb35f13004e4efee742952788430201d86307ecf  xsa378/xsa378-4.13-5.patch
2ecd7580394667db0c41e4819025393e59ae24d6c97d54451c8e683585057367  xsa378/xsa378-4.13-6.patch
592a03d00e5d22d7a3c001681968dc469c70b3e57998b95877388e9528904ea2  xsa378/xsa378-4.13-7.patch
54e6a095f706c66dbbd74e39aa1d88031c9b537589e73ceb2925e2f0cc1854f0  xsa378/xsa378-4.13-8.patch
1d55426ff6a41f0ef4cbd2c943edafff394157703bb0b6ae751564abf93b5ee7  xsa378/xsa378-4.14-1.patch
86fbd88eb8a358575e42cf335c444b047ddb4d2f1c6a1bc6f9e57e6ac0041074  xsa378/xsa378-4.14-2.patch
0a23e5f93ff1bb55f003a56e0ef8c531384b164f0e840f5794acdb9ae3e91996  xsa378/xsa378-4.14-3.patch
e2226f7ddae1d24dbee8cf19efa8d67ebf312f3d10641cb9aec21d68a3c8f818  xsa378/xsa378-4.14-4.patch
f2cf8f7e4aa0460e606b5564adde366332ed323c4d5e3f957e64299ef1bc9baf  xsa378/xsa378-4.14-5.patch
7811426975757e3bb8c6ab3161ba2354e1780b55b9e6be7928229d5f23bf79b6  xsa378/xsa378-4.14-6.patch
58147bd6c0ea4e08e84a17afc796be4bbe53e6fbc1d393f9fe3c6191fd33eba5  xsa378/xsa378-4.14-7.patch
682a011d807a7c284faf0ec9d2cf0aabaddbc658979dea2b9ccbc007b660f9c5  xsa378/xsa378-4.14-8.patch
a00ada0cd673f0909cc7b462cb532dbd6fe17601e06bd84272f5ff1857ea4c73  xsa378/xsa378-4.15-1.patch
d74cc325be1e47d61ba3b1400837af35d044bf1d25806aa98926ec262f80bddd  xsa378/xsa378-4.15-2.patch
2ad685a04dbbdd2c81761b58146e70059b8a8a92b0c1176f36933510293ece5b  xsa378/xsa378-4.15-3.patch
18de9facccd70ce49dd839e219fe71667c43110e474e5d7b56a503a5786fc7e0  xsa378/xsa378-4.15-4.patch
296558b27ba82176f6d06b721102c8ba7c7e6e99d29b29392ea82244e88df0b9  xsa378/xsa378-4.15-5.patch
5d7cc84c66daf0aceab9407fa72f2827024c847f4ca10f8d123eee87b7451aba  xsa378/xsa378-4.15-6.patch
fd4aa4447562230a6684d17e6e4c55f1e48df4773247f66ab9d01181003bc9aa  xsa378/xsa378-4.15-7.patch
5305e3bc513bcd3e016ca3bbecc6cae38c8ed2b2eacb13e82f7b1f4401d3b67d  xsa378/xsa378-4.15-8.patch
78390bf59344ea5dcbfa5831db3634b3f1b3aabf029160c72cdbeeec2b46b2a9  xsa378/xsa378-4.patch
76fb4a1f2604b98d3a803744ad212b3984117ec8c8011ad6db3759f9337a9b5a  xsa378/xsa378-5.patch
a627f8c6b7d2ffac0b6de945189f96b718aeab8e0c8bb11476a40585d6411bd7  xsa378/xsa378-6.patch
28814b51fabd3c5cb3ca249ad291781f589cddd55fc8152fdd5668c5fcdc727c  xsa378/xsa378-7.patch
30e31749ade75fd5ab4c41fa27fe2124bdcc602ccc803a85e5eeec1b9c48a9c1  xsa378/xsa378-8.patch
$

DEPLOYMENT DURING EMBARGO
=========================

Deployment of the patches and/or mitigations described above (or
others which are substantially similar) is permitted during the
embargo, even on public-facing systems with untrusted guest users and
administrators.

But: Distribution of updated software is prohibited (except to other
members of the predisclosure list).

Predisclosure list members who wish to deploy significantly different
patches and/or mitigations, please contact the Xen Project Security
Team.


(Note: this during-embargo deployment notice is retained in
post-embargo publicly released Xen Project advisories, even though it
is then no longer applicable.  This is to enable the community to have
oversight of the Xen Project Security Team's decisionmaking.)

For more information about permissible uses of embargoed information,
consult the Xen Project community's agreed Security Policy:
  http://www.xenproject.org/security-policy.html
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmEvSCUMHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZrUYH/iYMkjMzUpv5ik/quZ+z34uXwq/mD8x8ROZKXDev
tgojXcFo0vNlcr71R0J5KrE0YeX+3MjbjqbF7A30cnw4/bgmvhzcdP/fwOfIKFYE
JqaGoDaxFRpsIMvHFXV6OxtjgBgleUukaikUrcwWUG+L90KMmjqZU4IAlNcBAALC
J6rDB8nHqD1s+ODIPdPE149jIE7LJfVqyuu9h4r/jmLEDRJBS2YEs6zgClqnIApl
3WBpZNhp8Tk3BFtbQcu4Uh9c0itymtp+8VsY0xzF2bWsE4aoe5DKHv0DyTSLe9Md
pMG0egH7ZrI3MV4nzHdbhkA2Hn3X0rsJiI7TXpmmQPmgi4g=
=A/oT
-----END PGP SIGNATURE-----

[-- Attachment #2: xsa378.meta --]
[-- Type: application/octet-stream, Size: 1756 bytes --]

{
  "XSA": 378,
  "SupportedVersions": [
    "master",
    "4.15",
    "4.14",
    "4.13",
    "4.12",
    "4.11"
  ],
  "Trees": [
    "xen"
  ],
  "Recipes": {
    "4.11": {
      "Recipes": {
        "xen": {
          "StableRef": "ef32c7afa2731b758226d6e10a1e489b1a15fc41",
          "Prereqs": [],
          "Patches": [
            "xsa378/xsa378-4.11-*.patch"
          ]
        }
      }
    },
    "4.12": {
      "Recipes": {
        "xen": {
          "StableRef": "ea20eee97e9e0861127a8070cc7b9ae3557b09fb",
          "Prereqs": [
            383
          ],
          "Patches": [
            "xsa378/xsa378-4.12-*.patch"
          ]
        }
      }
    },
    "4.13": {
      "Recipes": {
        "xen": {
          "StableRef": "32d580902b959000d79d51dff03a3560653c4fcb",
          "Prereqs": [
            383
          ],
          "Patches": [
            "xsa378/xsa378-4.13-*.patch"
          ]
        }
      }
    },
    "4.14": {
      "Recipes": {
        "xen": {
          "StableRef": "49299c4813b7847d29df07bf790f5489060f2a9c",
          "Prereqs": [
            383
          ],
          "Patches": [
            "xsa378/xsa378-4.14-*.patch"
          ]
        }
      }
    },
    "4.15": {
      "Recipes": {
        "xen": {
          "StableRef": "dba774896f7dd74773c14d537643b7d7477fefcd",
          "Prereqs": [
            383
          ],
          "Patches": [
            "xsa378/xsa378-4.15-*.patch"
          ]
        }
      }
    },
    "master": {
      "Recipes": {
        "xen": {
          "StableRef": "25da9455f1bb8a6d33039575a7b28bdfc4e3fcfe",
          "Prereqs": [
            383
          ],
          "Patches": [
            "xsa378/xsa378-?.patch"
          ]
        }
      }
    }
  }
}

[-- Attachment #3: xsa378/xsa378-1.patch --]
[-- Type: application/octet-stream, Size: 5637 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: correct global exclusion range extending

Besides unity mapping regions, the AMD IOMMU spec also provides for
exclusion ranges (areas of memory not to be subject to DMA translation)
to be specified by firmware in the ACPI tables. The spec does not put
any constraints on the number of such regions.

Blindly assuming all addresses between any two such ranges should also
be excluded can't be right. Since hardware has room for just a single
such range (comprised of the Exclusion Base Register and the Exclusion
Range Limit Register), combine only adjacent or overlapping regions (for
now; this may require further adjustment in case table entries aren't
sorted by address) with matching exclusion_allow_all settings. This
requires bubbling up error indicators, such that IOMMU init can be
failed when concatenation wasn't possible.

Furthermore, since the exclusion range specified in IOMMU registers
implies R/W access, reject requests asking for less permissions (this
will be brought closer to the spec by a subsequent change).

This is part of XSA-378 / CVE-2021-28695.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
---
v8: Add comments to plain true/false function arguments.
v4: Don't accept requests not allowing for both reading and writing.
    Don't accept mismatches in the exclusion_allow_all setting. Cover
    for the dropping of "correct limit calculation in IVMD parsing".
v2: Accept overlapping regions as well.
---
TBD: Should we assume random order of entries right away?
TBD: Andrew suggests there are indications of only one global exclusion
     range being allowed, which would then (presumably) be tied to
     ACPI_IVMD_EXCLUSION_RANGE (as per above).
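
As a hedged standalone illustration of the combination rule described
above (toy values; not part of the patch proper), two ranges may
share the single Exclusion Base/Limit register pair only when they
are adjacent within a page or overlap:

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define PAGE_SIZE 0x1000ULL

  static bool combinable(uint64_t base1, uint64_t limit1,
                         uint64_t base2, uint64_t limit2)
  {
      return !(limit1 + PAGE_SIZE < base2 ||
               limit2 + PAGE_SIZE < base1);
  }

  int main(void)
  {
      /* Touching ranges: OK to combine into the register pair. */
      assert(combinable(0x1000, 0x1fff, 0x2000, 0x2fff));
      /* Disjoint ranges with a gap: IOMMU init must fail instead. */
      assert(!combinable(0x1000, 0x1fff, 0x8000, 0x8fff));
      return 0;
  }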

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -116,12 +116,21 @@ static struct amd_iommu * __init find_io
     return NULL;
 }
 
-static void __init reserve_iommu_exclusion_range(
-    struct amd_iommu *iommu, uint64_t base, uint64_t limit)
+static int __init reserve_iommu_exclusion_range(
+    struct amd_iommu *iommu, uint64_t base, uint64_t limit,
+    bool all, bool iw, bool ir)
 {
+    if ( !ir || !iw )
+        return -EPERM;
+
     /* need to extend exclusion range? */
     if ( iommu->exclusion_enable )
     {
+        if ( iommu->exclusion_limit + PAGE_SIZE < base ||
+             limit + PAGE_SIZE < iommu->exclusion_base ||
+             iommu->exclusion_allow_all != all )
+            return -EBUSY;
+
         if ( iommu->exclusion_base < base )
             base = iommu->exclusion_base;
         if ( iommu->exclusion_limit > limit )
@@ -129,16 +138,11 @@ static void __init reserve_iommu_exclusi
     }
 
     iommu->exclusion_enable = IOMMU_CONTROL_ENABLED;
+    iommu->exclusion_allow_all = all;
     iommu->exclusion_base = base;
     iommu->exclusion_limit = limit;
-}
 
-static void __init reserve_iommu_exclusion_range_all(
-    struct amd_iommu *iommu,
-    unsigned long base, unsigned long limit)
-{
-    reserve_iommu_exclusion_range(iommu, base, limit);
-    iommu->exclusion_allow_all = IOMMU_CONTROL_ENABLED;
+    return 0;
 }
 
 static void __init reserve_unity_map_for_device(
@@ -176,6 +180,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     unsigned int bdf;
+    int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
@@ -197,10 +202,15 @@ static int __init register_exclusion_ran
     if ( limit >= iommu_top )
     {
         for_each_amd_iommu( iommu )
-            reserve_iommu_exclusion_range_all(iommu, base, limit);
+        {
+            rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                               true /* all */, iw, ir);
+            if ( rc )
+                break;
+        }
     }
 
-    return 0;
+    return rc;
 }
 
 static int __init register_exclusion_range_for_device(
@@ -211,6 +221,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     u16 req;
+    int rc = 0;
 
     iommu = find_iommu_for_device(seg, bdf);
     if ( !iommu )
@@ -240,12 +251,13 @@ static int __init register_exclusion_ran
     /* register IOMMU exclusion range settings for device */
     if ( limit >= iommu_top  )
     {
-        reserve_iommu_exclusion_range(iommu, base, limit);
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           false /* all */, iw, ir);
         ivrs_mappings[bdf].dte_allow_exclusion = true;
         ivrs_mappings[req].dte_allow_exclusion = true;
     }
 
-    return 0;
+    return rc;
 }
 
 static int __init register_exclusion_range_for_iommu_devices(
@@ -255,6 +267,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     unsigned int bdf;
     u16 req;
+    int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
@@ -285,8 +298,10 @@ static int __init register_exclusion_ran
 
     /* register IOMMU exclusion range settings */
     if ( limit >= iommu_top )
-        reserve_iommu_exclusion_range_all(iommu, base, limit);
-    return 0;
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           true /* all */, iw, ir);
+
+    return rc;
 }
 
 static int __init parse_ivmd_device_select(

[-- Attachment #4: xsa378/xsa378-2.patch --]
[-- Type: application/octet-stream, Size: 9300 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: correct device unity map handling

Blindly assuming all addresses between any two such ranges, specified by
firmware in the ACPI tables, should also be unity-mapped can't be right.
Nor can it be correct to merge ranges with differing permissions. Track
ranges individually; don't merge at all, but check for overlaps instead.
This requires bubbling up error indicators, such that IOMMU init can be
failed when allocation of a new tracking struct wasn't possible, or an
overlap was detected.

At this occasion also stop ignoring
amd_iommu_reserve_domain_unity_map()'s return value.

This is part of XSA-378 / CVE-2021-28695.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
---
TBD: Overlapping ranges could be dealt with (when they specify the same
     permissions) by splitting the ranges to non-overlapping pieces. See
     "IOMMU: generalize VT-d's tracking of mapped RMRR regions" and its
     user "AMD/IOMMU: re-arrange/complete re-assignment handling" for
     why overlaps can't be accepted with the chosen tracking model. But
     the base assumption is anyway for firmware to not surface
     overlapping ranges in the first place.
---
v8: Fix build (failure resulted from a piece of code living here which
    only a later patch should add). Re-base over adjustments to earlier
    patch.
v5: Don't allow overlaps (except exact matches) and don't merge.
    Re-base.
v4: Adjust title. Re-base over changes earlier in the series.
v2: Re-base over XSA-302.
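
The following hedged standalone sketch (toy values; not part of the
patch proper) shows the half-open interval overlap test used when a
new unity map is checked against the tracked list; exact duplicates
are tolerated, while partial overlaps are rejected:

  #include <assert.h>
  #include <stdbool.h>

  /* Half-open interval overlap test for [base, base+len) ranges. */
  static bool ranges_overlap(unsigned long base1, unsigned long len1,
                             unsigned long base2, unsigned long len2)
  {
      return base1 + len1 > base2 && base2 + len2 > base1;
  }

  int main(void)
  {
      assert(ranges_overlap(0x1000, 0x2000, 0x2000, 0x1000));
      assert(!ranges_overlap(0x1000, 0x1000, 0x3000, 0x1000));
      return 0;
  }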

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -107,20 +107,24 @@ struct amd_iommu {
     struct list_head ats_devices;
 };
 
+struct ivrs_unity_map {
+    bool read:1;
+    bool write:1;
+    paddr_t addr;
+    unsigned long length;
+    struct ivrs_unity_map *next;
+};
+
 struct ivrs_mappings {
     uint16_t dte_requestor_id;
     bool valid:1;
     bool dte_allow_exclusion:1;
-    bool unity_map_enable:1;
-    bool write_permission:1;
-    bool read_permission:1;
 
     /* ivhd device data settings */
     uint8_t device_flags;
 
-    unsigned long addr_range_start;
-    unsigned long addr_range_length;
     struct amd_iommu *iommu;
+    struct ivrs_unity_map *unity_map;
 
     /* per device interrupt remapping table */
     void *intremap_table;
--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -145,32 +145,48 @@ static int __init reserve_iommu_exclusio
     return 0;
 }
 
-static void __init reserve_unity_map_for_device(
-    u16 seg, u16 bdf, unsigned long base,
-    unsigned long length, u8 iw, u8 ir)
+static int __init reserve_unity_map_for_device(
+    uint16_t seg, uint16_t bdf, unsigned long base,
+    unsigned long length, bool iw, bool ir)
 {
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
-    unsigned long old_top, new_top;
+    struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map;
 
-    /* need to extend unity-mapped range? */
-    if ( ivrs_mappings[bdf].unity_map_enable )
+    /* Check for overlaps. */
+    for ( ; unity_map; unity_map = unity_map->next )
     {
-        old_top = ivrs_mappings[bdf].addr_range_start +
-            ivrs_mappings[bdf].addr_range_length;
-        new_top = base + length;
-        if ( old_top > new_top )
-            new_top = old_top;
-        if ( ivrs_mappings[bdf].addr_range_start < base )
-            base = ivrs_mappings[bdf].addr_range_start;
-        length = new_top - base;
-    }
-
-    /* extend r/w permissioms and keep aggregate */
-    ivrs_mappings[bdf].write_permission = iw;
-    ivrs_mappings[bdf].read_permission = ir;
-    ivrs_mappings[bdf].unity_map_enable = true;
-    ivrs_mappings[bdf].addr_range_start = base;
-    ivrs_mappings[bdf].addr_range_length = length;
+        /*
+         * Exact matches are okay. This can in particular happen when
+         * register_exclusion_range_for_device() calls here twice for the
+         * same (s,b,d,f).
+         */
+        if ( base == unity_map->addr && length == unity_map->length &&
+             ir == unity_map->read && iw == unity_map->write )
+            return 0;
+
+        if ( unity_map->addr + unity_map->length > base &&
+             base + length > unity_map->addr )
+        {
+            AMD_IOMMU_DEBUG("IVMD Error: overlap [%lx,%lx) vs [%lx,%lx)\n",
+                            base, base + length, unity_map->addr,
+                            unity_map->addr + unity_map->length);
+            return -EPERM;
+        }
+    }
+
+    /* Populate and insert a new unity map. */
+    unity_map = xmalloc(struct ivrs_unity_map);
+    if ( !unity_map )
+        return -ENOMEM;
+
+    unity_map->read = ir;
+    unity_map->write = iw;
+    unity_map->addr = base;
+    unity_map->length = length;
+    unity_map->next = ivrs_mappings[bdf].unity_map;
+    ivrs_mappings[bdf].unity_map = unity_map;
+
+    return 0;
 }
 
 static int __init register_exclusion_range_for_all_devices(
@@ -193,13 +209,13 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve r/w unity-mapped page entries for devices */
         /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
-            reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
+        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
         /* push 'base' just outside of virtual address space */
         base = iommu_top;
     }
     /* register IOMMU exclusion range settings */
-    if ( limit >= iommu_top )
+    if ( !rc && limit >= iommu_top )
     {
         for_each_amd_iommu( iommu )
         {
@@ -241,15 +257,15 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve unity-mapped page entries for device */
         /* note: these entries are part of the exclusion range */
-        reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
-        reserve_unity_map_for_device(seg, req, base, length, iw, ir);
+        rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?:
+             reserve_unity_map_for_device(seg, req, base, length, iw, ir);
 
         /* push 'base' just outside of virtual address space */
         base = iommu_top;
     }
 
     /* register IOMMU exclusion range settings for device */
-    if ( limit >= iommu_top  )
+    if ( !rc && limit >= iommu_top  )
     {
         rc = reserve_iommu_exclusion_range(iommu, base, limit,
                                            false /* all */, iw, ir);
@@ -280,15 +296,15 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve r/w unity-mapped page entries for devices */
         /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
+        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
         {
             if ( iommu == find_iommu_for_device(iommu->seg, bdf) )
             {
-                reserve_unity_map_for_device(iommu->seg, bdf, base, length,
-                                             iw, ir);
                 req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
-                reserve_unity_map_for_device(iommu->seg, req, base, length,
-                                             iw, ir);
+                rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
+                                                  iw, ir) ?:
+                     reserve_unity_map_for_device(iommu->seg, req, base, length,
+                                                  iw, ir);
             }
         }
 
@@ -297,7 +313,7 @@ static int __init register_exclusion_ran
     }
 
     /* register IOMMU exclusion range settings */
-    if ( limit >= iommu_top )
+    if ( !rc && limit >= iommu_top )
         rc = reserve_iommu_exclusion_range(iommu, base, limit,
                                            true /* all */, iw, ir);
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -384,15 +384,17 @@ static int amd_iommu_assign_device(struc
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int bdf = PCI_BDF2(pdev->bus, devfn);
     int req_id = get_dma_requestor_id(pdev->seg, bdf);
+    const struct ivrs_unity_map *unity_map;
 
-    if ( ivrs_mappings[req_id].unity_map_enable )
+    for ( unity_map = ivrs_mappings[req_id].unity_map; unity_map;
+          unity_map = unity_map->next )
     {
-        amd_iommu_reserve_domain_unity_map(
-            d,
-            ivrs_mappings[req_id].addr_range_start,
-            ivrs_mappings[req_id].addr_range_length,
-            ivrs_mappings[req_id].write_permission,
-            ivrs_mappings[req_id].read_permission);
+        int rc = amd_iommu_reserve_domain_unity_map(
+                     d, unity_map->addr, unity_map->length,
+                     unity_map->write, unity_map->read);
+
+        if ( rc )
+            return rc;
     }
 
     return reassign_device(pdev->domain, d, devfn, pdev);

[-- Attachment #5: xsa378/xsa378-3.patch --]
[-- Type: application/octet-stream, Size: 4201 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU: also pass p2m_access_t to p2m_get_iommu_flags()

A subsequent change will want to customize the IOMMU permissions based
on this.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
---
v5: New.
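
The following is a hedged standalone sketch (not part of the patch
proper; the flag values are illustrative stand-ins) of the
flags-to-access reconstruction this change introduces:

  #include <assert.h>

  #define PAGE_PRESENT 0x1u   /* illustrative, not Xen's headers */
  #define PAGE_RW      0x2u

  enum access { ACCESS_N, ACCESS_R, ACCESS_RW };

  static enum access flags_to_access(unsigned int flags)
  {
      if ( !(flags & PAGE_PRESENT) )
          return ACCESS_N;    /* not present: no access at all */

      return flags & PAGE_RW ? ACCESS_RW : ACCESS_R;
  }

  int main(void)
  {
      assert(flags_to_access(0) == ACCESS_N);
      assert(flags_to_access(PAGE_PRESENT) == ACCESS_R);
      assert(flags_to_access(PAGE_PRESENT | PAGE_RW) == ACCESS_RW);
      return 0;
  }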

--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -808,7 +808,7 @@ ept_set_entry(struct p2m_domain *p2m, gf
     bool_t entry_written = 0;
     bool_t need_modify_vtd_table = 1;
     bool_t vtd_pte_present = 0;
-    unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, mfn);
+    unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, p2ma, mfn);
     bool_t needs_sync = 1;
     ept_entry_t old_entry = { .epte = 0 };
     ept_entry_t new_entry = { .epte = 0 };
@@ -938,8 +938,8 @@ ept_set_entry(struct p2m_domain *p2m, gf
 
         /* Safe to read-then-write because we hold the p2m lock */
         if ( ept_entry->mfn == new_entry.mfn &&
-             p2m_get_iommu_flags(ept_entry->sa_p2mt, _mfn(ept_entry->mfn)) ==
-             iommu_flags )
+             p2m_get_iommu_flags(ept_entry->sa_p2mt, ept_entry->access,
+                                 _mfn(ept_entry->mfn)) == iommu_flags )
             need_modify_vtd_table = 0;
 
         ept_p2m_type_to_flags(p2m, &new_entry);
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -545,6 +545,16 @@ int p2m_pt_handle_deferred_changes(uint6
     return rc;
 }
 
+/* Reconstruct a fake p2m_access_t from stored PTE flags. */
+static p2m_access_t p2m_flags_to_access(unsigned int flags)
+{
+    if ( !(flags & _PAGE_PRESENT) )
+        return p2m_access_n;
+
+    /* No need to look at _PAGE_NX for now. */
+    return flags & _PAGE_RW ? p2m_access_rw : p2m_access_r;
+}
+
 /* Checks only applicable to entries with order > PAGE_ORDER_4K */
 static void check_entry(mfn_t mfn, p2m_type_t new, p2m_type_t old,
                         unsigned int order)
@@ -579,7 +589,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
     l2_pgentry_t l2e_content;
     l3_pgentry_t l3e_content;
     int rc;
-    unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, mfn);
+    unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, p2ma, mfn);
     /*
      * old_mfn and iommu_old_flags control possible flush/update needs on the
      * IOMMU: We need to flush when MFN or flags (i.e. permissions) change.
@@ -642,6 +652,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                 old_mfn = l1e_get_pfn(*p2m_entry);
                 iommu_old_flags =
                     p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                        p2m_flags_to_access(flags),
                                         _mfn(old_mfn));
             }
             else
@@ -684,9 +695,10 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                                    0, L1_PAGETABLE_ENTRIES);
         ASSERT(p2m_entry);
         old_mfn = l1e_get_pfn(*p2m_entry);
+        flags = l1e_get_flags(*p2m_entry);
         iommu_old_flags =
-            p2m_get_iommu_flags(p2m_flags_to_type(l1e_get_flags(*p2m_entry)),
-                                _mfn(old_mfn));
+            p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                p2m_flags_to_access(flags), _mfn(old_mfn));
 
         if ( mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt) )
             entry_content = p2m_l1e_from_pfn(mfn_x(mfn),
@@ -714,6 +726,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                 old_mfn = l1e_get_pfn(*p2m_entry);
                 iommu_old_flags =
                     p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                        p2m_flags_to_access(flags),
                                         _mfn(old_mfn));
             }
             else
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -891,7 +891,8 @@ static inline void p2m_altp2m_check(stru
 /*
  * p2m type to IOMMU flags
  */
-static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
+static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt,
+                                               p2m_access_t p2ma, mfn_t mfn)
 {
     unsigned int flags;
 

[-- Attachment #6: xsa378/xsa378-4.11-0a.patch --]
[-- Type: application/octet-stream, Size: 2478 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: fix PoD accounting in guest_physmap_add_entry()

The initial observation was that the mfn_valid() check comes too late:
Neither mfn_add() nor mfn_to_page() (let alone de-referencing the
result of the latter) are valid for MFNs failing this check. Move it up
and - noticing that there's no caller doing so - also add an assertion
that this should never produce "false" here.

In turn this would have meant that the "else" to that if() could now go
away, which didn't seem right at all. And indeed, considering callers
like memory_exchange() or various grant table functions, the PoD
accounting should have been outside of that if() from the very
beginning.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -794,6 +794,12 @@ guest_physmap_add_entry(struct domain *d
     if ( p2m_is_foreign(t) )
         return -EINVAL;
 
+    if ( !mfn_valid(mfn) )
+    {
+        ASSERT_UNREACHABLE();
+        return -EINVAL;
+    }
+
     p2m_lock(p2m);
 
     P2M_DEBUG("adding gfn=%#lx mfn=%#lx\n", gfn_x(gfn), mfn_x(mfn));
@@ -894,12 +900,13 @@ guest_physmap_add_entry(struct domain *d
     }
 
     /* Now, actually do the two-way mapping */
-    if ( mfn_valid(mfn) )
+    rc = p2m_set_entry(p2m, gfn, mfn, page_order, t, p2m->default_access);
+    if ( rc == 0 )
     {
-        rc = p2m_set_entry(p2m, gfn, mfn, page_order, t,
-                           p2m->default_access);
-        if ( rc )
-            goto out; /* Failed to update p2m, bail without updating m2p. */
+        pod_lock(p2m);
+        p2m->pod.entry_count -= pod_count;
+        BUG_ON(p2m->pod.entry_count < 0);
+        pod_unlock(p2m);
 
         if ( !p2m_is_grant(t) )
         {
@@ -908,22 +915,7 @@ guest_physmap_add_entry(struct domain *d
                                   gfn_x(gfn_add(gfn, i)));
         }
     }
-    else
-    {
-        gdprintk(XENLOG_WARNING, "Adding bad mfn to p2m map (%#lx -> %#lx)\n",
-                 gfn_x(gfn), mfn_x(mfn));
-        rc = p2m_set_entry(p2m, gfn, INVALID_MFN, page_order,
-                           p2m_invalid, p2m->default_access);
-        if ( rc == 0 )
-        {
-            pod_lock(p2m);
-            p2m->pod.entry_count -= pod_count;
-            BUG_ON(p2m->pod.entry_count < 0);
-            pod_unlock(p2m);
-        }
-    }
 
-out:
     p2m_unlock(p2m);
 
     return rc;

[-- Attachment #7: xsa378/xsa378-4.11-0b.patch --]
[-- Type: application/octet-stream, Size: 2039 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: don't ignore p2m_remove_page()'s return value

It's not very nice to return from guest_physmap_add_entry() after
perhaps already having made some changes to the P2M, but this is pre-
existing practice in the function, and imo better than ignoring errors.

Take the liberty and replace an mfn_add() instance with a local variable
already holding the result (as proven by the check immediately ahead).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -702,8 +702,7 @@ void p2m_final_teardown(struct domain *d
     p2m_teardown_hostp2m(d);
 }
 
-
-static int
+static int __must_check
 p2m_remove_page(struct p2m_domain *p2m, unsigned long gfn_l, unsigned long mfn,
                 unsigned int page_order)
 {
@@ -892,9 +891,9 @@ guest_physmap_add_entry(struct domain *d
                 ASSERT(mfn_valid(omfn));
                 P2M_DEBUG("old gfn=%#lx -> mfn %#lx\n",
                           gfn_x(ogfn) , mfn_x(omfn));
-                if ( mfn_eq(omfn, mfn_add(mfn, i)) )
-                    p2m_remove_page(p2m, gfn_x(ogfn), mfn_x(mfn_add(mfn, i)),
-                                    0);
+                if ( mfn_eq(omfn, mfn_add(mfn, i)) &&
+                     (rc = p2m_remove_page(p2m, gfn_x(ogfn), mfn_x(omfn), 0)) )
+                    goto out;
             }
         }
     }
@@ -916,6 +915,7 @@ guest_physmap_add_entry(struct domain *d
         }
     }
 
+ out:
     p2m_unlock(p2m);
 
     return rc;
@@ -2385,9 +2385,9 @@ int p2m_change_altp2m_gfn(struct domain
 
     if ( gfn_eq(new_gfn, INVALID_GFN) )
     {
-        if ( mfn_valid(mfn) )
-            p2m_remove_page(ap2m, gfn_x(old_gfn), mfn_x(mfn), PAGE_ORDER_4K);
-        rc = 0;
+        rc = mfn_valid(mfn)
+             ? p2m_remove_page(ap2m, gfn_x(old_gfn), mfn_x(mfn), PAGE_ORDER_4K)
+             : 0;
         goto out;
     }
 

[-- Attachment #8: xsa378/xsa378-4.11-0c.patch --]
[-- Type: application/octet-stream, Size: 2339 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: don't assert that the passed in MFN matches for a remove

guest_physmap_remove_page() gets handed an MFN from the outside, yet
takes the necessary lock to prevent further changes to the GFN <-> MFN
mapping itself. While some callers, in particular guest_remove_page()
(by way of having called get_gfn_query()), hold the GFN lock already,
various others (most notably perhaps the 2nd instance in
xenmem_add_to_physmap_one()) don't. While it also is an option to fix
all the callers, deal with the issue in p2m_remove_page() instead:
Replace the ASSERT() by a conditional and split the loop into two, such
that all checking gets done before any modification would occur.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
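
The following hedged standalone sketch (not part of the patch proper)
shows the check-then-modify pattern this change applies: every entry
is validated before any of them is altered, so an error return leaves
the state untouched:

  #include <assert.h>
  #include <stddef.h>

  static int update_all(int *vals, size_t n)
  {
      size_t i;

      /* First pass: only checking, nothing is modified yet. */
      for ( i = 0; i < n; i++ )
          if ( vals[i] < 0 )
              return -1;

      /* Second pass: all checks passed, safe to modify. */
      for ( i = 0; i < n; i++ )
          vals[i] += 1;

      return 0;
  }

  int main(void)
  {
      int good[] = { 1, 2, 3 }, bad[] = { 1, -2, 3 };

      assert(update_all(good, 3) == 0 && good[0] == 2);
      assert(update_all(bad, 3) == -1 && bad[0] == 1); /* unchanged */
      return 0;
  }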

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -708,7 +708,6 @@ p2m_remove_page(struct p2m_domain *p2m,
 {
     unsigned long i;
     gfn_t gfn = _gfn(gfn_l);
-    mfn_t mfn_return;
     p2m_type_t t;
     p2m_access_t a;
 
@@ -719,15 +718,26 @@ p2m_remove_page(struct p2m_domain *p2m,
     ASSERT(gfn_locked_by_me(p2m, gfn));
     P2M_DEBUG("removing gfn=%#lx mfn=%#lx\n", gfn_l, mfn);
 
+    for ( i = 0; i < (1UL << page_order); )
+    {
+        unsigned int cur_order;
+        mfn_t mfn_return = p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0,
+                                          &cur_order, NULL);
+
+        if ( p2m_is_valid(t) &&
+             (!mfn_valid(_mfn(mfn)) || mfn + i != mfn_x(mfn_return)) )
+            return -EILSEQ;
+
+        i += (1UL << cur_order) - ((gfn_l + i) & ((1UL << cur_order) - 1));
+    }
+
     if ( mfn_valid(_mfn(mfn)) )
     {
         for ( i = 0; i < (1UL << page_order); i++ )
         {
-            mfn_return = p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0,
-                                        NULL, NULL);
+            p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, NULL, NULL);
             if ( !p2m_is_grant(t) && !p2m_is_shared(t) && !p2m_is_foreign(t) )
                 set_gpfn_from_mfn(mfn+i, INVALID_M2P_ENTRY);
-            ASSERT( !p2m_is_valid(t) || mfn + i == mfn_x(mfn_return) );
         }
     }
     return p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, p2m_invalid,

[-- Attachment #9: xsa378/xsa378-4.11-1.patch --]
[-- Type: application/octet-stream, Size: 5112 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: correct global exclusion range extending

Besides unity mapping regions, the AMD IOMMU spec also provides for
exclusion ranges (areas of memory not to be subject to DMA translation)
to be specified by firmware in the ACPI tables. The spec does not put
any constraints on the number of such regions.

Blindly assuming all addresses between any two such ranges should also
be excluded can't be right. Since hardware has room for just a single
such range (comprised of the Exclusion Base Register and the Exclusion
Range Limit Register), combine only adjacent or overlapping regions (for
now; this may require further adjustment in case table entries aren't
sorted by address) with matching exclusion_allow_all settings. This
requires bubbling up error indicators, such that IOMMU init can be
failed when concatenation wasn't possible.

Furthermore, since the exclusion range specified in IOMMU registers
implies R/W access, reject requests asking for less permissions (this
will be brought closer to the spec by a subsequent change).

This is part of XSA-378 / CVE-2021-28695.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -98,12 +98,21 @@ static struct amd_iommu * __init find_io
     return NULL;
 }
 
-static void __init reserve_iommu_exclusion_range(
-    struct amd_iommu *iommu, uint64_t base, uint64_t limit)
+static int __init reserve_iommu_exclusion_range(
+    struct amd_iommu *iommu, uint64_t base, uint64_t limit,
+    bool all, bool iw, bool ir)
 {
+    if ( !ir || !iw )
+        return -EPERM;
+
     /* need to extend exclusion range? */
     if ( iommu->exclusion_enable )
     {
+        if ( iommu->exclusion_limit + PAGE_SIZE < base ||
+             limit + PAGE_SIZE < iommu->exclusion_base ||
+             iommu->exclusion_allow_all != all )
+            return -EBUSY;
+
         if ( iommu->exclusion_base < base )
             base = iommu->exclusion_base;
         if ( iommu->exclusion_limit > limit )
@@ -111,16 +120,11 @@ static void __init reserve_iommu_exclusi
     }
 
     iommu->exclusion_enable = IOMMU_CONTROL_ENABLED;
+    iommu->exclusion_allow_all = all;
     iommu->exclusion_base = base;
     iommu->exclusion_limit = limit;
-}
 
-static void __init reserve_iommu_exclusion_range_all(
-    struct amd_iommu *iommu,
-    unsigned long base, unsigned long limit)
-{
-    reserve_iommu_exclusion_range(iommu, base, limit);
-    iommu->exclusion_allow_all = IOMMU_CONTROL_ENABLED;
+    return 0;
 }
 
 static void __init reserve_unity_map_for_device(
@@ -158,6 +162,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     unsigned int bdf;
+    int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
@@ -179,10 +184,15 @@ static int __init register_exclusion_ran
     if ( limit >= iommu_top )
     {
         for_each_amd_iommu( iommu )
-            reserve_iommu_exclusion_range_all(iommu, base, limit);
+        {
+            rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                               true /* all */, iw, ir);
+            if ( rc )
+                break;
+        }
     }
 
-    return 0;
+    return rc;
 }
 
 static int __init register_exclusion_range_for_device(
@@ -193,6 +203,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     u16 req;
+    int rc = 0;
 
     iommu = find_iommu_for_device(seg, bdf);
     if ( !iommu )
@@ -222,12 +233,13 @@ static int __init register_exclusion_ran
     /* register IOMMU exclusion range settings for device */
     if ( limit >= iommu_top  )
     {
-        reserve_iommu_exclusion_range(iommu, base, limit);
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           false /* all */, iw, ir);
         ivrs_mappings[bdf].dte_allow_exclusion = IOMMU_CONTROL_ENABLED;
         ivrs_mappings[req].dte_allow_exclusion = IOMMU_CONTROL_ENABLED;
     }
 
-    return 0;
+    return rc;
 }
 
 static int __init register_exclusion_range_for_iommu_devices(
@@ -237,6 +249,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     unsigned int bdf;
     u16 req;
+    int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
@@ -267,8 +280,10 @@ static int __init register_exclusion_ran
 
     /* register IOMMU exclusion range settings */
     if ( limit >= iommu_top )
-        reserve_iommu_exclusion_range_all(iommu, base, limit);
-    return 0;
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           true /* all */, iw, ir);
+
+    return rc;
 }
 
 static int __init parse_ivmd_device_select(

[-- Attachment #10: xsa378/xsa378-4.11-2.patch --]
[-- Type: application/octet-stream, Size: 8863 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: correct device unity map handling

Blindly assuming all addresses between any two such ranges, specified by
firmware in the ACPI tables, should also be unity-mapped can't be right.
Nor can it be correct to merge ranges with differing permissions. Track
ranges individually; don't merge at all, but check for overlaps instead.
This requires bubbling up error indicators, such that IOMMU init can be
failed when allocation of a new tracking struct wasn't possible, or an
overlap was detected.

At this occasion also stop ignoring
amd_iommu_reserve_domain_unity_map()'s return value.

This is part of XSA-378 / CVE-2021-28695.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -127,32 +127,48 @@ static int __init reserve_iommu_exclusio
     return 0;
 }
 
-static void __init reserve_unity_map_for_device(
-    u16 seg, u16 bdf, unsigned long base,
-    unsigned long length, u8 iw, u8 ir)
+static int __init reserve_unity_map_for_device(
+    uint16_t seg, uint16_t bdf, unsigned long base,
+    unsigned long length, bool iw, bool ir)
 {
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
-    unsigned long old_top, new_top;
+    struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map;
 
-    /* need to extend unity-mapped range? */
-    if ( ivrs_mappings[bdf].unity_map_enable )
+    /* Check for overlaps. */
+    for ( ; unity_map; unity_map = unity_map->next )
     {
-        old_top = ivrs_mappings[bdf].addr_range_start +
-            ivrs_mappings[bdf].addr_range_length;
-        new_top = base + length;
-        if ( old_top > new_top )
-            new_top = old_top;
-        if ( ivrs_mappings[bdf].addr_range_start < base )
-            base = ivrs_mappings[bdf].addr_range_start;
-        length = new_top - base;
-    }
-
-    /* extend r/w permissioms and keep aggregate */
-    ivrs_mappings[bdf].write_permission = iw;
-    ivrs_mappings[bdf].read_permission = ir;
-    ivrs_mappings[bdf].unity_map_enable = IOMMU_CONTROL_ENABLED;
-    ivrs_mappings[bdf].addr_range_start = base;
-    ivrs_mappings[bdf].addr_range_length = length;
+        /*
+         * Exact matches are okay. This can in particular happen when
+         * register_exclusion_range_for_device() calls here twice for the
+         * same (s,b,d,f).
+         */
+        if ( base == unity_map->addr && length == unity_map->length &&
+             ir == unity_map->read && iw == unity_map->write )
+            return 0;
+
+        if ( unity_map->addr + unity_map->length > base &&
+             base + length > unity_map->addr )
+        {
+            AMD_IOMMU_DEBUG("IVMD Error: overlap [%lx,%lx) vs [%lx,%lx)\n",
+                            base, base + length, unity_map->addr,
+                            unity_map->addr + unity_map->length);
+            return -EPERM;
+        }
+    }
+
+    /* Populate and insert a new unity map. */
+    unity_map = xmalloc(struct ivrs_unity_map);
+    if ( !unity_map )
+        return -ENOMEM;
+
+    unity_map->read = ir;
+    unity_map->write = iw;
+    unity_map->addr = base;
+    unity_map->length = length;
+    unity_map->next = ivrs_mappings[bdf].unity_map;
+    ivrs_mappings[bdf].unity_map = unity_map;
+
+    return 0;
 }
 
 static int __init register_exclusion_range_for_all_devices(
@@ -175,13 +191,13 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve r/w unity-mapped page entries for devices */
         /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
-            reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
+        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
         /* push 'base' just outside of virtual address space */
         base = iommu_top;
     }
     /* register IOMMU exclusion range settings */
-    if ( limit >= iommu_top )
+    if ( !rc && limit >= iommu_top )
     {
         for_each_amd_iommu( iommu )
         {
@@ -223,15 +239,15 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve unity-mapped page entries for device */
         /* note: these entries are part of the exclusion range */
-        reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
-        reserve_unity_map_for_device(seg, req, base, length, iw, ir);
+        rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?:
+             reserve_unity_map_for_device(seg, req, base, length, iw, ir);
 
         /* push 'base' just outside of virtual address space */
         base = iommu_top;
     }
 
     /* register IOMMU exclusion range settings for device */
-    if ( limit >= iommu_top  )
+    if ( !rc && limit >= iommu_top  )
     {
         rc = reserve_iommu_exclusion_range(iommu, base, limit,
                                            false /* all */, iw, ir);
@@ -262,15 +278,15 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve r/w unity-mapped page entries for devices */
         /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
+        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
         {
             if ( iommu == find_iommu_for_device(iommu->seg, bdf) )
             {
-                reserve_unity_map_for_device(iommu->seg, bdf, base, length,
-                                             iw, ir);
                 req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
-                reserve_unity_map_for_device(iommu->seg, req, base, length,
-                                             iw, ir);
+                rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
+                                                  iw, ir) ?:
+                     reserve_unity_map_for_device(iommu->seg, req, base, length,
+                                                  iw, ir);
             }
         }
 
@@ -279,7 +295,7 @@ static int __init register_exclusion_ran
     }
 
     /* register IOMMU exclusion range settings */
-    if ( limit >= iommu_top )
+    if ( !rc && limit >= iommu_top )
         rc = reserve_iommu_exclusion_range(iommu, base, limit,
                                            true /* all */, iw, ir);
 
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -1187,7 +1187,6 @@ static int __init alloc_ivrs_mappings(u1
     {
         ivrs_mappings[bdf].dte_requestor_id = bdf;
         ivrs_mappings[bdf].dte_allow_exclusion = IOMMU_CONTROL_DISABLED;
-        ivrs_mappings[bdf].unity_map_enable = IOMMU_CONTROL_DISABLED;
         ivrs_mappings[bdf].iommu = NULL;
 
         ivrs_mappings[bdf].intremap_table = NULL;
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -372,15 +372,17 @@ static int amd_iommu_assign_device(struc
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int bdf = PCI_BDF2(pdev->bus, devfn);
     int req_id = get_dma_requestor_id(pdev->seg, bdf);
+    const struct ivrs_unity_map *unity_map;
 
-    if ( ivrs_mappings[req_id].unity_map_enable )
+    for ( unity_map = ivrs_mappings[req_id].unity_map; unity_map;
+          unity_map = unity_map->next )
     {
-        amd_iommu_reserve_domain_unity_map(
-            d,
-            ivrs_mappings[req_id].addr_range_start,
-            ivrs_mappings[req_id].addr_range_length,
-            ivrs_mappings[req_id].write_permission,
-            ivrs_mappings[req_id].read_permission);
+        int rc = amd_iommu_reserve_domain_unity_map(
+                     d, unity_map->addr, unity_map->length,
+                     unity_map->write, unity_map->read);
+
+        if ( rc )
+            return rc;
     }
 
     return reassign_device(pdev->domain, d, devfn, pdev);
--- a/xen/include/asm-x86/amd-iommu.h
+++ b/xen/include/asm-x86/amd-iommu.h
@@ -108,15 +108,19 @@ struct amd_iommu {
     struct list_head ats_devices;
 };
 
+struct ivrs_unity_map {
+    bool read:1;
+    bool write:1;
+    paddr_t addr;
+    unsigned long length;
+    struct ivrs_unity_map *next;
+};
+
 struct ivrs_mappings {
     u16 dte_requestor_id;
     u8 dte_allow_exclusion;
-    u8 unity_map_enable;
-    u8 write_permission;
-    u8 read_permission;
-    unsigned long addr_range_start;
-    unsigned long addr_range_length;
     struct amd_iommu *iommu;
+    struct ivrs_unity_map *unity_map;
 
     /* per device interrupt remapping table */
     void *intremap_table;

[-- Attachment #11: xsa378/xsa378-4.11-3.patch --]
[-- Type: application/octet-stream, Size: 4138 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU: also pass p2m_access_t to p2m_get_iommu_flags()

A subsequent change will want to customize the IOMMU permissions based
on this.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -711,7 +711,7 @@ ept_set_entry(struct p2m_domain *p2m, gf
     uint8_t ipat = 0;
     bool_t need_modify_vtd_table = 1;
     bool_t vtd_pte_present = 0;
-    unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, mfn);
+    unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, p2ma, mfn);
     bool_t needs_sync = 1;
     ept_entry_t old_entry = { .epte = 0 };
     ept_entry_t new_entry = { .epte = 0 };
@@ -837,8 +837,8 @@ ept_set_entry(struct p2m_domain *p2m, gf
 
         /* Safe to read-then-write because we hold the p2m lock */
         if ( ept_entry->mfn == new_entry.mfn &&
-             p2m_get_iommu_flags(ept_entry->sa_p2mt, _mfn(ept_entry->mfn)) ==
-             iommu_flags )
+             p2m_get_iommu_flags(ept_entry->sa_p2mt, ept_entry->access,
+                                 _mfn(ept_entry->mfn)) == iommu_flags )
             need_modify_vtd_table = 0;
 
         ept_p2m_type_to_flags(p2m, &new_entry, p2mt, p2ma);
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -471,6 +471,16 @@ int p2m_pt_handle_deferred_changes(uint6
     return rc;
 }
 
+/* Reconstruct a fake p2m_access_t from stored PTE flags. */
+static p2m_access_t p2m_flags_to_access(unsigned int flags)
+{
+    if ( !(flags & _PAGE_PRESENT) )
+        return p2m_access_n;
+
+    /* No need to look at _PAGE_NX for now. */
+    return flags & _PAGE_RW ? p2m_access_rw : p2m_access_r;
+}
+
 /* Returns: 0 for success, -errno for failure */
 static int
 p2m_pt_set_entry(struct p2m_domain *p2m, gfn_t gfn_, mfn_t mfn,
@@ -487,7 +497,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
     l2_pgentry_t l2e_content;
     l3_pgentry_t l3e_content;
     int rc;
-    unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, mfn);
+    unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, p2ma, mfn);
     /*
      * old_mfn and iommu_old_flags control possible flush/update needs on the
      * IOMMU: We need to flush when MFN or flags (i.e. permissions) change.
@@ -556,6 +566,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                 old_mfn = l1e_get_pfn(*p2m_entry);
                 iommu_old_flags =
                     p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                        p2m_flags_to_access(flags),
                                         _mfn(old_mfn));
             }
             else
@@ -602,9 +613,10 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                                    0, L1_PAGETABLE_ENTRIES);
         ASSERT(p2m_entry);
         old_mfn = l1e_get_pfn(*p2m_entry);
+        flags = l1e_get_flags(*p2m_entry);
         iommu_old_flags =
-            p2m_get_iommu_flags(p2m_flags_to_type(l1e_get_flags(*p2m_entry)),
-                                _mfn(old_mfn));
+            p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                p2m_flags_to_access(flags), _mfn(old_mfn));
 
         if ( mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt) )
             entry_content = p2m_l1e_from_pfn(mfn_x(mfn),
@@ -648,6 +660,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                 old_mfn = l1e_get_pfn(*p2m_entry);
                 iommu_old_flags =
                     p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                        p2m_flags_to_access(flags),
                                         _mfn(old_mfn));
             }
             else
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -839,7 +839,8 @@ int p2m_altp2m_propagate_change(struct d
 /*
  * p2m type to IOMMU flags
  */
-static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
+static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt,
+                                               p2m_access_t p2ma, mfn_t mfn)
 {
     unsigned int flags;
 

[-- Attachment #12: xsa378/xsa378-4.11-4.patch --]
[-- Type: application/octet-stream, Size: 12488 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU: generalize VT-d's tracking of mapped RMRR regions

In order to re-use it elsewhere, move the logic to vendor independent
code and strip it of RMRR specifics.

Note that the prior "map" parameter gets folded into the new "p2ma" one
(which AMD IOMMU code will want to make use of), assigning alternative
meaning ("unmap") to p2m_access_x. Prepare set_identity_p2m_entry() and
p2m_get_iommu_flags() for getting passed access types other than
p2m_access_rw (in the latter case just for p2m_mmio_direct requests).

Note also that, to be on the safe side, an overlap check gets added to
the main loop of iommu_identity_mapping().

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
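
For illustration, the refcounted tracking this patch introduces can be
modeled in standalone C. The sketch below uses invented names and plain
malloc/free rather than Xen's allocators: an exact base/end match shares
the existing mapping by bumping a use count, any partial overlap fails
with -EADDRINUSE, and the region is only torn down once the last user
drops its reference.

    #include <errno.h>
    #include <stdlib.h>

    struct ident_map {
        unsigned long base, end;   /* inclusive range */
        unsigned int count;        /* number of users of this range */
        struct ident_map *next;
    };

    /* Take a reference on [base, end], creating it if not yet tracked. */
    static int ident_map_get(struct ident_map **head,
                             unsigned long base, unsigned long end)
    {
        struct ident_map *m;

        for ( m = *head; m; m = m->next )
        {
            if ( m->base == base && m->end == end )
            {
                ++m->count;          /* exact match: share the mapping */
                return 0;
            }
            if ( end >= m->base && m->end >= base )
                return -EADDRINUSE;  /* partial overlaps are rejected */
        }

        if ( !(m = malloc(sizeof(*m))) )
            return -ENOMEM;
        m->base = base;
        m->end = end;
        m->count = 1;
        m->next = *head;
        *head = m;

        return 0;
    }

    /* Drop a reference; only the last user tears the mapping down. */
    static int ident_map_put(struct ident_map **head,
                             unsigned long base, unsigned long end)
    {
        struct ident_map **pm, *m;

        for ( pm = head; (m = *pm); pm = &m->next )
            if ( m->base == base && m->end == end )
            {
                if ( --m->count )
                    return 0;
                *pm = m->next;       /* unlink and free on last put */
                free(m);
                return 0;
            }

        return -ENOENT;
    }

    int main(void)
    {
        struct ident_map *maps = NULL;

        ident_map_get(&maps, 0x1000, 0x1fff);
        ident_map_get(&maps, 0x1000, 0x1fff);        /* shared: count == 2 */
        ident_map_put(&maps, 0x1000, 0x1fff);        /* still mapped */
        return ident_map_put(&maps, 0x1000, 0x1fff); /* freed now */
    }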

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1157,7 +1157,8 @@ int set_identity_p2m_entry(struct domain
     {
         if ( !need_iommu(d) )
             return 0;
-        return iommu_map_page(d, gfn_l, gfn_l, IOMMUF_readable|IOMMUF_writable);
+        return iommu_map_page(d, gfn_l, gfn_l,
+                              p2m_access_to_iommu_flags(p2ma));
     }
 
     gfn_lock(p2m, gfn, 0);
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -42,12 +42,6 @@
 #include "vtd.h"
 #include "../ats.h"
 
-struct mapped_rmrr {
-    struct list_head list;
-    u64 base, end;
-    unsigned int count;
-};
-
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
 
@@ -1785,16 +1779,11 @@ out:
 static void iommu_domain_teardown(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    struct mapped_rmrr *mrmrr, *tmp;
 
     if ( list_empty(&acpi_drhd_units) )
         return;
 
-    list_for_each_entry_safe ( mrmrr, tmp, &hd->arch.mapped_rmrrs, list )
-    {
-        list_del(&mrmrr->list);
-        xfree(mrmrr);
-    }
+    iommu_identity_map_teardown(d);
 
     if ( iommu_use_hap_pt(d) )
         return;
@@ -1903,74 +1892,6 @@ static void iommu_set_pgd(struct domain
         pagetable_get_paddr(pagetable_from_mfn(pgd_mfn));
 }
 
-static int rmrr_identity_mapping(struct domain *d, bool_t map,
-                                 const struct acpi_rmrr_unit *rmrr,
-                                 u32 flag)
-{
-    unsigned long base_pfn = rmrr->base_address >> PAGE_SHIFT_4K;
-    unsigned long end_pfn = PAGE_ALIGN_4K(rmrr->end_address) >> PAGE_SHIFT_4K;
-    struct mapped_rmrr *mrmrr;
-    struct domain_iommu *hd = dom_iommu(d);
-
-    ASSERT(pcidevs_locked());
-    ASSERT(rmrr->base_address < rmrr->end_address);
-
-    /*
-     * No need to acquire hd->arch.mapping_lock: Both insertion and removal
-     * get done while holding pcidevs_lock.
-     */
-    list_for_each_entry( mrmrr, &hd->arch.mapped_rmrrs, list )
-    {
-        if ( mrmrr->base == rmrr->base_address &&
-             mrmrr->end == rmrr->end_address )
-        {
-            int ret = 0;
-
-            if ( map )
-            {
-                ++mrmrr->count;
-                return 0;
-            }
-
-            if ( --mrmrr->count )
-                return 0;
-
-            while ( base_pfn < end_pfn )
-            {
-                if ( clear_identity_p2m_entry(d, base_pfn) )
-                    ret = -ENXIO;
-                base_pfn++;
-            }
-
-            list_del(&mrmrr->list);
-            xfree(mrmrr);
-            return ret;
-        }
-    }
-
-    if ( !map )
-        return -ENOENT;
-
-    while ( base_pfn < end_pfn )
-    {
-        int err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw, flag);
-
-        if ( err )
-            return err;
-        base_pfn++;
-    }
-
-    mrmrr = xmalloc(struct mapped_rmrr);
-    if ( !mrmrr )
-        return -ENOMEM;
-    mrmrr->base = rmrr->base_address;
-    mrmrr->end = rmrr->end_address;
-    mrmrr->count = 1;
-    list_add_tail(&mrmrr->list, &hd->arch.mapped_rmrrs);
-
-    return 0;
-}
-
 static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
 {
     struct acpi_rmrr_unit *rmrr;
@@ -2002,7 +1923,9 @@ static int intel_iommu_add_device(u8 dev
              * Since RMRRs are always reserved in the e820 map for the hardware
              * domain, there shouldn't be a conflict.
              */
-            ret = rmrr_identity_mapping(pdev->domain, 1, rmrr, 0);
+            ret = iommu_identity_mapping(pdev->domain, p2m_access_rw,
+                                         rmrr->base_address, rmrr->end_address,
+                                         0);
             if ( ret )
                 dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
                         pdev->domain->domain_id);
@@ -2047,7 +1970,8 @@ static int intel_iommu_remove_device(u8
          * Any flag is nothing to clear these mappings but here
          * its always safe and strict to set 0.
          */
-        rmrr_identity_mapping(pdev->domain, 0, rmrr, 0);
+        iommu_identity_mapping(pdev->domain, p2m_access_x, rmrr->base_address,
+                               rmrr->end_address, 0);
     }
 
     return domain_context_unmap(pdev->domain, devfn, pdev);
@@ -2214,7 +2138,8 @@ static void __hwdom_init setup_hwdom_rmr
          * domain, there shouldn't be a conflict. So its always safe and
          * strict to set 0.
          */
-        ret = rmrr_identity_mapping(d, 1, rmrr, 0);
+        ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address,
+                                     rmrr->end_address, 0);
         if ( ret )
             dprintk(XENLOG_ERR VTDPREFIX,
                      "IOMMU: mapping reserved region failed\n");
@@ -2371,7 +2296,9 @@ static int reassign_device_ownership(
                  * Any RMRR flag is always ignored when remove a device,
                  * but its always safe and strict to set 0.
                  */
-                ret = rmrr_identity_mapping(source, 0, rmrr, 0);
+                ret = iommu_identity_mapping(source, p2m_access_x,
+                                             rmrr->base_address,
+                                             rmrr->end_address, 0);
                 if ( ret != -ENOENT )
                     return ret;
             }
@@ -2468,7 +2395,8 @@ static int intel_iommu_assign_device(
              PCI_BUS(bdf) == bus &&
              PCI_DEVFN2(bdf) == devfn )
         {
-            ret = rmrr_identity_mapping(d, 1, rmrr, flag);
+            ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address,
+                                         rmrr->end_address, flag);
             if ( ret )
             {
                 int rc;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -144,7 +144,7 @@ int arch_iommu_domain_init(struct domain
     struct domain_iommu *hd = dom_iommu(d);
 
     spin_lock_init(&hd->arch.mapping_lock);
-    INIT_LIST_HEAD(&hd->arch.mapped_rmrrs);
+    INIT_LIST_HEAD(&hd->arch.identity_maps);
 
     return 0;
 }
@@ -153,6 +153,99 @@ void arch_iommu_domain_destroy(struct do
 {
 }
 
+struct identity_map {
+    struct list_head list;
+    paddr_t base, end;
+    p2m_access_t access;
+    unsigned int count;
+};
+
+int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
+                           paddr_t base, paddr_t end,
+                           unsigned int flag)
+{
+    unsigned long base_pfn = base >> PAGE_SHIFT_4K;
+    unsigned long end_pfn = PAGE_ALIGN_4K(end) >> PAGE_SHIFT_4K;
+    struct identity_map *map;
+    struct domain_iommu *hd = dom_iommu(d);
+
+    ASSERT(pcidevs_locked());
+    ASSERT(base < end);
+
+    /*
+     * No need to acquire hd->arch.mapping_lock: Both insertion and removal
+     * get done while holding pcidevs_lock.
+     */
+    list_for_each_entry( map, &hd->arch.identity_maps, list )
+    {
+        if ( map->base == base && map->end == end )
+        {
+            int ret = 0;
+
+            if ( p2ma != p2m_access_x )
+            {
+                if ( map->access != p2ma )
+                    return -EADDRINUSE;
+                ++map->count;
+                return 0;
+            }
+
+            if ( --map->count )
+                return 0;
+
+            while ( base_pfn < end_pfn )
+            {
+                if ( clear_identity_p2m_entry(d, base_pfn) )
+                    ret = -ENXIO;
+                base_pfn++;
+            }
+
+            list_del(&map->list);
+            xfree(map);
+
+            return ret;
+        }
+
+        if ( end >= map->base && map->end >= base )
+            return -EADDRINUSE;
+    }
+
+    if ( p2ma == p2m_access_x )
+        return -ENOENT;
+
+    while ( base_pfn < end_pfn )
+    {
+        int err = set_identity_p2m_entry(d, base_pfn, p2ma, flag);
+
+        if ( err )
+            return err;
+        base_pfn++;
+    }
+
+    map = xmalloc(struct identity_map);
+    if ( !map )
+        return -ENOMEM;
+    map->base = base;
+    map->end = end;
+    map->access = p2ma;
+    map->count = 1;
+    list_add_tail(&map->list, &hd->arch.identity_maps);
+
+    return 0;
+}
+
+void iommu_identity_map_teardown(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    struct identity_map *map, *tmp;
+
+    list_for_each_entry_safe ( map, tmp, &hd->arch.identity_maps, list )
+    {
+        list_del(&map->list);
+        xfree(map);
+    }
+}
+
 /*
  * Local variables:
  * mode: C
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -16,6 +16,7 @@
 
 #include <xen/errno.h>
 #include <xen/list.h>
+#include <xen/mem_access.h>
 #include <xen/spinlock.h>
 #include <asm/processor.h>
 #include <asm/hvm/vmx/vmcs.h>
@@ -36,7 +37,7 @@ struct arch_iommu
     spinlock_t mapping_lock;            /* io page table lock */
     int agaw;     /* adjusted guest address width, 0 is level 2 30-bit */
     u64 iommu_bitmap;              /* bitmap of iommu(s) that the domain uses */
-    struct list_head mapped_rmrrs;
+    struct list_head identity_maps;
 
     /* amd iommu support */
     int paging_mode;
@@ -94,6 +95,11 @@ bool_t iommu_supports_eim(void);
 int iommu_enable_x2apic_IR(void);
 void iommu_disable_x2apic_IR(void);
 
+int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
+                           paddr_t base, paddr_t end,
+                           unsigned int flag);
+void iommu_identity_map_teardown(struct domain *d);
+
 extern bool untrusted_msi;
 
 int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
--- a/xen/include/asm-x86/mem_access.h
+++ b/xen/include/asm-x86/mem_access.h
@@ -44,10 +44,8 @@ bool p2m_mem_access_emulate_check(struct
                                   const vm_event_response_t *rsp);
 
 /* Sanity check for mem_access hardware support */
-static inline bool p2m_mem_access_sanity_check(struct domain *d)
-{
-    return is_hvm_domain(d) && cpu_has_vmx && hap_enabled(d);
-}
+#define p2m_mem_access_sanity_check(d) \
+    (is_hvm_domain(d) && cpu_has_vmx && hap_enabled(d))
 
 #endif /*__ASM_X86_MEM_ACCESS_H__ */
 
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -836,6 +836,34 @@ int p2m_altp2m_propagate_change(struct d
                                 mfn_t mfn, unsigned int page_order,
                                 p2m_type_t p2mt, p2m_access_t p2ma);
 
+/* p2m access to IOMMU flags */
+static inline unsigned int p2m_access_to_iommu_flags(p2m_access_t p2ma)
+{
+    switch ( p2ma )
+    {
+    case p2m_access_rw:
+    case p2m_access_rwx:
+        return IOMMUF_readable | IOMMUF_writable;
+
+    case p2m_access_r:
+    case p2m_access_rx:
+    case p2m_access_rx2rw:
+        return IOMMUF_readable;
+
+    case p2m_access_w:
+    case p2m_access_wx:
+        return IOMMUF_writable;
+
+    case p2m_access_n:
+    case p2m_access_x:
+    case p2m_access_n2rwx:
+        return 0;
+    }
+
+    ASSERT_UNREACHABLE();
+    return 0;
+}
+
 /*
  * p2m type to IOMMU flags
  */
@@ -857,9 +885,10 @@ static inline unsigned int p2m_get_iommu
         flags = IOMMUF_readable;
         break;
     case p2m_mmio_direct:
-        flags = IOMMUF_readable;
-        if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
-            flags |= IOMMUF_writable;
+        flags = p2m_access_to_iommu_flags(p2ma);
+        if ( (flags & IOMMUF_writable) &&
+             rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
+            flags &= ~IOMMUF_writable;
         break;
     default:
         flags = 0;

[-- Attachment #13: xsa378/xsa378-4.11-5.patch --]
[-- Type: application/octet-stream, Size: 6969 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-arrange/complete re-assignment handling

Prior to the assignment step having completed successfully, devices
should not get associated with their new owner. Hand the device to DomIO
(perhaps temporarily), until after the de-assignment step has completed.

De-assignment of a device (from other than Dom0) as well as failure of
reassign_device() during assignment should result in unity mappings
getting torn down. This in turn requires switching to a refcounted
mapping approach, as was already used by VT-d for its RMRRs, to prevent
unmapping a region used by multiple devices.

This is CVE-2021-28696 / part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
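
One idiom in the new unmap helper is worth spelling out: on the teardown
path every range must still be visited after a failure, so the loop
records the first error and carries on instead of bailing out early. A
minimal standalone sketch of the pattern, with an invented stand-in for
the per-range unmap:

    #include <errno.h>
    #include <stdio.h>

    static int unmap_one(unsigned int i)
    {
        return i == 1 ? -ENXIO : 0;    /* simulate one failing range */
    }

    static int unmap_all(unsigned int nr)
    {
        unsigned int i;
        int rc = 0;

        for ( i = 0; i < nr; i++ )
        {
            int ret = unmap_one(i);

            /* -ENOENT ("never mapped") is not an error on this path. */
            if ( ret && ret != -ENOENT && !rc )
                rc = ret;
        }

        return rc;
    }

    int main(void)
    {
        printf("first error: %d\n", unmap_all(3)); /* -ENXIO; loop ran on */
        return 0;
    }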

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -716,27 +716,49 @@ int amd_iommu_unmap_page(struct domain *
     return 0;
 }
 
-int amd_iommu_reserve_domain_unity_map(struct domain *domain,
-                                       u64 phys_addr,
-                                       unsigned long size, int iw, int ir)
+int amd_iommu_reserve_domain_unity_map(struct domain *d,
+                                       const struct ivrs_unity_map *map,
+                                       unsigned int flag)
 {
-    unsigned long npages, i;
-    unsigned long gfn;
-    unsigned int flags = !!ir;
-    int rt = 0;
-
-    if ( iw )
-        flags |= IOMMUF_writable;
-
-    npages = region_to_pages(phys_addr, size);
-    gfn = phys_addr >> PAGE_SHIFT;
-    for ( i = 0; i < npages; i++ )
+    int rc;
+
+    if ( d == dom_io )
+        return 0;
+
+    for ( rc = 0; !rc && map; map = map->next )
     {
-        rt = amd_iommu_map_page(domain, gfn +i, gfn +i, flags);
-        if ( rt != 0 )
-            return rt;
+        p2m_access_t p2ma = p2m_access_n;
+
+        if ( map->read )
+            p2ma |= p2m_access_r;
+        if ( map->write )
+            p2ma |= p2m_access_w;
+
+        rc = iommu_identity_mapping(d, p2ma, map->addr,
+                                    map->addr + map->length - 1, flag);
     }
-    return 0;
+
+    return rc;
+}
+
+int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
+                                         const struct ivrs_unity_map *map)
+{
+    int rc;
+
+    if ( d == dom_io )
+        return 0;
+
+    for ( rc = 0; map; map = map->next )
+    {
+        int ret = iommu_identity_mapping(d, p2m_access_x, map->addr,
+                                         map->addr + map->length - 1, 0);
+
+        if ( ret && ret != -ENOENT && !rc )
+            rc = ret;
+    }
+
+    return rc;
 }
 
 /* Share p2m table with iommu. */
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -333,6 +333,7 @@ static int reassign_device(struct domain
     struct amd_iommu *iommu;
     int bdf, rc;
     struct domain_iommu *t = dom_iommu(target);
+    const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
 
     bdf = PCI_BDF2(pdev->bus, pdev->devfn);
     iommu = find_iommu_for_device(pdev->seg, bdf);
@@ -347,10 +348,24 @@ static int reassign_device(struct domain
 
     amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
 
-    if ( devfn == pdev->devfn )
+    /*
+     * If the device belongs to the hardware domain, and it has a unity mapping,
+     * don't remove it from the hardware domain, because BIOS may reference that
+     * mapping.
+     */
+    if ( !is_hardware_domain(source) )
+    {
+        rc = amd_iommu_reserve_domain_unity_unmap(
+                 source,
+                 ivrs_mappings[get_dma_requestor_id(pdev->seg, bdf)].unity_map);
+        if ( rc )
+            return rc;
+    }
+
+    if ( devfn == pdev->devfn && pdev->domain != dom_io )
     {
-        list_move(&pdev->domain_list, &target->arch.pdev_list);
-        pdev->domain = target;
+        list_move(&pdev->domain_list, &dom_io->arch.pdev_list);
+        pdev->domain = dom_io;
     }
 
     rc = allocate_domain_resources(t);
@@ -362,6 +377,12 @@ static int reassign_device(struct domain
                     pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                     source->domain_id, target->domain_id);
 
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->arch.pdev_list);
+        pdev->domain = target;
+    }
+
     return 0;
 }
 
@@ -372,20 +393,28 @@ static int amd_iommu_assign_device(struc
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int bdf = PCI_BDF2(pdev->bus, devfn);
     int req_id = get_dma_requestor_id(pdev->seg, bdf);
-    const struct ivrs_unity_map *unity_map;
+    int rc = amd_iommu_reserve_domain_unity_map(
+                 d, ivrs_mappings[req_id].unity_map, flag);
+
+    if ( !rc )
+        rc = reassign_device(pdev->domain, d, devfn, pdev);
 
-    for ( unity_map = ivrs_mappings[req_id].unity_map; unity_map;
-          unity_map = unity_map->next )
+    if ( rc && !is_hardware_domain(d) )
     {
-        int rc = amd_iommu_reserve_domain_unity_map(
-                     d, unity_map->addr, unity_map->length,
-                     unity_map->write, unity_map->read);
+        int ret = amd_iommu_reserve_domain_unity_unmap(
+                      d, ivrs_mappings[req_id].unity_map);
 
-        if ( rc )
-            return rc;
+        if ( ret )
+        {
+            printk(XENLOG_ERR "AMD-Vi: "
+                   "unity-unmap for d%d/%04x:%02x:%02x.%u failed (%d)\n",
+                   d->domain_id, pdev->seg, pdev->bus,
+                   PCI_SLOT(devfn), PCI_FUNC(devfn), ret);
+            domain_crash(d);
+        }
     }
 
-    return reassign_device(pdev->domain, d, devfn, pdev);
+    return rc;
 }
 
 static void deallocate_next_page_table(struct page_info *pg, int level)
@@ -451,6 +480,7 @@ static void deallocate_iommu_page_tables
 
 static void amd_iommu_domain_destroy(struct domain *d)
 {
+    iommu_identity_map_teardown(d);
     deallocate_iommu_page_tables(d);
     amd_iommu_flush_all_pages(d);
 }
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -60,8 +60,10 @@ int __must_check amd_iommu_unmap_page(st
 u64 amd_iommu_get_next_table_from_pte(u32 *entry);
 int __must_check amd_iommu_alloc_root(struct domain_iommu *hd);
 int amd_iommu_reserve_domain_unity_map(struct domain *domain,
-                                       u64 phys_addr, unsigned long size,
-                                       int iw, int ir);
+                                       const struct ivrs_unity_map *map,
+                                       unsigned int flag);
+int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
+                                         const struct ivrs_unity_map *map);
 
 /* Share p2m table with iommu */
 void amd_iommu_share_p2m(struct domain *d);

[-- Attachment #14: xsa378/xsa378-4.11-6.patch --]
[-- Type: application/octet-stream, Size: 15628 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-arrange exclusion range and unity map recording

The spec makes no provisions for OS behavior here to depend on the
amount of RAM found on the system. While the spec may not sufficiently
clearly distinguish both kinds of regions, they are surely meant to be
separate things: Only regions with ACPI_IVMD_EXCLUSION_RANGE set should
be candidates for putting in the exclusion range registers. (As there's
only a single such pair of registers per IOMMU, secondary non-adjacent
regions with the flag set already get converted to unity mapped
regions.)

First of all, drop the dependency on max_page. With commit b4f042236ae0
("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables") the
use of it here was stale anyway; it was bogus already before, as it
didn't account for max_page getting increased later on. Simply try an
exclusion range registration first, and if it fails (for being
unsuitable or non-mergeable), register a unity mapping range.

With this various local variables become unnecessary and hence get
dropped at the same time.

With the max_page boundary dropped for using unity maps, the minimum
page table tree height now needs both recording and enforcing in
amd_iommu_domain_init(). Since we can't predict which devices may get
assigned to a domain, our only option is to uniformly force at least
that height for all domains, now that the height isn't dynamic anymore.

Further don't make use of the exclusion range unless ACPI data says so.

Note that exclusion range registration in
register_range_for_all_devices() is on a best effort basis. Hence unity
map entries also registered are redundant when the former succeeded, but
they also do no harm. Improvements in this area can be done later imo.

Also adjust types where suitable without touching extra lines.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
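
To see why unity maps can force a taller page table tree, note that the
number of levels follows directly from the highest frame number that has
to be covered. The following standalone sketch is a simplified stand-in
for amd_iommu_get_paging_mode() (names and constants invented; 9 address
bits, i.e. 512 entries, per level, at most 6 levels):

    #include <errno.h>
    #include <stdio.h>

    #define PTES_PER_TABLE 512UL   /* 9 address bits per level */

    /* How many levels are needed to cover 'frames' 4k frames?  A unity
     * map at a high address can demand a taller tree than the domain's
     * RAM alone would. */
    static int paging_mode(unsigned long frames)
    {
        int level = 1;

        while ( frames > PTES_PER_TABLE )
        {
            frames = (frames + PTES_PER_TABLE - 1) / PTES_PER_TABLE;
            if ( ++level > 6 )
                return -ENOMEM;    /* beyond 6-level reach */
        }

        return level;
    }

    int main(void)
    {
        /* A unity map ending at 1TiB needs more levels than 4GiB of RAM. */
        printf("4GiB RAM: %d levels\n", paging_mode(1UL << (32 - 12)));
        printf("1TiB map: %d levels\n", paging_mode(1UL << (40 - 12)));
        return 0;
    }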

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -99,12 +99,8 @@ static struct amd_iommu * __init find_io
 }
 
 static int __init reserve_iommu_exclusion_range(
-    struct amd_iommu *iommu, uint64_t base, uint64_t limit,
-    bool all, bool iw, bool ir)
+    struct amd_iommu *iommu, paddr_t base, paddr_t limit, bool all)
 {
-    if ( !ir || !iw )
-        return -EPERM;
-
     /* need to extend exclusion range? */
     if ( iommu->exclusion_enable )
     {
@@ -133,14 +129,18 @@ static int __init reserve_unity_map_for_
 {
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
     struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map;
+    int paging_mode = amd_iommu_get_paging_mode(PFN_UP(base + length));
+
+    if ( paging_mode < 0 )
+        return paging_mode;
 
     /* Check for overlaps. */
     for ( ; unity_map; unity_map = unity_map->next )
     {
         /*
          * Exact matches are okay. This can in particular happen when
-         * register_exclusion_range_for_device() calls here twice for the
-         * same (s,b,d,f).
+         * register_range_for_device() calls here twice for the same
+         * (s,b,d,f).
          */
         if ( base == unity_map->addr && length == unity_map->length &&
              ir == unity_map->read && iw == unity_map->write )
@@ -168,55 +168,52 @@ static int __init reserve_unity_map_for_
     unity_map->next = ivrs_mappings[bdf].unity_map;
     ivrs_mappings[bdf].unity_map = unity_map;
 
+    if ( paging_mode > amd_iommu_min_paging_mode )
+        amd_iommu_min_paging_mode = paging_mode;
+
     return 0;
 }
 
-static int __init register_exclusion_range_for_all_devices(
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_all_devices(
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
-    unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
-    unsigned int bdf;
     int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
-        /* reserve r/w unity-mapped page entries for devices */
-        /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
-            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
-    }
-    /* register IOMMU exclusion range settings */
-    if ( !rc && limit >= iommu_top )
+    if ( exclusion )
     {
         for_each_amd_iommu( iommu )
         {
-            rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                               true /* all */, iw, ir);
-            if ( rc )
-                break;
+            int ret = reserve_iommu_exclusion_range(iommu, base, limit,
+                                                    true /* all */);
+
+            if ( ret && !rc )
+                rc = ret;
         }
     }
 
+    if ( !exclusion || rc )
+    {
+        paddr_t length = limit + PAGE_SIZE - base;
+        unsigned int bdf;
+
+        /* reserve r/w unity-mapped page entries for devices */
+        for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
+    }
+
     return rc;
 }
 
-static int __init register_exclusion_range_for_device(
-    u16 bdf, unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_device(
+    unsigned int bdf, paddr_t base, paddr_t limit,
+    bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
-    unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     u16 req;
     int rc = 0;
@@ -230,27 +227,19 @@ static int __init register_exclusion_ran
     req = ivrs_mappings[bdf].dte_requestor_id;
 
     /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
+    if ( exclusion )
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           false /* all */);
+    if ( !exclusion || rc )
+    {
+        paddr_t length = limit + PAGE_SIZE - base;
+
         /* reserve unity-mapped page entries for device */
-        /* note: these entries are part of the exclusion range */
         rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?:
              reserve_unity_map_for_device(seg, req, base, length, iw, ir);
-
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
     }
-
-    /* register IOMMU exclusion range settings for device */
-    if ( !rc && limit >= iommu_top  )
+    else
     {
-        rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                           false /* all */, iw, ir);
         ivrs_mappings[bdf].dte_allow_exclusion = IOMMU_CONTROL_ENABLED;
         ivrs_mappings[req].dte_allow_exclusion = IOMMU_CONTROL_ENABLED;
     }
@@ -258,53 +247,42 @@ static int __init register_exclusion_ran
     return rc;
 }
 
-static int __init register_exclusion_range_for_iommu_devices(
-    struct amd_iommu *iommu,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_iommu_devices(
+    struct amd_iommu *iommu, paddr_t base, paddr_t limit,
+    bool iw, bool ir, bool exclusion)
 {
-    unsigned long range_top, iommu_top, length;
+    /* note: 'limit' parameter is assumed to be page-aligned */
+    paddr_t length = limit + PAGE_SIZE - base;
     unsigned int bdf;
     u16 req;
-    int rc = 0;
+    int rc;
 
-    /* is part of exclusion range inside of IOMMU virtual address space? */
-    /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
-        /* reserve r/w unity-mapped page entries for devices */
-        /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
-        {
-            if ( iommu == find_iommu_for_device(iommu->seg, bdf) )
-            {
-                req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
-                rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
-                                                  iw, ir) ?:
-                     reserve_unity_map_for_device(iommu->seg, req, base, length,
-                                                  iw, ir);
-            }
-        }
-
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
+    if ( exclusion )
+    {
+        rc = reserve_iommu_exclusion_range(iommu, base, limit, true /* all */);
+        if ( !rc )
+            return 0;
     }
 
-    /* register IOMMU exclusion range settings */
-    if ( !rc && limit >= iommu_top )
-        rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                           true /* all */, iw, ir);
+    /* reserve unity-mapped page entries for devices */
+    for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+    {
+        if ( iommu != find_iommu_for_device(iommu->seg, bdf) )
+            continue;
+
+        req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
+        rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
+                                          iw, ir) ?:
+             reserve_unity_map_for_device(iommu->seg, req, base, length,
+                                          iw, ir);
+    }
 
     return rc;
 }
 
 static int __init parse_ivmd_device_select(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     u16 bdf;
 
@@ -315,12 +293,12 @@ static int __init parse_ivmd_device_sele
         return -ENODEV;
     }
 
-    return register_exclusion_range_for_device(bdf, base, limit, iw, ir);
+    return register_range_for_device(bdf, base, limit, iw, ir, exclusion);
 }
 
 static int __init parse_ivmd_device_range(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     unsigned int first_bdf, last_bdf, bdf;
     int error;
@@ -342,15 +320,15 @@ static int __init parse_ivmd_device_rang
     }
 
     for ( bdf = first_bdf, error = 0; (bdf <= last_bdf) && !error; bdf++ )
-        error = register_exclusion_range_for_device(
-            bdf, base, limit, iw, ir);
+        error = register_range_for_device(
+            bdf, base, limit, iw, ir, exclusion);
 
     return error;
 }
 
 static int __init parse_ivmd_device_iommu(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
     struct amd_iommu *iommu;
@@ -365,14 +343,14 @@ static int __init parse_ivmd_device_iomm
         return -ENODEV;
     }
 
-    return register_exclusion_range_for_iommu_devices(
-        iommu, base, limit, iw, ir);
+    return register_range_for_iommu_devices(
+        iommu, base, limit, iw, ir, exclusion);
 }
 
 static int __init parse_ivmd_block(const struct acpi_ivrs_memory *ivmd_block)
 {
     unsigned long start_addr, mem_length, base, limit;
-    u8 iw, ir;
+    bool iw = true, ir = true, exclusion = false;
 
     if ( ivmd_block->header.length < sizeof(*ivmd_block) )
     {
@@ -389,13 +367,11 @@ static int __init parse_ivmd_block(const
                     ivmd_block->header.type, start_addr, mem_length);
 
     if ( ivmd_block->header.flags & ACPI_IVMD_EXCLUSION_RANGE )
-        iw = ir = IOMMU_CONTROL_ENABLED;
+        exclusion = true;
     else if ( ivmd_block->header.flags & ACPI_IVMD_UNITY )
     {
-        iw = ivmd_block->header.flags & ACPI_IVMD_READ ?
-            IOMMU_CONTROL_ENABLED : IOMMU_CONTROL_DISABLED;
-        ir = ivmd_block->header.flags & ACPI_IVMD_WRITE ?
-            IOMMU_CONTROL_ENABLED : IOMMU_CONTROL_DISABLED;
+        iw = ivmd_block->header.flags & ACPI_IVMD_READ;
+        ir = ivmd_block->header.flags & ACPI_IVMD_WRITE;
     }
     else
     {
@@ -406,20 +382,20 @@ static int __init parse_ivmd_block(const
     switch( ivmd_block->header.type )
     {
     case ACPI_IVRS_TYPE_MEMORY_ALL:
-        return register_exclusion_range_for_all_devices(
-            base, limit, iw, ir);
+        return register_range_for_all_devices(
+            base, limit, iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_ONE:
-        return parse_ivmd_device_select(ivmd_block,
-                                        base, limit, iw, ir);
+        return parse_ivmd_device_select(ivmd_block, base, limit,
+                                        iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_RANGE:
-        return parse_ivmd_device_range(ivmd_block,
-                                       base, limit, iw, ir);
+        return parse_ivmd_device_range(ivmd_block, base, limit,
+                                       iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_IOMMU:
-        return parse_ivmd_device_iommu(ivmd_block,
-                                       base, limit, iw, ir);
+        return parse_ivmd_device_iommu(ivmd_block, base, limit,
+                                       iw, ir, exclusion);
 
     default:
         AMD_IOMMU_DEBUG("IVMD Error: Invalid Block Type!\n");
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -218,6 +218,8 @@ static int __must_check allocate_domain_
     return rc;
 }
 
+int __read_mostly amd_iommu_min_paging_mode = 1;
+
 static int amd_iommu_domain_init(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
@@ -229,11 +231,13 @@ static int amd_iommu_domain_init(struct
      * - HVM could in principle use 3 or 4 depending on how much guest
      *   physical address space we give it, but this isn't known yet so use 4
      *   unilaterally.
+     * - Unity maps may require an even higher number.
      */
-    hd->arch.paging_mode = amd_iommu_get_paging_mode(
-        is_hvm_domain(d)
-        ? 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
-        : get_upper_mfn_bound() + 1);
+    hd->arch.paging_mode = max(amd_iommu_get_paging_mode(
+            is_hvm_domain(d)
+            ? 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
+            : get_upper_mfn_bound() + 1),
+        amd_iommu_min_paging_mode);
 
     return 0;
 }
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -126,6 +126,8 @@ extern struct hpet_sbdf {
     } init;
 } hpet_sbdf;
 
+extern int amd_iommu_min_paging_mode;
+
 extern void *shared_intremap_table;
 extern unsigned long *shared_intremap_inuse;
 

[-- Attachment #15: xsa378/xsa378-4.11-7.patch --]
[-- Type: application/octet-stream, Size: 3353 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: introduce p2m_is_special()

Seeing the similarity of grant, foreign, and (subsequently) direct-MMIO
handling, introduce a new P2M type group named "special" (as in "needing
special accessors to create/destroy").

Also use -EPERM instead of other error codes on the two domain_crash()
paths touched.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
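
The type groups rely on a simple bitmask technique: each P2M type owns
one bit, a group is the OR of its members' bits, and a membership test
reduces to a single AND. A standalone model with invented type names:

    #include <stdbool.h>
    #include <stdio.h>

    enum mem_type { t_ram, t_grant_rw, t_grant_ro, t_foreign };

    #define to_mask(t)    (1u << (t))
    #define GRANT_TYPES   (to_mask(t_grant_rw) | to_mask(t_grant_ro))
    #define SPECIAL_TYPES (GRANT_TYPES | to_mask(t_foreign))

    static bool is_special(enum mem_type t)
    {
        return to_mask(t) & SPECIAL_TYPES;   /* one AND tests the group */
    }

    int main(void)
    {
        printf("ram: %d, foreign: %d\n", is_special(t_ram),
               is_special(t_foreign));
        return 0;
    }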

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -736,7 +736,7 @@ p2m_remove_page(struct p2m_domain *p2m,
         for ( i = 0; i < (1UL << page_order); i++ )
         {
             p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, NULL, NULL);
-            if ( !p2m_is_grant(t) && !p2m_is_shared(t) && !p2m_is_foreign(t) )
+            if ( !p2m_is_special(t) && !p2m_is_shared(t) )
                 set_gpfn_from_mfn(mfn+i, INVALID_M2P_ENTRY);
         }
     }
@@ -848,13 +848,13 @@ guest_physmap_add_entry(struct domain *d
                                   &ot, &a, 0, NULL, NULL);
             ASSERT(!p2m_is_shared(ot));
         }
-        if ( p2m_is_grant(ot) || p2m_is_foreign(ot) )
+        if ( p2m_is_special(ot) )
         {
-            /* Really shouldn't be unmapping grant/foreign maps this way */
+            /* Don't permit unmapping grant/foreign this way. */
             domain_crash(d);
             p2m_unlock(p2m);
             
-            return -EINVAL;
+            return -EPERM;
         }
         else if ( p2m_is_ram(ot) && !p2m_is_paged(ot) )
         {
@@ -947,8 +947,7 @@ int p2m_change_type_one(struct domain *d
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
     int rc;
 
-    BUG_ON(p2m_is_grant(ot) || p2m_is_grant(nt));
-    BUG_ON(p2m_is_foreign(ot) || p2m_is_foreign(nt));
+    BUG_ON(p2m_is_special(ot) || p2m_is_special(nt));
 
     gfn_lock(p2m, gfn, 0);
 
@@ -1091,11 +1090,11 @@ static int set_typed_p2m_entry(struct do
         gfn_unlock(p2m, gfn, order);
         return cur_order + 1;
     }
-    if ( p2m_is_grant(ot) || p2m_is_foreign(ot) )
+    if ( p2m_is_special(ot) )
     {
         gfn_unlock(p2m, gfn, order);
         domain_crash(d);
-        return -ENOENT;
+        return -EPERM;
     }
     else if ( p2m_is_ram(ot) )
     {
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -142,6 +142,10 @@ typedef unsigned int p2m_query_t;
                             | p2m_to_mask(p2m_ram_logdirty) )
 #define P2M_SHARED_TYPES   (p2m_to_mask(p2m_ram_shared))
 
+/* Types established/cleaned up via special accessors. */
+#define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \
+                           p2m_to_mask(p2m_map_foreign))
+
 /* Valid types not necessarily associated with a (valid) MFN. */
 #define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES                  \
                                | p2m_to_mask(p2m_mmio_direct) \
@@ -170,6 +174,7 @@ typedef unsigned int p2m_query_t;
 #define p2m_is_paged(_t)    (p2m_to_mask(_t) & P2M_PAGED_TYPES)
 #define p2m_is_sharable(_t) (p2m_to_mask(_t) & P2M_SHARABLE_TYPES)
 #define p2m_is_shared(_t)   (p2m_to_mask(_t) & P2M_SHARED_TYPES)
+#define p2m_is_special(_t)  (p2m_to_mask(_t) & P2M_SPECIAL_TYPES)
 #define p2m_is_broken(_t)   (p2m_to_mask(_t) & P2M_BROKEN_TYPES)
 #define p2m_is_foreign(_t)  (p2m_to_mask(_t) & p2m_to_mask(p2m_map_foreign))
 

[-- Attachment #16: xsa378/xsa378-4.11-8.patch --]
[-- Type: application/octet-stream, Size: 5923 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: guard (in particular) identity mapping entries

Such entries, created by set_identity_p2m_entry(), should only be
destroyed by clear_identity_p2m_entry(). However, similarly, entries
created by set_mmio_p2m_entry() should only be torn down by
clear_mmio_p2m_entry(), so the logic gets based upon p2m_mmio_direct as
the entry type (separation between "ordinary" and 1:1 mappings would
require a further indicator to tell apart the two).

As to the guest_remove_page() change, commit 48dfb297a20a ("x86/PVH:
allow guest_remove_page to remove p2m_mmio_direct pages"), which
introduced the call to clear_mmio_p2m_entry(), claimed this was done for
hwdom only without this actually having been the case. However, this
code shouldn't be there in the first place, as MMIO entries shouldn't be
dropped this way. Avoid triggering the warning again that 48dfb297a20a
silenced by an adjustment to xenmem_add_to_physmap_one() instead.

Note that guest_physmap_mark_populate_on_demand() gets tightened beyond
the immediate purpose of this change.

Note also that I didn't inspect code which isn't security supported,
e.g. sharing, paging, or altp2m.

This is CVE-2021-28694 / part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
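
Reduced to its essence (names invented), the rule being enforced is that
generic removal paths refuse entry types which have dedicated
destructors, rather than quietly unmapping them:

    #include <errno.h>
    #include <stdio.h>

    enum entry_type { et_ram, et_mmio_direct };

    static int remove_entry(enum entry_type t)
    {
        if ( t == et_mmio_direct )
            return -EPERM;   /* only the matching clear_*() helper may */

        /* ... ordinary RAM removal would go here ... */
        return 0;
    }

    int main(void)
    {
        printf("ram: %d, mmio: %d\n", remove_entry(et_ram),
               remove_entry(et_mmio_direct));
        return 0;
    }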

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4783,7 +4783,9 @@ int xenmem_add_to_physmap_one(
 
     /* Remove previously mapped page if it was present. */
     prev_mfn = mfn_x(get_gfn(d, gfn_x(gpfn), &p2mt));
-    if ( mfn_valid(_mfn(prev_mfn)) )
+    if ( p2mt == p2m_mmio_direct )
+        rc = -EPERM;
+    else if ( mfn_valid(_mfn(prev_mfn)) )
     {
         if ( is_xen_heap_mfn(prev_mfn) )
             /* Xen heap frames are simply unhooked from this phys slot. */
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -725,7 +725,8 @@ p2m_remove_page(struct p2m_domain *p2m,
                                           &cur_order, NULL);
 
         if ( p2m_is_valid(t) &&
-             (!mfn_valid(_mfn(mfn)) || mfn + i != mfn_x(mfn_return)) )
+             (!mfn_valid(_mfn(mfn)) || t == p2m_mmio_direct ||
+              mfn + i != mfn_x(mfn_return)) )
             return -EILSEQ;
 
         i += (1UL << cur_order) - ((gfn_l + i) & ((1UL << cur_order) - 1));
@@ -803,7 +804,7 @@ guest_physmap_add_entry(struct domain *d
     if ( p2m_is_foreign(t) )
         return -EINVAL;
 
-    if ( !mfn_valid(mfn) )
+    if ( !mfn_valid(mfn) || t == p2m_mmio_direct )
     {
         ASSERT_UNREACHABLE();
         return -EINVAL;
@@ -850,7 +851,7 @@ guest_physmap_add_entry(struct domain *d
         }
         if ( p2m_is_special(ot) )
         {
-            /* Don't permit unmapping grant/foreign this way. */
+            /* Don't permit unmapping grant/foreign/direct-MMIO this way. */
             domain_crash(d);
             p2m_unlock(p2m);
             
@@ -1192,8 +1193,8 @@ int set_identity_p2m_entry(struct domain
  *    order+1  for caller to retry with order (guaranteed smaller than
  *             the order value passed in)
  */
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn_l, mfn_t mfn,
-                         unsigned int order)
+static int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn_l,
+                                mfn_t mfn, unsigned int order)
 {
     int rc = -EINVAL;
     gfn_t gfn = _gfn(gfn_l);
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -1302,17 +1302,17 @@ guest_physmap_mark_populate_on_demand(st
 
         p2m->get_entry(p2m, gfn_add(gfn, i), &ot, &a, 0, &cur_order, NULL);
         n = 1UL << min(order, cur_order);
-        if ( p2m_is_ram(ot) )
+        if ( ot == p2m_populate_on_demand )
+        {
+            /* Count how many PoD entries we'll be replacing if successful */
+            pod_count += n;
+        }
+        else if ( ot != p2m_invalid && ot != p2m_mmio_dm )
         {
             P2M_DEBUG("gfn_to_mfn returned type %d!\n", ot);
             rc = -EBUSY;
             goto out;
         }
-        else if ( ot == p2m_populate_on_demand )
-        {
-            /* Count how man PoD entries we'll be replacing if successful */
-            pod_count += n;
-        }
     }
 
     /* Now, actually do the two-way mapping */
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -335,7 +335,7 @@ int guest_remove_page(struct domain *d,
     }
     if ( p2mt == p2m_mmio_direct )
     {
-        rc = clear_mmio_p2m_entry(d, gmfn, mfn, PAGE_ORDER_4K);
+        rc = -EPERM;
         goto out_put_gfn;
     }
 #else
@@ -1651,6 +1651,15 @@ int prepare_ring_for_helper(
         return -ENOENT;
     }
 #endif
+#ifdef CONFIG_X86
+    if ( p2mt == p2m_mmio_direct )
+    {
+        if ( page )
+            put_page(page);
+
+        return -EPERM;
+    }
+#endif
 
     if ( !page )
         return -EINVAL;
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -144,7 +144,8 @@ typedef unsigned int p2m_query_t;
 
 /* Types established/cleaned up via special accessors. */
 #define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \
-                           p2m_to_mask(p2m_map_foreign))
+                           p2m_to_mask(p2m_map_foreign) | \
+                           p2m_to_mask(p2m_mmio_direct))
 
 /* Valid types not necessarily associated with a (valid) MFN. */
 #define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES                  \
@@ -629,8 +630,6 @@ int set_foreign_p2m_entry(struct domain
 /* Set mmio addresses in the p2m table (for pass-through) */
 int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
                        unsigned int order, p2m_access_t access);
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
-                         unsigned int order);
 
 /* Set identity addresses in the p2m table (for pass-through) */
 int set_identity_p2m_entry(struct domain *d, unsigned long gfn,

[-- Attachment #17: xsa378/xsa378-4.12-0a.patch --]
[-- Type: application/octet-stream, Size: 2532 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: fix PoD accounting in guest_physmap_add_entry()

The initial observation was that the mfn_valid() check comes too late:
Neither mfn_add() nor mfn_to_page() (let alone de-referencing the
result of the latter) are valid for MFNs failing this check. Move it up
and - noticing that there's no caller doing so - also add an assertion
that this should never produce "false" here.

In turn this would have meant that the "else" to that if() could now go
away, which didn't seem right at all. And indeed, considering callers
like memory_exchange() or various grant table functions, the PoD
accounting should have been outside of that if() from the very
beginning.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
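
The control flow change can be reduced to a small sketch (all names
invented): the count of replaced PoD entries has to be deducted whenever
the entry update succeeds, so the accounting sits once in the common
success path rather than in just one branch:

    #include <errno.h>
    #include <stdio.h>

    static long pod_entry_count = 4;

    static int set_entry(int fail)
    {
        return fail ? -ENOMEM : 0;     /* stand-in for the real update */
    }

    static int add_entry(int fail, long pod_replaced)
    {
        int rc = set_entry(fail);

        if ( rc == 0 )
            pod_entry_count -= pod_replaced;   /* common to all callers */

        return rc;
    }

    int main(void)
    {
        add_entry(0, 2);
        add_entry(1, 2);                       /* failure: no deduction */
        printf("PoD entries left: %ld\n", pod_entry_count);    /* 2 */
        return 0;
    }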

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -864,6 +864,12 @@ guest_physmap_add_entry(struct domain *d
     if ( p2m_is_foreign(t) )
         return -EINVAL;
 
+    if ( !mfn_valid(mfn) )
+    {
+        ASSERT_UNREACHABLE();
+        return -EINVAL;
+    }
+
     p2m_lock(p2m);
 
     P2M_DEBUG("adding gfn=%#lx mfn=%#lx\n", gfn_x(gfn), mfn_x(mfn));
@@ -963,12 +969,15 @@ guest_physmap_add_entry(struct domain *d
     }
 
     /* Now, actually do the two-way mapping */
-    if ( mfn_valid(mfn) )
+    rc = p2m_set_entry(p2m, gfn, mfn, page_order, t, p2m->default_access);
+    if ( rc == 0 )
     {
-        rc = p2m_set_entry(p2m, gfn, mfn, page_order, t,
-                           p2m->default_access);
-        if ( rc )
-            goto out; /* Failed to update p2m, bail without updating m2p. */
+#ifdef CONFIG_HVM
+        pod_lock(p2m);
+        p2m->pod.entry_count -= pod_count;
+        BUG_ON(p2m->pod.entry_count < 0);
+        pod_unlock(p2m);
+#endif
 
         if ( !p2m_is_grant(t) )
         {
@@ -977,24 +986,7 @@ guest_physmap_add_entry(struct domain *d
                                   gfn_x(gfn_add(gfn, i)));
         }
     }
-    else
-    {
-        gdprintk(XENLOG_WARNING, "Adding bad mfn to p2m map (%#lx -> %#lx)\n",
-                 gfn_x(gfn), mfn_x(mfn));
-        rc = p2m_set_entry(p2m, gfn, INVALID_MFN, page_order,
-                           p2m_invalid, p2m->default_access);
-#ifdef CONFIG_HVM
-        if ( rc == 0 )
-        {
-            pod_lock(p2m);
-            p2m->pod.entry_count -= pod_count;
-            BUG_ON(p2m->pod.entry_count < 0);
-            pod_unlock(p2m);
-        }
-#endif
-    }
 
-out:
     p2m_unlock(p2m);
 
     return rc;

[-- Attachment #18: xsa378/xsa378-4.12-0b.patch --]
[-- Type: application/octet-stream, Size: 2039 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: don't ignore p2m_remove_page()'s return value

It's not very nice to return from guest_physmap_add_entry() after
perhaps already having made some changes to the P2M, but this is pre-
existing practice in the function, and imo better than ignoring errors.

Take the liberty and replace an mfn_add() instance with a local variable
already holding the result (as proven by the check immediately ahead).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -772,8 +772,7 @@ void p2m_final_teardown(struct domain *d
     p2m_teardown_hostp2m(d);
 }
 
-
-static int
+static int __must_check
 p2m_remove_page(struct p2m_domain *p2m, unsigned long gfn_l, unsigned long mfn,
                 unsigned int page_order)
 {
@@ -961,9 +960,9 @@ guest_physmap_add_entry(struct domain *d
                 ASSERT(mfn_valid(omfn));
                 P2M_DEBUG("old gfn=%#lx -> mfn %#lx\n",
                           gfn_x(ogfn) , mfn_x(omfn));
-                if ( mfn_eq(omfn, mfn_add(mfn, i)) )
-                    p2m_remove_page(p2m, gfn_x(ogfn), mfn_x(mfn_add(mfn, i)),
-                                    0);
+                if ( mfn_eq(omfn, mfn_add(mfn, i)) &&
+                     (rc = p2m_remove_page(p2m, gfn_x(ogfn), mfn_x(omfn), 0)) )
+                    goto out;
             }
         }
     }
@@ -987,6 +986,7 @@ guest_physmap_add_entry(struct domain *d
         }
     }
 
+ out:
     p2m_unlock(p2m);
 
     return rc;
@@ -2646,9 +2646,9 @@ int p2m_change_altp2m_gfn(struct domain
 
     if ( gfn_eq(new_gfn, INVALID_GFN) )
     {
-        if ( mfn_valid(mfn) )
-            p2m_remove_page(ap2m, gfn_x(old_gfn), mfn_x(mfn), PAGE_ORDER_4K);
-        rc = 0;
+        rc = mfn_valid(mfn)
+             ? p2m_remove_page(ap2m, gfn_x(old_gfn), mfn_x(mfn), PAGE_ORDER_4K)
+             : 0;
         goto out;
     }
 

[-- Attachment #19: xsa378/xsa378-4.12-0c.patch --]
[-- Type: application/octet-stream, Size: 2339 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: don't assert that the passed in MFN matches for a remove

guest_physmap_remove_page() gets handed an MFN from the outside, yet
takes the necessary lock to prevent further changes to the GFN <-> MFN
mapping itself. While some callers, in particular guest_remove_page()
(by way of having called get_gfn_query()), hold the GFN lock already,
various others (most notably perhaps the 2nd instance in
xenmem_add_to_physmap_one()) don't. While it also is an option to fix
all the callers, deal with the issue in p2m_remove_page() instead:
Replace the ASSERT() by a conditional and split the loop into two, such
that all checking gets done before any modification would occur.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
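
The restructuring follows a general check-then-commit pattern: a first
pass that only validates, then a second pass that modifies, so a
mismatch discovered part-way through can no longer leave a half-changed
range behind. A minimal standalone sketch with an invented data layout:

    #include <errno.h>
    #include <stdio.h>

    #define N 8

    static int remove_range(int *entries)
    {
        unsigned int i;

        for ( i = 0; i < N; i++ )      /* pass 1: validate everything */
            if ( entries[i] < 0 )
                return -EILSEQ;

        for ( i = 0; i < N; i++ )      /* pass 2: commit the changes */
            entries[i] = -1;

        return 0;
    }

    int main(void)
    {
        int e[N] = { 0, 1, 2, 3, 4, 5, 6, 7 };

        printf("%d\n", remove_range(e));   /* 0: all entries consistent */
        return 0;
    }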

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -778,7 +778,6 @@ p2m_remove_page(struct p2m_domain *p2m,
 {
     unsigned long i;
     gfn_t gfn = _gfn(gfn_l);
-    mfn_t mfn_return;
     p2m_type_t t;
     p2m_access_t a;
 
@@ -789,15 +788,26 @@ p2m_remove_page(struct p2m_domain *p2m,
     ASSERT(gfn_locked_by_me(p2m, gfn));
     P2M_DEBUG("removing gfn=%#lx mfn=%#lx\n", gfn_l, mfn);
 
+    for ( i = 0; i < (1UL << page_order); )
+    {
+        unsigned int cur_order;
+        mfn_t mfn_return = p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0,
+                                          &cur_order, NULL);
+
+        if ( p2m_is_valid(t) &&
+             (!mfn_valid(_mfn(mfn)) || mfn + i != mfn_x(mfn_return)) )
+            return -EILSEQ;
+
+        i += (1UL << cur_order) - ((gfn_l + i) & ((1UL << cur_order) - 1));
+    }
+
     if ( mfn_valid(_mfn(mfn)) )
     {
         for ( i = 0; i < (1UL << page_order); i++ )
         {
-            mfn_return = p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0,
-                                        NULL, NULL);
+            p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, NULL, NULL);
             if ( !p2m_is_grant(t) && !p2m_is_shared(t) && !p2m_is_foreign(t) )
                 set_gpfn_from_mfn(mfn+i, INVALID_M2P_ENTRY);
-            ASSERT( !p2m_is_valid(t) || mfn + i == mfn_x(mfn_return) );
         }
     }
     return p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, p2m_invalid,

[-- Attachment #20: xsa378/xsa378-4.12-1.patch --]
[-- Type: application/octet-stream, Size: 5112 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: correct global exclusion range extending

Besides unity mapping regions, the AMD IOMMU spec also provides for
exclusion ranges (areas of memory not to be subject to DMA translation)
to be specified by firmware in the ACPI tables. The spec does not put
any constraints on the number of such regions.

Blindly assuming all addresses between any two such ranges should also
be excluded can't be right. Since hardware has room for just a single
such range (comprised of the Exclusion Base Register and the Exclusion
Range Limit Register), combine only adjacent or overlapping regions (for
now; this may require further adjustment in case table entries aren't
sorted by address) with matching exclusion_allow_all settings. This
requires bubbling up error indicators, such that IOMMU init can be
failed when concatenation wasn't possible.

Furthermore, since the exclusion range specified in IOMMU registers
implies R/W access, reject requests asking for less permissions (this
will be brought closer to the spec by a subsequent change).

This is part of XSA-378 / CVE-2021-28695.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
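
The corrected merge rule is easiest to see in isolation. The standalone
sketch below models it with an invented struct and 4k granularity:
extending is only allowed when the new range is adjacent to or
overlapping the existing one, while a disjoint range must fail rather
than silently excluding the gap between the two as well.

    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096UL

    struct excl { bool enabled; unsigned long base, limit; };

    static int excl_reserve(struct excl *x, unsigned long base,
                            unsigned long limit)
    {
        if ( x->enabled )
        {
            if ( x->limit + PAGE_SIZE < base ||
                 limit + PAGE_SIZE < x->base )
                return -EBUSY;          /* a gap would get excluded too */
            if ( x->base < base )
                base = x->base;
            if ( x->limit > limit )
                limit = x->limit;
        }

        x->enabled = true;
        x->base = base;
        x->limit = limit;

        return 0;
    }

    int main(void)
    {
        struct excl x = { 0 };

        excl_reserve(&x, 0x1000, 0x1fff);                    /* first range */
        printf("adjacent: %d\n", excl_reserve(&x, 0x2000, 0x2fff)); /* 0 */
        printf("disjoint: %d\n", excl_reserve(&x, 0x9000, 0x9fff)); /* -EBUSY */
        return 0;
    }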

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -98,12 +98,21 @@ static struct amd_iommu * __init find_io
     return NULL;
 }
 
-static void __init reserve_iommu_exclusion_range(
-    struct amd_iommu *iommu, uint64_t base, uint64_t limit)
+static int __init reserve_iommu_exclusion_range(
+    struct amd_iommu *iommu, uint64_t base, uint64_t limit,
+    bool all, bool iw, bool ir)
 {
+    if ( !ir || !iw )
+        return -EPERM;
+
     /* need to extend exclusion range? */
     if ( iommu->exclusion_enable )
     {
+        if ( iommu->exclusion_limit + PAGE_SIZE < base ||
+             limit + PAGE_SIZE < iommu->exclusion_base ||
+             iommu->exclusion_allow_all != all )
+            return -EBUSY;
+
         if ( iommu->exclusion_base < base )
             base = iommu->exclusion_base;
         if ( iommu->exclusion_limit > limit )
@@ -111,16 +120,11 @@ static void __init reserve_iommu_exclusi
     }
 
     iommu->exclusion_enable = IOMMU_CONTROL_ENABLED;
+    iommu->exclusion_allow_all = all;
     iommu->exclusion_base = base;
     iommu->exclusion_limit = limit;
-}
 
-static void __init reserve_iommu_exclusion_range_all(
-    struct amd_iommu *iommu,
-    unsigned long base, unsigned long limit)
-{
-    reserve_iommu_exclusion_range(iommu, base, limit);
-    iommu->exclusion_allow_all = IOMMU_CONTROL_ENABLED;
+    return 0;
 }
 
 static void __init reserve_unity_map_for_device(
@@ -158,6 +162,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     unsigned int bdf;
+    int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
@@ -179,10 +184,15 @@ static int __init register_exclusion_ran
     if ( limit >= iommu_top )
     {
         for_each_amd_iommu( iommu )
-            reserve_iommu_exclusion_range_all(iommu, base, limit);
+        {
+            rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                               true /* all */, iw, ir);
+            if ( rc )
+                break;
+        }
     }
 
-    return 0;
+    return rc;
 }
 
 static int __init register_exclusion_range_for_device(
@@ -193,6 +203,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     u16 req;
+    int rc = 0;
 
     iommu = find_iommu_for_device(seg, bdf);
     if ( !iommu )
@@ -222,12 +233,13 @@ static int __init register_exclusion_ran
     /* register IOMMU exclusion range settings for device */
     if ( limit >= iommu_top  )
     {
-        reserve_iommu_exclusion_range(iommu, base, limit);
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           false /* all */, iw, ir);
         ivrs_mappings[bdf].dte_allow_exclusion = IOMMU_CONTROL_ENABLED;
         ivrs_mappings[req].dte_allow_exclusion = IOMMU_CONTROL_ENABLED;
     }
 
-    return 0;
+    return rc;
 }
 
 static int __init register_exclusion_range_for_iommu_devices(
@@ -237,6 +249,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     unsigned int bdf;
     u16 req;
+    int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
@@ -267,8 +280,10 @@ static int __init register_exclusion_ran
 
     /* register IOMMU exclusion range settings */
     if ( limit >= iommu_top )
-        reserve_iommu_exclusion_range_all(iommu, base, limit);
-    return 0;
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           true /* all */, iw, ir);
+
+    return rc;
 }
 
 static int __init parse_ivmd_device_select(

[-- Attachment #21: xsa378/xsa378-4.12-2.patch --]
[-- Type: application/octet-stream, Size: 8863 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: correct device unity map handling

Blindly assuming all addresses between any two such ranges, specified by
firmware in the ACPI tables, should also be unity-mapped can't be right.
Nor can it be correct to merge ranges with differing permissions. Track
ranges individually; don't merge at all, but check for overlaps instead.
This requires bubbling up error indicators, such that IOMMU init can be
failed when allocation of a new tracking struct wasn't possible, or an
overlap was detected.

While at it, also stop ignoring
amd_iommu_reserve_domain_unity_map()'s return value.

This is part of XSA-378 / CVE-2021-28695.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
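
To illustrate the overlap policy in isolation, a compilable standalone
model follows (hypothetical names; the real check walks struct
ivrs_unity_map and additionally compares permissions on the exact-match
path).  Half-open ranges [addr, addr+length) either match an existing
entry exactly or must not intersect it at all:

    struct range {
        unsigned long addr, length;
        struct range *next;
    };

    /* Returns 0 if an identical range is already tracked, -1 on any
     * other intersection (the patch returns -EPERM there), and 1 when
     * the caller should allocate and link a new node. */
    static int classify(const struct range *head,
                        unsigned long addr, unsigned long length)
    {
        for ( ; head; head = head->next )
        {
            if ( addr == head->addr && length == head->length )
                return 0;
            if ( head->addr + head->length > addr &&
                 addr + length > head->addr )
                return -1;
        }

        return 1;
    }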

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -127,32 +127,48 @@ static int __init reserve_iommu_exclusio
     return 0;
 }
 
-static void __init reserve_unity_map_for_device(
-    u16 seg, u16 bdf, unsigned long base,
-    unsigned long length, u8 iw, u8 ir)
+static int __init reserve_unity_map_for_device(
+    uint16_t seg, uint16_t bdf, unsigned long base,
+    unsigned long length, bool iw, bool ir)
 {
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
-    unsigned long old_top, new_top;
+    struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map;
 
-    /* need to extend unity-mapped range? */
-    if ( ivrs_mappings[bdf].unity_map_enable )
+    /* Check for overlaps. */
+    for ( ; unity_map; unity_map = unity_map->next )
     {
-        old_top = ivrs_mappings[bdf].addr_range_start +
-            ivrs_mappings[bdf].addr_range_length;
-        new_top = base + length;
-        if ( old_top > new_top )
-            new_top = old_top;
-        if ( ivrs_mappings[bdf].addr_range_start < base )
-            base = ivrs_mappings[bdf].addr_range_start;
-        length = new_top - base;
-    }
-
-    /* extend r/w permissioms and keep aggregate */
-    ivrs_mappings[bdf].write_permission = iw;
-    ivrs_mappings[bdf].read_permission = ir;
-    ivrs_mappings[bdf].unity_map_enable = IOMMU_CONTROL_ENABLED;
-    ivrs_mappings[bdf].addr_range_start = base;
-    ivrs_mappings[bdf].addr_range_length = length;
+        /*
+         * Exact matches are okay. This can in particular happen when
+         * register_exclusion_range_for_device() calls here twice for the
+         * same (s,b,d,f).
+         */
+        if ( base == unity_map->addr && length == unity_map->length &&
+             ir == unity_map->read && iw == unity_map->write )
+            return 0;
+
+        if ( unity_map->addr + unity_map->length > base &&
+             base + length > unity_map->addr )
+        {
+            AMD_IOMMU_DEBUG("IVMD Error: overlap [%lx,%lx) vs [%lx,%lx)\n",
+                            base, base + length, unity_map->addr,
+                            unity_map->addr + unity_map->length);
+            return -EPERM;
+        }
+    }
+
+    /* Populate and insert a new unity map. */
+    unity_map = xmalloc(struct ivrs_unity_map);
+    if ( !unity_map )
+        return -ENOMEM;
+
+    unity_map->read = ir;
+    unity_map->write = iw;
+    unity_map->addr = base;
+    unity_map->length = length;
+    unity_map->next = ivrs_mappings[bdf].unity_map;
+    ivrs_mappings[bdf].unity_map = unity_map;
+
+    return 0;
 }
 
 static int __init register_exclusion_range_for_all_devices(
@@ -175,13 +191,13 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve r/w unity-mapped page entries for devices */
         /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
-            reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
+        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
         /* push 'base' just outside of virtual address space */
         base = iommu_top;
     }
     /* register IOMMU exclusion range settings */
-    if ( limit >= iommu_top )
+    if ( !rc && limit >= iommu_top )
     {
         for_each_amd_iommu( iommu )
         {
@@ -223,15 +239,15 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve unity-mapped page entries for device */
         /* note: these entries are part of the exclusion range */
-        reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
-        reserve_unity_map_for_device(seg, req, base, length, iw, ir);
+        rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?:
+             reserve_unity_map_for_device(seg, req, base, length, iw, ir);
 
         /* push 'base' just outside of virtual address space */
         base = iommu_top;
     }
 
     /* register IOMMU exclusion range settings for device */
-    if ( limit >= iommu_top  )
+    if ( !rc && limit >= iommu_top  )
     {
         rc = reserve_iommu_exclusion_range(iommu, base, limit,
                                            false /* all */, iw, ir);
@@ -262,15 +278,15 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve r/w unity-mapped page entries for devices */
         /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
+        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
         {
             if ( iommu == find_iommu_for_device(iommu->seg, bdf) )
             {
-                reserve_unity_map_for_device(iommu->seg, bdf, base, length,
-                                             iw, ir);
                 req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
-                reserve_unity_map_for_device(iommu->seg, req, base, length,
-                                             iw, ir);
+                rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
+                                                  iw, ir) ?:
+                     reserve_unity_map_for_device(iommu->seg, req, base, length,
+                                                  iw, ir);
             }
         }
 
@@ -279,7 +295,7 @@ static int __init register_exclusion_ran
     }
 
     /* register IOMMU exclusion range settings */
-    if ( limit >= iommu_top )
+    if ( !rc && limit >= iommu_top )
         rc = reserve_iommu_exclusion_range(iommu, base, limit,
                                            true /* all */, iw, ir);
 
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -1189,7 +1189,6 @@ static int __init alloc_ivrs_mappings(u1
     {
         ivrs_mappings[bdf].dte_requestor_id = bdf;
         ivrs_mappings[bdf].dte_allow_exclusion = IOMMU_CONTROL_DISABLED;
-        ivrs_mappings[bdf].unity_map_enable = IOMMU_CONTROL_DISABLED;
         ivrs_mappings[bdf].iommu = NULL;
 
         ivrs_mappings[bdf].intremap_table = NULL;
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -346,15 +346,17 @@ static int amd_iommu_assign_device(struc
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int bdf = PCI_BDF2(pdev->bus, devfn);
     int req_id = get_dma_requestor_id(pdev->seg, bdf);
+    const struct ivrs_unity_map *unity_map;
 
-    if ( ivrs_mappings[req_id].unity_map_enable )
+    for ( unity_map = ivrs_mappings[req_id].unity_map; unity_map;
+          unity_map = unity_map->next )
     {
-        amd_iommu_reserve_domain_unity_map(
-            d,
-            ivrs_mappings[req_id].addr_range_start,
-            ivrs_mappings[req_id].addr_range_length,
-            ivrs_mappings[req_id].write_permission,
-            ivrs_mappings[req_id].read_permission);
+        int rc = amd_iommu_reserve_domain_unity_map(
+                     d, unity_map->addr, unity_map->length,
+                     unity_map->write, unity_map->read);
+
+        if ( rc )
+            return rc;
     }
 
     return reassign_device(pdev->domain, d, devfn, pdev);
--- a/xen/include/asm-x86/amd-iommu.h
+++ b/xen/include/asm-x86/amd-iommu.h
@@ -108,15 +108,19 @@ struct amd_iommu {
     struct list_head ats_devices;
 };
 
+struct ivrs_unity_map {
+    bool read:1;
+    bool write:1;
+    paddr_t addr;
+    unsigned long length;
+    struct ivrs_unity_map *next;
+};
+
 struct ivrs_mappings {
     u16 dte_requestor_id;
     u8 dte_allow_exclusion;
-    u8 unity_map_enable;
-    u8 write_permission;
-    u8 read_permission;
-    unsigned long addr_range_start;
-    unsigned long addr_range_length;
     struct amd_iommu *iommu;
+    struct ivrs_unity_map *unity_map;
 
     /* per device interrupt remapping table */
     void *intremap_table;

[-- Attachment #22: xsa378/xsa378-4.12-3.patch --]
[-- Type: application/octet-stream, Size: 4192 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU: also pass p2m_access_t to p2m_get_iommu_flags()

A subsequent change will want to customize the IOMMU permissions based
on this.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -672,7 +672,7 @@ ept_set_entry(struct p2m_domain *p2m, gf
     uint8_t ipat = 0;
     bool_t need_modify_vtd_table = 1;
     bool_t vtd_pte_present = 0;
-    unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, mfn);
+    unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, p2ma, mfn);
     bool_t needs_sync = 1;
     ept_entry_t old_entry = { .epte = 0 };
     ept_entry_t new_entry = { .epte = 0 };
@@ -799,8 +799,8 @@ ept_set_entry(struct p2m_domain *p2m, gf
 
         /* Safe to read-then-write because we hold the p2m lock */
         if ( ept_entry->mfn == new_entry.mfn &&
-             p2m_get_iommu_flags(ept_entry->sa_p2mt, _mfn(ept_entry->mfn)) ==
-             iommu_flags )
+             p2m_get_iommu_flags(ept_entry->sa_p2mt, ept_entry->access,
+                                 _mfn(ept_entry->mfn)) == iommu_flags )
             need_modify_vtd_table = 0;
 
         ept_p2m_type_to_flags(p2m, &new_entry, p2mt, p2ma);
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -512,6 +512,16 @@ int p2m_pt_handle_deferred_changes(uint6
     return rc;
 }
 
+/* Reconstruct a fake p2m_access_t from stored PTE flags. */
+static p2m_access_t p2m_flags_to_access(unsigned int flags)
+{
+    if ( !(flags & _PAGE_PRESENT) )
+        return p2m_access_n;
+
+    /* No need to look at _PAGE_NX for now. */
+    return flags & _PAGE_RW ? p2m_access_rw : p2m_access_r;
+}
+
 /* Checks only applicable to entries with order > PAGE_ORDER_4K */
 static void check_entry(mfn_t mfn, p2m_type_t new, p2m_type_t old,
                         unsigned int order)
@@ -546,7 +556,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
     l2_pgentry_t l2e_content;
     l3_pgentry_t l3e_content;
     int rc;
-    unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, mfn);
+    unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, p2ma, mfn);
     /*
      * old_mfn and iommu_old_flags control possible flush/update needs on the
      * IOMMU: We need to flush when MFN or flags (i.e. permissions) change.
@@ -609,6 +619,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                 old_mfn = l1e_get_pfn(*p2m_entry);
                 iommu_old_flags =
                     p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                        p2m_flags_to_access(flags),
                                         _mfn(old_mfn));
             }
             else
@@ -654,9 +665,10 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                                    0, L1_PAGETABLE_ENTRIES);
         ASSERT(p2m_entry);
         old_mfn = l1e_get_pfn(*p2m_entry);
+        flags = l1e_get_flags(*p2m_entry);
         iommu_old_flags =
-            p2m_get_iommu_flags(p2m_flags_to_type(l1e_get_flags(*p2m_entry)),
-                                _mfn(old_mfn));
+            p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                p2m_flags_to_access(flags), _mfn(old_mfn));
 
         if ( mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt) )
             entry_content = p2m_l1e_from_pfn(mfn_x(mfn),
@@ -687,6 +699,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                 old_mfn = l1e_get_pfn(*p2m_entry);
                 iommu_old_flags =
                     p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                        p2m_flags_to_access(flags),
                                         _mfn(old_mfn));
             }
             else
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -899,7 +899,8 @@ static inline void p2m_altp2m_check(stru
 /*
  * p2m type to IOMMU flags
  */
-static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
+static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt,
+                                               p2m_access_t p2ma, mfn_t mfn)
 {
     unsigned int flags;
 

[-- Attachment #23: xsa378/xsa378-4.12-4.patch --]
[-- Type: application/octet-stream, Size: 12040 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU: generalize VT-d's tracking of mapped RMRR regions

In order to re-use it elsewhere, move the logic to vendor-independent
code and strip it of RMRR specifics.

Note that the prior "map" parameter gets folded into the new "p2ma" one
(which AMD IOMMU code will want to make use of), assigning alternative
meaning ("unmap") to p2m_access_x. Prepare set_identity_p2m_entry() and
p2m_get_iommu_flags() for getting passed access types other than
p2m_access_rw (in the latter case just for p2m_mmio_direct requests).

Note also that, to be on the safe side, an overlap check gets added to
the main loop of iommu_identity_mapping().

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
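
The refcounting scheme being generalized can be sketched with a
single-entry standalone model (hypothetical; the real code keeps a
per-domain list and also rejects merely overlapping requests):

    struct ident {
        unsigned long base, end;
        int access;        /* stands in for p2m_access_t; < 0 = unmap */
        unsigned int count;
    };

    /* Returns 0 on success, 1 when the caller must now clear the p2m
     * entries, -1 on a conflict (-EADDRINUSE in the patch) and -2 when
     * unmapping a range that was never mapped (-ENOENT). */
    static int update(struct ident *m, unsigned long base,
                      unsigned long end, int access)
    {
        if ( !m->count )
        {
            if ( access < 0 )
                return -2;
            m->base = base;
            m->end = end;
            m->access = access;
            m->count = 1;
            return 0;
        }

        if ( m->base != base || m->end != end )
            return -1;     /* overlap handling omitted for brevity */

        if ( access >= 0 )
        {
            if ( m->access != access )
                return -1;
            ++m->count;
            return 0;
        }

        return --m->count ? 0 : 1;
    }

Repeated identical mapping requests thus merely bump the count, and
only the final unmap request triggers actual teardown of the entries.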

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1341,7 +1341,7 @@ int set_identity_p2m_entry(struct domain
         if ( !has_iommu_pt(d) )
             return 0;
         return iommu_legacy_map(d, _dfn(gfn_l), _mfn(gfn_l), PAGE_ORDER_4K,
-                                IOMMUF_readable | IOMMUF_writable);
+                                p2m_access_to_iommu_flags(p2ma));
     }
 
     gfn_lock(p2m, gfn, 0);
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -42,12 +42,6 @@
 #include "vtd.h"
 #include "../ats.h"
 
-struct mapped_rmrr {
-    struct list_head list;
-    u64 base, end;
-    unsigned int count;
-};
-
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
 
@@ -1787,16 +1781,11 @@ out:
 static void iommu_domain_teardown(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    struct mapped_rmrr *mrmrr, *tmp;
 
     if ( list_empty(&acpi_drhd_units) )
         return;
 
-    list_for_each_entry_safe ( mrmrr, tmp, &hd->arch.mapped_rmrrs, list )
-    {
-        list_del(&mrmrr->list);
-        xfree(mrmrr);
-    }
+    iommu_identity_map_teardown(d);
 
     ASSERT(iommu_enabled);
 
@@ -1955,74 +1944,6 @@ static void iommu_set_pgd(struct domain
         pagetable_get_paddr(pagetable_from_mfn(pgd_mfn));
 }
 
-static int rmrr_identity_mapping(struct domain *d, bool_t map,
-                                 const struct acpi_rmrr_unit *rmrr,
-                                 u32 flag)
-{
-    unsigned long base_pfn = rmrr->base_address >> PAGE_SHIFT_4K;
-    unsigned long end_pfn = PAGE_ALIGN_4K(rmrr->end_address) >> PAGE_SHIFT_4K;
-    struct mapped_rmrr *mrmrr;
-    struct domain_iommu *hd = dom_iommu(d);
-
-    ASSERT(pcidevs_locked());
-    ASSERT(rmrr->base_address < rmrr->end_address);
-
-    /*
-     * No need to acquire hd->arch.mapping_lock: Both insertion and removal
-     * get done while holding pcidevs_lock.
-     */
-    list_for_each_entry( mrmrr, &hd->arch.mapped_rmrrs, list )
-    {
-        if ( mrmrr->base == rmrr->base_address &&
-             mrmrr->end == rmrr->end_address )
-        {
-            int ret = 0;
-
-            if ( map )
-            {
-                ++mrmrr->count;
-                return 0;
-            }
-
-            if ( --mrmrr->count )
-                return 0;
-
-            while ( base_pfn < end_pfn )
-            {
-                if ( clear_identity_p2m_entry(d, base_pfn) )
-                    ret = -ENXIO;
-                base_pfn++;
-            }
-
-            list_del(&mrmrr->list);
-            xfree(mrmrr);
-            return ret;
-        }
-    }
-
-    if ( !map )
-        return -ENOENT;
-
-    while ( base_pfn < end_pfn )
-    {
-        int err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw, flag);
-
-        if ( err )
-            return err;
-        base_pfn++;
-    }
-
-    mrmrr = xmalloc(struct mapped_rmrr);
-    if ( !mrmrr )
-        return -ENOMEM;
-    mrmrr->base = rmrr->base_address;
-    mrmrr->end = rmrr->end_address;
-    mrmrr->count = 1;
-    list_add_tail(&mrmrr->list, &hd->arch.mapped_rmrrs);
-
-    return 0;
-}
-
 static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
 {
     struct acpi_rmrr_unit *rmrr;
@@ -2054,7 +1975,9 @@ static int intel_iommu_add_device(u8 dev
              * Since RMRRs are always reserved in the e820 map for the hardware
              * domain, there shouldn't be a conflict.
              */
-            ret = rmrr_identity_mapping(pdev->domain, 1, rmrr, 0);
+            ret = iommu_identity_mapping(pdev->domain, p2m_access_rw,
+                                         rmrr->base_address, rmrr->end_address,
+                                         0);
             if ( ret )
                 dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
                         pdev->domain->domain_id);
@@ -2099,7 +2022,8 @@ static int intel_iommu_remove_device(u8
          * Any flag is nothing to clear these mappings but here
          * its always safe and strict to set 0.
          */
-        rmrr_identity_mapping(pdev->domain, 0, rmrr, 0);
+        iommu_identity_mapping(pdev->domain, p2m_access_x, rmrr->base_address,
+                               rmrr->end_address, 0);
     }
 
     return domain_context_unmap(pdev->domain, devfn, pdev);
@@ -2266,7 +2190,8 @@ static void __hwdom_init setup_hwdom_rmr
          * domain, there shouldn't be a conflict. So its always safe and
          * strict to set 0.
          */
-        ret = rmrr_identity_mapping(d, 1, rmrr, 0);
+        ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address,
+                                     rmrr->end_address, 0);
         if ( ret )
             dprintk(XENLOG_ERR VTDPREFIX,
                      "IOMMU: mapping reserved region failed\n");
@@ -2425,7 +2350,9 @@ static int reassign_device_ownership(
                  * Any RMRR flag is always ignored when remove a device,
                  * but its always safe and strict to set 0.
                  */
-                ret = rmrr_identity_mapping(source, 0, rmrr, 0);
+                ret = iommu_identity_mapping(source, p2m_access_x,
+                                             rmrr->base_address,
+                                             rmrr->end_address, 0);
                 if ( ret != -ENOENT )
                     return ret;
             }
@@ -2522,7 +2449,8 @@ static int intel_iommu_assign_device(
              PCI_BUS(bdf) == bus &&
              PCI_DEVFN2(bdf) == devfn )
         {
-            ret = rmrr_identity_mapping(d, 1, rmrr, flag);
+            ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address,
+                                         rmrr->end_address, flag);
             if ( ret )
             {
                 int rc;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -150,7 +150,7 @@ int arch_iommu_domain_init(struct domain
     struct domain_iommu *hd = dom_iommu(d);
 
     spin_lock_init(&hd->arch.mapping_lock);
-    INIT_LIST_HEAD(&hd->arch.mapped_rmrrs);
+    INIT_LIST_HEAD(&hd->arch.identity_maps);
 
     return 0;
 }
@@ -159,6 +159,99 @@ void arch_iommu_domain_destroy(struct do
 {
 }
 
+struct identity_map {
+    struct list_head list;
+    paddr_t base, end;
+    p2m_access_t access;
+    unsigned int count;
+};
+
+int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
+                           paddr_t base, paddr_t end,
+                           unsigned int flag)
+{
+    unsigned long base_pfn = base >> PAGE_SHIFT_4K;
+    unsigned long end_pfn = PAGE_ALIGN_4K(end) >> PAGE_SHIFT_4K;
+    struct identity_map *map;
+    struct domain_iommu *hd = dom_iommu(d);
+
+    ASSERT(pcidevs_locked());
+    ASSERT(base < end);
+
+    /*
+     * No need to acquire hd->arch.mapping_lock: Both insertion and removal
+     * get done while holding pcidevs_lock.
+     */
+    list_for_each_entry( map, &hd->arch.identity_maps, list )
+    {
+        if ( map->base == base && map->end == end )
+        {
+            int ret = 0;
+
+            if ( p2ma != p2m_access_x )
+            {
+                if ( map->access != p2ma )
+                    return -EADDRINUSE;
+                ++map->count;
+                return 0;
+            }
+
+            if ( --map->count )
+                return 0;
+
+            while ( base_pfn < end_pfn )
+            {
+                if ( clear_identity_p2m_entry(d, base_pfn) )
+                    ret = -ENXIO;
+                base_pfn++;
+            }
+
+            list_del(&map->list);
+            xfree(map);
+
+            return ret;
+        }
+
+        if ( end >= map->base && map->end >= base )
+            return -EADDRINUSE;
+    }
+
+    if ( p2ma == p2m_access_x )
+        return -ENOENT;
+
+    while ( base_pfn < end_pfn )
+    {
+        int err = set_identity_p2m_entry(d, base_pfn, p2ma, flag);
+
+        if ( err )
+            return err;
+        base_pfn++;
+    }
+
+    map = xmalloc(struct identity_map);
+    if ( !map )
+        return -ENOMEM;
+    map->base = base;
+    map->end = end;
+    map->access = p2ma;
+    map->count = 1;
+    list_add_tail(&map->list, &hd->arch.identity_maps);
+
+    return 0;
+}
+
+void iommu_identity_map_teardown(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    struct identity_map *map, *tmp;
+
+    list_for_each_entry_safe ( map, tmp, &hd->arch.identity_maps, list )
+    {
+        list_del(&map->list);
+        xfree(map);
+    }
+}
+
 static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
                                          unsigned long pfn,
                                          unsigned long max_pfn)
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -16,6 +16,7 @@
 
 #include <xen/errno.h>
 #include <xen/list.h>
+#include <xen/mem_access.h>
 #include <xen/spinlock.h>
 #include <asm/processor.h>
 #include <asm/hvm/vmx/vmcs.h>
@@ -48,7 +49,7 @@ struct arch_iommu
     spinlock_t mapping_lock;            /* io page table lock */
     int agaw;     /* adjusted guest address width, 0 is level 2 30-bit */
     u64 iommu_bitmap;              /* bitmap of iommu(s) that the domain uses */
-    struct list_head mapped_rmrrs;
+    struct list_head identity_maps;
 
     /* amd iommu support */
     int paging_mode;
@@ -94,6 +95,11 @@ bool_t iommu_supports_eim(void);
 int iommu_enable_x2apic_IR(void);
 void iommu_disable_x2apic_IR(void);
 
+int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
+                           paddr_t base, paddr_t end,
+                           unsigned int flag);
+void iommu_identity_map_teardown(struct domain *d);
+
 extern bool untrusted_msi;
 
 int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -896,6 +896,34 @@ struct p2m_domain *p2m_get_altp2m(struct
 static inline void p2m_altp2m_check(struct vcpu *v, uint16_t idx) {}
 #endif
 
+/* p2m access to IOMMU flags */
+static inline unsigned int p2m_access_to_iommu_flags(p2m_access_t p2ma)
+{
+    switch ( p2ma )
+    {
+    case p2m_access_rw:
+    case p2m_access_rwx:
+        return IOMMUF_readable | IOMMUF_writable;
+
+    case p2m_access_r:
+    case p2m_access_rx:
+    case p2m_access_rx2rw:
+        return IOMMUF_readable;
+
+    case p2m_access_w:
+    case p2m_access_wx:
+        return IOMMUF_writable;
+
+    case p2m_access_n:
+    case p2m_access_x:
+    case p2m_access_n2rwx:
+        return 0;
+    }
+
+    ASSERT_UNREACHABLE();
+    return 0;
+}
+
 /*
  * p2m type to IOMMU flags
  */
@@ -917,9 +945,10 @@ static inline unsigned int p2m_get_iommu
         flags = IOMMUF_readable;
         break;
     case p2m_mmio_direct:
-        flags = IOMMUF_readable;
-        if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
-            flags |= IOMMUF_writable;
+        flags = p2m_access_to_iommu_flags(p2ma);
+        if ( (flags & IOMMUF_writable) &&
+             rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
+            flags &= ~IOMMUF_writable;
         break;
     default:
         flags = 0;

[-- Attachment #24: xsa378/xsa378-4.12-5.patch --]
[-- Type: application/octet-stream, Size: 7457 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-arrange/complete re-assignment handling

Prior to the assignment step having completed successfully, devices
should not get associated with their new owner. Hand the device to DomIO
(perhaps temporarily), until after the de-assignment step has completed.

De-assignment of a device (from other than Dom0) as well as failure of
reassign_device() during assignment should result in unity mappings
getting torn down. This in turn requires switching to a refcounted
mapping approach, as was already used by VT-d for its RMRRs, to prevent
unmapping a region used by multiple devices.

This is CVE-2021-28696 / part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
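
One detail of the new unmap path deserves highlighting: teardown
continues across all regions even after a failure, remembering only
the first hard error and treating "not mapped" (-ENOENT) as success.
A standalone sketch of the pattern (unmap_one() is an assumed helper,
not a Xen function):

    #include <errno.h>

    struct region {
        struct region *next;
        /* ... range details ... */
    };

    int unmap_one(const struct region *r);    /* assumed helper */

    int unmap_all(const struct region *list)
    {
        int rc = 0;

        for ( ; list; list = list->next )
        {
            int ret = unmap_one(list);

            /* Keep going; record only the first error other than
             * -ENOENT. */
            if ( ret && ret != -ENOENT && !rc )
                rc = ret;
        }

        return rc;
    }

Stopping at the first error would risk leaving later unity mappings in
place, which is the very problem this patch addresses.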

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -613,38 +613,49 @@ int amd_iommu_flush_iotlb_all(struct dom
     return 0;
 }
 
-int amd_iommu_reserve_domain_unity_map(struct domain *domain,
-                                       paddr_t phys_addr,
-                                       unsigned long size, int iw, int ir)
+int amd_iommu_reserve_domain_unity_map(struct domain *d,
+                                       const struct ivrs_unity_map *map,
+                                       unsigned int flag)
 {
-    unsigned long npages, i;
-    unsigned long gfn;
-    unsigned int flags = !!ir;
-    unsigned int flush_flags = 0;
-    int rt = 0;
-
-    if ( iw )
-        flags |= IOMMUF_writable;
-
-    npages = region_to_pages(phys_addr, size);
-    gfn = phys_addr >> PAGE_SHIFT;
-    for ( i = 0; i < npages; i++ )
+    int rc;
+
+    if ( d == dom_io )
+        return 0;
+
+    for ( rc = 0; !rc && map; map = map->next )
     {
-        unsigned long frame = gfn + i;
+        p2m_access_t p2ma = p2m_access_n;
+
+        if ( map->read )
+            p2ma |= p2m_access_r;
+        if ( map->write )
+            p2ma |= p2m_access_w;
 
-        rt = amd_iommu_map_page(domain, _dfn(frame), _mfn(frame), flags,
-                                &flush_flags);
-        if ( rt != 0 )
-            break;
+        rc = iommu_identity_mapping(d, p2ma, map->addr,
+                                    map->addr + map->length - 1, flag);
     }
 
-    /* Use while-break to avoid compiler warning */
-    while ( flush_flags &&
-            amd_iommu_flush_iotlb_pages(domain, _dfn(gfn),
-                                        npages, flush_flags) )
-        break;
+    return rc;
+}
+
+int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
+                                         const struct ivrs_unity_map *map)
+{
+    int rc;
+
+    if ( d == dom_io )
+        return 0;
+
+    for ( rc = 0; map; map = map->next )
+    {
+        int ret = iommu_identity_mapping(d, p2m_access_x, map->addr,
+                                         map->addr + map->length - 1, 0);
+
+        if ( ret && ret != -ENOENT && !rc )
+            rc = ret;
+    }
 
-    return rt;
+    return rc;
 }
 
 /* Share p2m table with iommu. */
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -307,6 +307,7 @@ static int reassign_device(struct domain
     struct amd_iommu *iommu;
     int bdf, rc;
     struct domain_iommu *t = dom_iommu(target);
+    const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
 
     bdf = PCI_BDF2(pdev->bus, pdev->devfn);
     iommu = find_iommu_for_device(pdev->seg, bdf);
@@ -321,10 +322,24 @@ static int reassign_device(struct domain
 
     amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
 
-    if ( devfn == pdev->devfn )
+    /*
+     * If the device belongs to the hardware domain, and it has a unity mapping,
+     * don't remove it from the hardware domain, because BIOS may reference that
+     * mapping.
+     */
+    if ( !is_hardware_domain(source) )
+    {
+        rc = amd_iommu_reserve_domain_unity_unmap(
+                 source,
+                 ivrs_mappings[get_dma_requestor_id(pdev->seg, bdf)].unity_map);
+        if ( rc )
+            return rc;
+    }
+
+    if ( devfn == pdev->devfn && pdev->domain != dom_io )
     {
-        list_move(&pdev->domain_list, &target->arch.pdev_list);
-        pdev->domain = target;
+        list_move(&pdev->domain_list, &dom_io->arch.pdev_list);
+        pdev->domain = dom_io;
     }
 
     rc = allocate_domain_resources(t);
@@ -336,6 +351,12 @@ static int reassign_device(struct domain
                     pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                     source->domain_id, target->domain_id);
 
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->arch.pdev_list);
+        pdev->domain = target;
+    }
+
     return 0;
 }
 
@@ -346,20 +367,28 @@ static int amd_iommu_assign_device(struc
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int bdf = PCI_BDF2(pdev->bus, devfn);
     int req_id = get_dma_requestor_id(pdev->seg, bdf);
-    const struct ivrs_unity_map *unity_map;
+    int rc = amd_iommu_reserve_domain_unity_map(
+                 d, ivrs_mappings[req_id].unity_map, flag);
+
+    if ( !rc )
+        rc = reassign_device(pdev->domain, d, devfn, pdev);
 
-    for ( unity_map = ivrs_mappings[req_id].unity_map; unity_map;
-          unity_map = unity_map->next )
+    if ( rc && !is_hardware_domain(d) )
     {
-        int rc = amd_iommu_reserve_domain_unity_map(
-                     d, unity_map->addr, unity_map->length,
-                     unity_map->write, unity_map->read);
+        int ret = amd_iommu_reserve_domain_unity_unmap(
+                      d, ivrs_mappings[req_id].unity_map);
 
-        if ( rc )
-            return rc;
+        if ( ret )
+        {
+            printk(XENLOG_ERR "AMD-Vi: "
+                   "unity-unmap for %pd/%04x:%02x:%02x.%u failed (%d)\n",
+                   d, pdev->seg, pdev->bus,
+                   PCI_SLOT(devfn), PCI_FUNC(devfn), ret);
+            domain_crash(d);
+        }
     }
 
-    return reassign_device(pdev->domain, d, devfn, pdev);
+    return rc;
 }
 
 static void deallocate_next_page_table(struct page_info *pg, int level)
@@ -425,6 +454,7 @@ static void deallocate_iommu_page_tables
 
 static void amd_iommu_domain_destroy(struct domain *d)
 {
+    iommu_identity_map_teardown(d);
     deallocate_iommu_page_tables(d);
     amd_iommu_flush_all_pages(d);
 }
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -62,8 +62,10 @@ int __must_check amd_iommu_unmap_page(st
 uint64_t amd_iommu_get_address_from_pte(void *entry);
 int __must_check amd_iommu_alloc_root(struct domain_iommu *hd);
 int amd_iommu_reserve_domain_unity_map(struct domain *domain,
-                                       paddr_t phys_addr, unsigned long size,
-                                       int iw, int ir);
+                                       const struct ivrs_unity_map *map,
+                                       unsigned int flag);
+int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
+                                         const struct ivrs_unity_map *map);
 int __must_check amd_iommu_flush_iotlb_pages(struct domain *d, dfn_t dfn,
                                              unsigned int page_count,
                                              unsigned int flush_flags);

[-- Attachment #25: xsa378/xsa378-4.12-6.patch --]
[-- Type: application/octet-stream, Size: 15628 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-arrange exclusion range and unity map recording

The spec makes no provisions for OS behavior here to depend on the
amount of RAM found on the system. While the spec may not sufficiently
clearly distinguish both kinds of regions, they are surely meant to be
separate things: Only regions with ACPI_IVMD_EXCLUSION_RANGE set should
be candidates for putting in the exclusion range registers. (As there's
only a single such pair of registers per IOMMU, secondary non-adjacent
regions with the flag set already get converted to unity mapped
regions.)

First of all, drop the dependency on max_page. With commit b4f042236ae0
("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables") the
use of it here was stale anyway; it was bogus already before, as it
didn't account for max_page getting increased later on. Simply try an
exclusion range registration first, and if it fails (for being
unsuitable or non-mergeable), register a unity mapping range.

With this, various local variables become unnecessary and hence get
dropped at the same time.

With the max_page boundary dropped for using unity maps, the minimum
page table tree height now needs both recording and enforcing in
amd_iommu_domain_init(). Since we can't predict which devices may get
assigned to a domain, our only option is to uniformly force at least
that height for all domains, now that the height isn't dynamic anymore.

Further don't make use of the exclusion range unless ACPI data says so.

Note that exclusion range registration in
register_range_for_all_devices() is on a best-effort basis. Hence unity
map entries also registered are redundant when the former succeeds, but
they do no harm. Improvements in this area can be made later.

Also adjust types where suitable without touching extra lines.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
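
The height enforcement can be modelled roughly as follows (standalone
sketch; the real computation is amd_iommu_get_paging_mode(), whose
rounding and limit checks differ in detail).  Each page-table level
resolves 9 more bits of the frame number, so the required height
follows from the highest frame any unity map reaches, and domain
initialization then clamps to the recorded minimum:

    /* Rough model: one level per 9 bits of frame number. */
    static int height_for(unsigned long max_frames)
    {
        int level = 1;

        while ( max_frames > 512 )      /* 512 entries per table */
        {
            max_frames = (max_frames + 511) >> 9;
            ++level;
        }

        return level;
    }

    /* At domain init, mirroring the hunk below:
     *   paging_mode = max(height_for(own_frames), recorded_minimum);
     */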

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -99,12 +99,8 @@ static struct amd_iommu * __init find_io
 }
 
 static int __init reserve_iommu_exclusion_range(
-    struct amd_iommu *iommu, uint64_t base, uint64_t limit,
-    bool all, bool iw, bool ir)
+    struct amd_iommu *iommu, paddr_t base, paddr_t limit, bool all)
 {
-    if ( !ir || !iw )
-        return -EPERM;
-
     /* need to extend exclusion range? */
     if ( iommu->exclusion_enable )
     {
@@ -133,14 +129,18 @@ static int __init reserve_unity_map_for_
 {
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
     struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map;
+    int paging_mode = amd_iommu_get_paging_mode(PFN_UP(base + length));
+
+    if ( paging_mode < 0 )
+        return paging_mode;
 
     /* Check for overlaps. */
     for ( ; unity_map; unity_map = unity_map->next )
     {
         /*
          * Exact matches are okay. This can in particular happen when
-         * register_exclusion_range_for_device() calls here twice for the
-         * same (s,b,d,f).
+         * register_range_for_device() calls here twice for the same
+         * (s,b,d,f).
          */
         if ( base == unity_map->addr && length == unity_map->length &&
              ir == unity_map->read && iw == unity_map->write )
@@ -168,55 +168,52 @@ static int __init reserve_unity_map_for_
     unity_map->next = ivrs_mappings[bdf].unity_map;
     ivrs_mappings[bdf].unity_map = unity_map;
 
+    if ( paging_mode > amd_iommu_min_paging_mode )
+        amd_iommu_min_paging_mode = paging_mode;
+
     return 0;
 }
 
-static int __init register_exclusion_range_for_all_devices(
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_all_devices(
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
-    unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
-    unsigned int bdf;
     int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
-        /* reserve r/w unity-mapped page entries for devices */
-        /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
-            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
-    }
-    /* register IOMMU exclusion range settings */
-    if ( !rc && limit >= iommu_top )
+    if ( exclusion )
     {
         for_each_amd_iommu( iommu )
         {
-            rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                               true /* all */, iw, ir);
-            if ( rc )
-                break;
+            int ret = reserve_iommu_exclusion_range(iommu, base, limit,
+                                                    true /* all */);
+
+            if ( ret && !rc )
+                rc = ret;
         }
     }
 
+    if ( !exclusion || rc )
+    {
+        paddr_t length = limit + PAGE_SIZE - base;
+        unsigned int bdf;
+
+        /* reserve r/w unity-mapped page entries for devices */
+        for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
+    }
+
     return rc;
 }
 
-static int __init register_exclusion_range_for_device(
-    u16 bdf, unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_device(
+    unsigned int bdf, paddr_t base, paddr_t limit,
+    bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
-    unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     u16 req;
     int rc = 0;
@@ -230,27 +227,19 @@ static int __init register_exclusion_ran
     req = ivrs_mappings[bdf].dte_requestor_id;
 
     /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
+    if ( exclusion )
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           false /* all */);
+    if ( !exclusion || rc )
+    {
+        paddr_t length = limit + PAGE_SIZE - base;
+
         /* reserve unity-mapped page entries for device */
-        /* note: these entries are part of the exclusion range */
         rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?:
              reserve_unity_map_for_device(seg, req, base, length, iw, ir);
-
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
     }
-
-    /* register IOMMU exclusion range settings for device */
-    if ( !rc && limit >= iommu_top  )
+    else
     {
-        rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                           false /* all */, iw, ir);
         ivrs_mappings[bdf].dte_allow_exclusion = IOMMU_CONTROL_ENABLED;
         ivrs_mappings[req].dte_allow_exclusion = IOMMU_CONTROL_ENABLED;
     }
@@ -258,53 +247,42 @@ static int __init register_exclusion_ran
     return rc;
 }
 
-static int __init register_exclusion_range_for_iommu_devices(
-    struct amd_iommu *iommu,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_iommu_devices(
+    struct amd_iommu *iommu, paddr_t base, paddr_t limit,
+    bool iw, bool ir, bool exclusion)
 {
-    unsigned long range_top, iommu_top, length;
+    /* note: 'limit' parameter is assumed to be page-aligned */
+    paddr_t length = limit + PAGE_SIZE - base;
     unsigned int bdf;
     u16 req;
-    int rc = 0;
+    int rc;
 
-    /* is part of exclusion range inside of IOMMU virtual address space? */
-    /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
-        /* reserve r/w unity-mapped page entries for devices */
-        /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
-        {
-            if ( iommu == find_iommu_for_device(iommu->seg, bdf) )
-            {
-                req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
-                rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
-                                                  iw, ir) ?:
-                     reserve_unity_map_for_device(iommu->seg, req, base, length,
-                                                  iw, ir);
-            }
-        }
-
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
+    if ( exclusion )
+    {
+        rc = reserve_iommu_exclusion_range(iommu, base, limit, true /* all */);
+        if ( !rc )
+            return 0;
     }
 
-    /* register IOMMU exclusion range settings */
-    if ( !rc && limit >= iommu_top )
-        rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                           true /* all */, iw, ir);
+    /* reserve unity-mapped page entries for devices */
+    for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+    {
+        if ( iommu != find_iommu_for_device(iommu->seg, bdf) )
+            continue;
+
+        req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
+        rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
+                                          iw, ir) ?:
+             reserve_unity_map_for_device(iommu->seg, req, base, length,
+                                          iw, ir);
+    }
 
     return rc;
 }
 
 static int __init parse_ivmd_device_select(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     u16 bdf;
 
@@ -315,12 +293,12 @@ static int __init parse_ivmd_device_sele
         return -ENODEV;
     }
 
-    return register_exclusion_range_for_device(bdf, base, limit, iw, ir);
+    return register_range_for_device(bdf, base, limit, iw, ir, exclusion);
 }
 
 static int __init parse_ivmd_device_range(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     unsigned int first_bdf, last_bdf, bdf;
     int error;
@@ -342,15 +320,15 @@ static int __init parse_ivmd_device_rang
     }
 
     for ( bdf = first_bdf, error = 0; (bdf <= last_bdf) && !error; bdf++ )
-        error = register_exclusion_range_for_device(
-            bdf, base, limit, iw, ir);
+        error = register_range_for_device(
+            bdf, base, limit, iw, ir, exclusion);
 
     return error;
 }
 
 static int __init parse_ivmd_device_iommu(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
     struct amd_iommu *iommu;
@@ -365,14 +343,14 @@ static int __init parse_ivmd_device_iomm
         return -ENODEV;
     }
 
-    return register_exclusion_range_for_iommu_devices(
-        iommu, base, limit, iw, ir);
+    return register_range_for_iommu_devices(
+        iommu, base, limit, iw, ir, exclusion);
 }
 
 static int __init parse_ivmd_block(const struct acpi_ivrs_memory *ivmd_block)
 {
     unsigned long start_addr, mem_length, base, limit;
-    u8 iw, ir;
+    bool iw = true, ir = true, exclusion = false;
 
     if ( ivmd_block->header.length < sizeof(*ivmd_block) )
     {
@@ -389,13 +367,11 @@ static int __init parse_ivmd_block(const
                     ivmd_block->header.type, start_addr, mem_length);
 
     if ( ivmd_block->header.flags & ACPI_IVMD_EXCLUSION_RANGE )
-        iw = ir = IOMMU_CONTROL_ENABLED;
+        exclusion = true;
     else if ( ivmd_block->header.flags & ACPI_IVMD_UNITY )
     {
-        iw = ivmd_block->header.flags & ACPI_IVMD_READ ?
-            IOMMU_CONTROL_ENABLED : IOMMU_CONTROL_DISABLED;
-        ir = ivmd_block->header.flags & ACPI_IVMD_WRITE ?
-            IOMMU_CONTROL_ENABLED : IOMMU_CONTROL_DISABLED;
+        iw = ivmd_block->header.flags & ACPI_IVMD_READ;
+        ir = ivmd_block->header.flags & ACPI_IVMD_WRITE;
     }
     else
     {
@@ -406,20 +382,20 @@ static int __init parse_ivmd_block(const
     switch( ivmd_block->header.type )
     {
     case ACPI_IVRS_TYPE_MEMORY_ALL:
-        return register_exclusion_range_for_all_devices(
-            base, limit, iw, ir);
+        return register_range_for_all_devices(
+            base, limit, iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_ONE:
-        return parse_ivmd_device_select(ivmd_block,
-                                        base, limit, iw, ir);
+        return parse_ivmd_device_select(ivmd_block, base, limit,
+                                        iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_RANGE:
-        return parse_ivmd_device_range(ivmd_block,
-                                       base, limit, iw, ir);
+        return parse_ivmd_device_range(ivmd_block, base, limit,
+                                       iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_IOMMU:
-        return parse_ivmd_device_iommu(ivmd_block,
-                                       base, limit, iw, ir);
+        return parse_ivmd_device_iommu(ivmd_block, base, limit,
+                                       iw, ir, exclusion);
 
     default:
         AMD_IOMMU_DEBUG("IVMD Error: Invalid Block Type!\n");
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -222,6 +222,8 @@ static int __must_check allocate_domain_
     return rc;
 }
 
+int __read_mostly amd_iommu_min_paging_mode = 1;
+
 static int amd_iommu_domain_init(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
@@ -233,11 +235,13 @@ static int amd_iommu_domain_init(struct
      * - HVM could in principle use 3 or 4 depending on how much guest
      *   physical address space we give it, but this isn't known yet so use 4
      *   unilaterally.
+     * - Unity maps may require an even higher number.
      */
-    hd->arch.paging_mode = amd_iommu_get_paging_mode(
-        is_hvm_domain(d)
-        ? 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
-        : get_upper_mfn_bound() + 1);
+    hd->arch.paging_mode = max(amd_iommu_get_paging_mode(
+            is_hvm_domain(d)
+            ? 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
+            : get_upper_mfn_bound() + 1),
+        amd_iommu_min_paging_mode);
 
     return 0;
 }
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -132,6 +132,8 @@ extern struct hpet_sbdf {
     } init;
 } hpet_sbdf;
 
+extern int amd_iommu_min_paging_mode;
+
 extern void *shared_intremap_table;
 extern unsigned long *shared_intremap_inuse;
 

[-- Attachment #26: xsa378/xsa378-4.12-7.patch --]
[-- Type: application/octet-stream, Size: 3355 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: introduce p2m_is_special()

Seeing the similarity of grant, foreign, and (subsequently) direct-MMIO
handling, introduce a new P2M type group named "special" (as in "needing
special accessors to create/destroy").

Also use -EPERM instead of other error codes on the two domain_crash()
paths touched.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
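
The underlying idiom, which this patch merely extends with a new
group, is easily shown standalone (hypothetical type names): one bit
per type, OR-ed group masks, and single-AND predicates:

    enum ptype { T_RAM, T_GRANT_RW, T_GRANT_RO, T_FOREIGN };

    #define TYPE_MASK(t)   (1u << (t))
    #define GRANT_TYPES    (TYPE_MASK(T_GRANT_RW) | TYPE_MASK(T_GRANT_RO))
    #define SPECIAL_TYPES  (GRANT_TYPES | TYPE_MASK(T_FOREIGN))

    /* "Needing special accessors to create/destroy". */
    #define is_special(t)  (TYPE_MASK(t) & SPECIAL_TYPES)

Adding direct-MMIO to the group (as the follow-up patch does) is then
a one-line change to the mask, with every user of the predicate
picking up the new semantics automatically.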

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -806,7 +806,7 @@ p2m_remove_page(struct p2m_domain *p2m,
         for ( i = 0; i < (1UL << page_order); i++ )
         {
             p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, NULL, NULL);
-            if ( !p2m_is_grant(t) && !p2m_is_shared(t) && !p2m_is_foreign(t) )
+            if ( !p2m_is_special(t) && !p2m_is_shared(t) )
                 set_gpfn_from_mfn(mfn+i, INVALID_M2P_ENTRY);
         }
     }
@@ -917,13 +917,13 @@ guest_physmap_add_entry(struct domain *d
                                   &ot, &a, 0, NULL, NULL);
             ASSERT(!p2m_is_shared(ot));
         }
-        if ( p2m_is_grant(ot) || p2m_is_foreign(ot) )
+        if ( p2m_is_special(ot) )
         {
-            /* Really shouldn't be unmapping grant/foreign maps this way */
+            /* Don't permit unmapping grant/foreign this way. */
             domain_crash(d);
             p2m_unlock(p2m);
             
-            return -EINVAL;
+            return -EPERM;
         }
         else if ( p2m_is_ram(ot) && !p2m_is_paged(ot) )
         {
@@ -1018,8 +1018,7 @@ int p2m_change_type_one(struct domain *d
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
     int rc;
 
-    BUG_ON(p2m_is_grant(ot) || p2m_is_grant(nt));
-    BUG_ON(p2m_is_foreign(ot) || p2m_is_foreign(nt));
+    BUG_ON(p2m_is_special(ot) || p2m_is_special(nt));
 
     gfn_lock(p2m, gfn, 0);
 
@@ -1272,11 +1271,11 @@ static int set_typed_p2m_entry(struct do
         gfn_unlock(p2m, gfn, order);
         return cur_order + 1;
     }
-    if ( p2m_is_grant(ot) || p2m_is_foreign(ot) )
+    if ( p2m_is_special(ot) )
     {
         gfn_unlock(p2m, gfn, order);
         domain_crash(d);
-        return -ENOENT;
+        return -EPERM;
     }
     else if ( p2m_is_ram(ot) )
     {
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -141,6 +141,10 @@ typedef unsigned int p2m_query_t;
                             | p2m_to_mask(p2m_ram_logdirty) )
 #define P2M_SHARED_TYPES   (p2m_to_mask(p2m_ram_shared))
 
+/* Types established/cleaned up via special accessors. */
+#define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \
+                           p2m_to_mask(p2m_map_foreign))
+
 /* Valid types not necessarily associated with a (valid) MFN. */
 #define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES                  \
                                | p2m_to_mask(p2m_mmio_direct) \
@@ -169,6 +173,7 @@ typedef unsigned int p2m_query_t;
 #define p2m_is_paged(_t)    (p2m_to_mask(_t) & P2M_PAGED_TYPES)
 #define p2m_is_sharable(_t) (p2m_to_mask(_t) & P2M_SHARABLE_TYPES)
 #define p2m_is_shared(_t)   (p2m_to_mask(_t) & P2M_SHARED_TYPES)
+#define p2m_is_special(_t)  (p2m_to_mask(_t) & P2M_SPECIAL_TYPES)
 #define p2m_is_broken(_t)   (p2m_to_mask(_t) & P2M_BROKEN_TYPES)
 #define p2m_is_foreign(_t)  (p2m_to_mask(_t) & p2m_to_mask(p2m_map_foreign))
 

[-- Attachment #27: xsa378/xsa378-4.12-8.patch --]
[-- Type: application/octet-stream, Size: 5935 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: guard (in particular) identity mapping entries

Such entries, created by set_identity_p2m_entry(), should only be
destroyed by clear_identity_p2m_entry(). However, similarly, entries
created by set_mmio_p2m_entry() should only be torn down by
clear_mmio_p2m_entry(), so the logic gets based upon p2m_mmio_direct as
the entry type (separation between "ordinary" and 1:1 mappings would
require a further indicator to tell apart the two).

As to the guest_remove_page() change, commit 48dfb297a20a ("x86/PVH:
allow guest_remove_page to remove p2m_mmio_direct pages"), which
introduced the call to clear_mmio_p2m_entry(), claimed this was done for
hwdom only, which was not actually the case. However, this code
shouldn't be there in the first place, as MMIO entries shouldn't be
dropped this way. Avoid re-triggering the warning that 48dfb297a20a
silenced by adjusting xenmem_add_to_physmap_one() instead.

Note that guest_physmap_mark_populate_on_demand() gets tightened beyond
the immediate purpose of this change.

Note also that I didn't inspect code which isn't security supported,
e.g. sharing, paging, or altp2m.

This is CVE-2021-28694 / part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
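
The recurring pattern in the hunks below reduces to a small sketch
(hypothetical names): before any generic remove-or-replace operation,
special entry types get rejected with -EPERM, so they can only be torn
down through their dedicated accessors:

    #include <errno.h>

    enum ptype { T_RAM, T_MMIO_DIRECT /* , ... */ };

    int replace_entry(enum ptype current)
    {
        if ( current == T_MMIO_DIRECT )
            return -EPERM;  /* only the dedicated accessor may do this */

        /* ... generic removal/replacement would follow here ... */
        return 0;
    }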

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4809,7 +4809,9 @@ int xenmem_add_to_physmap_one(
 
     /* Remove previously mapped page if it was present. */
     prev_mfn = mfn_x(get_gfn(d, gfn_x(gpfn), &p2mt));
-    if ( mfn_valid(_mfn(prev_mfn)) )
+    if ( p2mt == p2m_mmio_direct )
+        rc = -EPERM;
+    else if ( mfn_valid(_mfn(prev_mfn)) )
     {
         if ( is_xen_heap_mfn(prev_mfn) )
             /* Xen heap frames are simply unhooked from this phys slot. */
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -795,7 +795,8 @@ p2m_remove_page(struct p2m_domain *p2m,
                                           &cur_order, NULL);
 
         if ( p2m_is_valid(t) &&
-             (!mfn_valid(_mfn(mfn)) || mfn + i != mfn_x(mfn_return)) )
+             (!mfn_valid(_mfn(mfn)) || t == p2m_mmio_direct ||
+              mfn + i != mfn_x(mfn_return)) )
             return -EILSEQ;
 
         i += (1UL << cur_order) - ((gfn_l + i) & ((1UL << cur_order) - 1));
@@ -873,7 +874,7 @@ guest_physmap_add_entry(struct domain *d
     if ( p2m_is_foreign(t) )
         return -EINVAL;
 
-    if ( !mfn_valid(mfn) )
+    if ( !mfn_valid(mfn) || t == p2m_mmio_direct )
     {
         ASSERT_UNREACHABLE();
         return -EINVAL;
@@ -919,7 +920,7 @@ guest_physmap_add_entry(struct domain *d
         }
         if ( p2m_is_special(ot) )
         {
-            /* Don't permit unmapping grant/foreign this way. */
+            /* Don't permit unmapping grant/foreign/direct-MMIO this way. */
             domain_crash(d);
             p2m_unlock(p2m);
             
@@ -1375,8 +1376,8 @@ int set_identity_p2m_entry(struct domain
  *    order+1  for caller to retry with order (guaranteed smaller than
  *             the order value passed in)
  */
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn_l, mfn_t mfn,
-                         unsigned int order)
+static int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn_l,
+                                mfn_t mfn, unsigned int order)
 {
     int rc = -EINVAL;
     gfn_t gfn = _gfn(gfn_l);
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -1297,17 +1297,17 @@ guest_physmap_mark_populate_on_demand(st
 
         p2m->get_entry(p2m, gfn_add(gfn, i), &ot, &a, 0, &cur_order, NULL);
         n = 1UL << min(order, cur_order);
-        if ( p2m_is_ram(ot) )
+        if ( ot == p2m_populate_on_demand )
+        {
+            /* Count how many PoD entries we'll be replacing if successful */
+            pod_count += n;
+        }
+        else if ( ot != p2m_invalid && ot != p2m_mmio_dm )
         {
             P2M_DEBUG("gfn_to_mfn returned type %d!\n", ot);
             rc = -EBUSY;
             goto out;
         }
-        else if ( ot == p2m_populate_on_demand )
-        {
-            /* Count how man PoD entries we'll be replacing if successful */
-            pod_count += n;
-        }
     }
 
     /* Now, actually do the two-way mapping */
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -336,7 +336,7 @@ int guest_remove_page(struct domain *d,
     }
     if ( p2mt == p2m_mmio_direct )
     {
-        rc = clear_mmio_p2m_entry(d, gmfn, mfn, PAGE_ORDER_4K);
+        rc = -EPERM;
         goto out_put_gfn;
     }
 #else
@@ -1724,6 +1724,15 @@ int check_get_page_from_gfn(struct domai
         return -EAGAIN;
     }
 #endif
+#ifdef CONFIG_X86
+    if ( p2mt == p2m_mmio_direct )
+    {
+        if ( page )
+            put_page(page);
+
+        return -EPERM;
+    }
+#endif
 
     if ( !page )
         return -EINVAL;
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -143,7 +143,8 @@ typedef unsigned int p2m_query_t;
 
 /* Types established/cleaned up via special accessors. */
 #define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \
-                           p2m_to_mask(p2m_map_foreign))
+                           p2m_to_mask(p2m_map_foreign) | \
+                           p2m_to_mask(p2m_mmio_direct))
 
 /* Valid types not necessarily associated with a (valid) MFN. */
 #define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES                  \
@@ -640,8 +641,6 @@ int set_foreign_p2m_entry(struct domain
 /* Set mmio addresses in the p2m table (for pass-through) */
 int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
                        unsigned int order, p2m_access_t access);
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
-                         unsigned int order);
 
 /* Set identity addresses in the p2m table (for pass-through) */
 int set_identity_p2m_entry(struct domain *d, unsigned long gfn,

[-- Attachment #28: xsa378/xsa378-4.13-0a.patch --]
[-- Type: application/octet-stream, Size: 2574 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: fix PoD accounting in guest_physmap_add_entry()

The initial observation was that the mfn_valid() check comes too late:
Neither mfn_add() nor mfn_to_page() (let alone de-referencing the
result of the latter) is valid for MFNs failing this check. Move it up
and - noticing that there's no caller doing so - also add an assertion
that this should never produce "false" here.

In turn this would have meant that the "else" to that if() could now go
away, which didn't seem right at all. And indeed, considering callers
like memory_exchange() or various grant table functions, the PoD
accounting should have been outside of that if() from the very
beginning.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: aea270e3f7c0db696c88a0e94b1ece7abd339c84
master date: 2020-02-21 17:14:38 +0100
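
The invariant this establishes, sketched (same pattern as in the hunk
below): whenever formerly-PoD entries get replaced by a successful
p2m_set_entry(), the domain's outstanding PoD entry count has to drop
accordingly, irrespective of the type of the new entries:

    pod_lock(p2m);
    p2m->pod.entry_count -= pod_count;  /* entries no longer PoD */
    BUG_ON(p2m->pod.entry_count < 0);
    pod_unlock(p2m);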

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -881,6 +881,12 @@ guest_physmap_add_entry(struct domain *d
     if ( p2m_is_foreign(t) )
         return -EINVAL;
 
+    if ( !mfn_valid(mfn) )
+    {
+        ASSERT_UNREACHABLE();
+        return -EINVAL;
+    }
+
     p2m_lock(p2m);
 
     P2M_DEBUG("adding gfn=%#lx mfn=%#lx\n", gfn_x(gfn), mfn_x(mfn));
@@ -981,12 +987,13 @@ guest_physmap_add_entry(struct domain *d
     }
 
     /* Now, actually do the two-way mapping */
-    if ( mfn_valid(mfn) )
+    rc = p2m_set_entry(p2m, gfn, mfn, page_order, t, p2m->default_access);
+    if ( rc == 0 )
     {
-        rc = p2m_set_entry(p2m, gfn, mfn, page_order, t,
-                           p2m->default_access);
-        if ( rc )
-            goto out; /* Failed to update p2m, bail without updating m2p. */
+        pod_lock(p2m);
+        p2m->pod.entry_count -= pod_count;
+        BUG_ON(p2m->pod.entry_count < 0);
+        pod_unlock(p2m);
 
         if ( !p2m_is_grant(t) )
         {
@@ -995,22 +1002,7 @@ guest_physmap_add_entry(struct domain *d
                                   gfn_x(gfn_add(gfn, i)));
         }
     }
-    else
-    {
-        gdprintk(XENLOG_WARNING, "Adding bad mfn to p2m map (%#lx -> %#lx)\n",
-                 gfn_x(gfn), mfn_x(mfn));
-        rc = p2m_set_entry(p2m, gfn, INVALID_MFN, page_order,
-                           p2m_invalid, p2m->default_access);
-        if ( rc == 0 )
-        {
-            pod_lock(p2m);
-            p2m->pod.entry_count -= pod_count;
-            BUG_ON(p2m->pod.entry_count < 0);
-            pod_unlock(p2m);
-        }
-    }
 
-out:
     p2m_unlock(p2m);
 
     return rc;

[-- Attachment #29: xsa378/xsa378-4.13-0b.patch --]
[-- Type: application/octet-stream, Size: 2204 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: don't ignore p2m_remove_page()'s return value

It's not very nice to return from guest_physmap_add_entry() after
perhaps already having made some changes to the P2M, but this is pre-
existing practice in the function, and imo better than ignoring errors.

Take the liberty and replace an mfn_add() instance with a local variable
already holding the result (as proven by the check immediately ahead).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a6b051a87a586347969bfbaa6925ac0f0c845413
master date: 2020-04-03 10:56:10 +0200
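
As a reminder, __must_check is Xen's wrapper around the compiler's
warn_unused_result attribute, so with this change a caller silently
discarding the return value no longer builds warning-free.  A minimal
sketch:

    p2m_remove_page(p2m, gfn, mfn, 0);       /* now: compiler warning */
    rc = p2m_remove_page(p2m, gfn, mfn, 0);  /* callers consume rc */
    if ( rc )
        goto out;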

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -773,8 +773,7 @@ void p2m_final_teardown(struct domain *d
     p2m_teardown_hostp2m(d);
 }
 
-
-static int
+static int __must_check
 p2m_remove_page(struct p2m_domain *p2m, unsigned long gfn_l, unsigned long mfn,
                 unsigned int page_order)
 {
@@ -979,9 +978,9 @@ guest_physmap_add_entry(struct domain *d
                 ASSERT(mfn_valid(omfn));
                 P2M_DEBUG("old gfn=%#lx -> mfn %#lx\n",
                           gfn_x(ogfn) , mfn_x(omfn));
-                if ( mfn_eq(omfn, mfn_add(mfn, i)) )
-                    p2m_remove_page(p2m, gfn_x(ogfn), mfn_x(mfn_add(mfn, i)),
-                                    0);
+                if ( mfn_eq(omfn, mfn_add(mfn, i)) &&
+                     (rc = p2m_remove_page(p2m, gfn_x(ogfn), mfn_x(omfn), 0)) )
+                    goto out;
             }
         }
     }
@@ -1003,6 +1002,7 @@ guest_physmap_add_entry(struct domain *d
         }
     }
 
+ out:
     p2m_unlock(p2m);
 
     return rc;
@@ -2690,9 +2690,9 @@ int p2m_change_altp2m_gfn(struct domain
     if ( gfn_eq(new_gfn, INVALID_GFN) )
     {
         mfn = ap2m->get_entry(ap2m, old_gfn, &t, &a, 0, NULL, NULL);
-        if ( mfn_valid(mfn) )
-            p2m_remove_page(ap2m, gfn_x(old_gfn), mfn_x(mfn), PAGE_ORDER_4K);
-        rc = 0;
+        rc = mfn_valid(mfn)
+             ? p2m_remove_page(ap2m, gfn_x(old_gfn), mfn_x(mfn), PAGE_ORDER_4K)
+             : 0;
         goto out;
     }
 

[-- Attachment #30: xsa378/xsa378-4.13-0c.patch --]
[-- Type: application/octet-stream, Size: 2434 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: don't assert that the passed in MFN matches for a remove

guest_physmap_remove_page() gets handed an MFN from the outside, yet
takes the necessary lock to prevent further changes to the GFN <-> MFN
mapping itself. While some callers, in particular guest_remove_page()
(by way of having called get_gfn_query()), hold the GFN lock already,
various others (most notably perhaps the 2nd instance in
xenmem_add_to_physmap_one()) don't. While fixing all the callers would
also be an option, deal with the issue in p2m_remove_page() instead:
Replace the ASSERT() by a conditional and split the loop into two, such
that all checking gets done before any modification occurs.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c65ea16dbcafbe4fe21693b18f8c2a3c5d14600e
master date: 2020-04-03 10:56:55 +0200
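
Schematically, the split turns one read-modify loop into a pure
checking pass followed by a modifying pass (hypothetical reduction;
entry_mismatches() and update_m2p() merely stand in for the real
logic):

    /* Pass 1: read-only; bail before touching anything. */
    for ( i = 0; i < (1UL << page_order); i += chunk /* cur_order */ )
        if ( entry_mismatches(p2m, gfn, i, mfn) )
            return -EILSEQ;

    /* Pass 2: all checks passed; safe to update M2P, then the P2M. */
    for ( i = 0; i < (1UL << page_order); i++ )
        update_m2p(mfn + i);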

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -779,7 +779,6 @@ p2m_remove_page(struct p2m_domain *p2m,
 {
     unsigned long i;
     gfn_t gfn = _gfn(gfn_l);
-    mfn_t mfn_return;
     p2m_type_t t;
     p2m_access_t a;
 
@@ -790,15 +789,26 @@ p2m_remove_page(struct p2m_domain *p2m,
     ASSERT(gfn_locked_by_me(p2m, gfn));
     P2M_DEBUG("removing gfn=%#lx mfn=%#lx\n", gfn_l, mfn);
 
+    for ( i = 0; i < (1UL << page_order); )
+    {
+        unsigned int cur_order;
+        mfn_t mfn_return = p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0,
+                                          &cur_order, NULL);
+
+        if ( p2m_is_valid(t) &&
+             (!mfn_valid(_mfn(mfn)) || mfn + i != mfn_x(mfn_return)) )
+            return -EILSEQ;
+
+        i += (1UL << cur_order) - ((gfn_l + i) & ((1UL << cur_order) - 1));
+    }
+
     if ( mfn_valid(_mfn(mfn)) )
     {
         for ( i = 0; i < (1UL << page_order); i++ )
         {
-            mfn_return = p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0,
-                                        NULL, NULL);
+            p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, NULL, NULL);
             if ( !p2m_is_grant(t) && !p2m_is_shared(t) && !p2m_is_foreign(t) )
                 set_gpfn_from_mfn(mfn+i, INVALID_M2P_ENTRY);
-            ASSERT( !p2m_is_valid(t) || mfn + i == mfn_x(mfn_return) );
         }
     }
     return p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, p2m_invalid,

[-- Attachment #31: xsa378/xsa378-4.13-1.patch --]
[-- Type: application/octet-stream, Size: 5080 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: correct global exclusion range extending

Besides unity mapping regions, the AMD IOMMU spec also provides for
exclusion ranges (areas of memory not to be subject to DMA translation)
to be specified by firmware in the ACPI tables. The spec does not put
any constraints on the number of such regions.

Blindly assuming all addresses between any two such ranges should also
be excluded can't be right. Since hardware has room for just a single
such range (comprised of the Exclusion Base Register and the Exclusion
Range Limit Register), combine only adjacent or overlapping regions (for
now; this may require further adjustment in case table entries aren't
sorted by address) with matching exclusion_allow_all settings. This
requires bubbling up error indicators, such that IOMMU init can be
failed when concatenation wasn't possible.

Furthermore, since the exclusion range specified in IOMMU registers
implies R/W access, reject requests asking for fewer permissions (this
will be brought closer to the spec by a subsequent change).

This is part of XSA-378 / CVE-2021-28695.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
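
A worked example of the new merging logic (hypothetical, page-aligned
addresses, 4k pages; "limit" is the inclusive top of a range):

    /* Existing enabled range: base 0xa0000, limit 0xbffff.
     * Request 0xc0000-0xdffff: 0xbffff + PAGE_SIZE = 0xc0fff is not
     *   below 0xc0000, so the ranges count as adjacent and merge into
     *   0xa0000-0xdffff (provided exclusion_allow_all matches).
     * Request 0x100000-0x101fff: 0xc0fff < 0x100000, hence -EBUSY is
     *   returned and IOMMU init fails, rather than the gap in between
     *   silently getting excluded as well. */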

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -117,12 +117,21 @@ static struct amd_iommu * __init find_io
     return NULL;
 }
 
-static void __init reserve_iommu_exclusion_range(
-    struct amd_iommu *iommu, uint64_t base, uint64_t limit)
+static int __init reserve_iommu_exclusion_range(
+    struct amd_iommu *iommu, uint64_t base, uint64_t limit,
+    bool all, bool iw, bool ir)
 {
+    if ( !ir || !iw )
+        return -EPERM;
+
     /* need to extend exclusion range? */
     if ( iommu->exclusion_enable )
     {
+        if ( iommu->exclusion_limit + PAGE_SIZE < base ||
+             limit + PAGE_SIZE < iommu->exclusion_base ||
+             iommu->exclusion_allow_all != all )
+            return -EBUSY;
+
         if ( iommu->exclusion_base < base )
             base = iommu->exclusion_base;
         if ( iommu->exclusion_limit > limit )
@@ -130,16 +139,11 @@ static void __init reserve_iommu_exclusi
     }
 
     iommu->exclusion_enable = IOMMU_CONTROL_ENABLED;
+    iommu->exclusion_allow_all = all;
     iommu->exclusion_base = base;
     iommu->exclusion_limit = limit;
-}
 
-static void __init reserve_iommu_exclusion_range_all(
-    struct amd_iommu *iommu,
-    unsigned long base, unsigned long limit)
-{
-    reserve_iommu_exclusion_range(iommu, base, limit);
-    iommu->exclusion_allow_all = IOMMU_CONTROL_ENABLED;
+    return 0;
 }
 
 static void __init reserve_unity_map_for_device(
@@ -177,6 +181,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     unsigned int bdf;
+    int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
@@ -198,10 +203,15 @@ static int __init register_exclusion_ran
     if ( limit >= iommu_top )
     {
         for_each_amd_iommu( iommu )
-            reserve_iommu_exclusion_range_all(iommu, base, limit);
+        {
+            rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                               true /* all */, iw, ir);
+            if ( rc )
+                break;
+        }
     }
 
-    return 0;
+    return rc;
 }
 
 static int __init register_exclusion_range_for_device(
@@ -212,6 +222,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     u16 req;
+    int rc = 0;
 
     iommu = find_iommu_for_device(seg, bdf);
     if ( !iommu )
@@ -241,12 +252,13 @@ static int __init register_exclusion_ran
     /* register IOMMU exclusion range settings for device */
     if ( limit >= iommu_top  )
     {
-        reserve_iommu_exclusion_range(iommu, base, limit);
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           false /* all */, iw, ir);
         ivrs_mappings[bdf].dte_allow_exclusion = true;
         ivrs_mappings[req].dte_allow_exclusion = true;
     }
 
-    return 0;
+    return rc;
 }
 
 static int __init register_exclusion_range_for_iommu_devices(
@@ -256,6 +268,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     unsigned int bdf;
     u16 req;
+    int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
@@ -286,8 +299,10 @@ static int __init register_exclusion_ran
 
     /* register IOMMU exclusion range settings */
     if ( limit >= iommu_top )
-        reserve_iommu_exclusion_range_all(iommu, base, limit);
-    return 0;
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           true /* all */, iw, ir);
+
+    return rc;
 }
 
 static int __init parse_ivmd_device_select(

[-- Attachment #32: xsa378/xsa378-4.13-2.patch --]
[-- Type: application/octet-stream, Size: 8498 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: correct device unity map handling

Blindly assuming all addresses between any two such ranges, specified by
firmware in the ACPI tables, should also be unity-mapped can't be right.
Nor can it be correct to merge ranges with differing permissions. Track
ranges individually; don't merge at all, but check for overlaps instead.
This requires bubbling up error indicators, such that IOMMU init can be
failed when allocation of a new tracking struct wasn't possible, or an
overlap was detected.

While at it, also stop ignoring
amd_iommu_reserve_domain_unity_map()'s return value.

This is part of XSA-378 / CVE-2021-28695.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
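
The overlap check added here is the usual interval intersection test;
a small sketch with hypothetical values:

    /* [a, a+alen) and [b, b+blen) overlap iff
     *     a + alen > b  &&  b + blen > a
     * e.g. existing [0x1000, 0x3000) vs. requested [0x2000, 0x4000)
     * -> -EPERM.  An exact duplicate (same address, length, and
     * permissions) is tolerated, since
     * register_exclusion_range_for_device() legitimately reports the
     * same range twice for one (s,b,d,f). */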

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -146,32 +146,48 @@ static int __init reserve_iommu_exclusio
     return 0;
 }
 
-static void __init reserve_unity_map_for_device(
-    u16 seg, u16 bdf, unsigned long base,
-    unsigned long length, u8 iw, u8 ir)
+static int __init reserve_unity_map_for_device(
+    uint16_t seg, uint16_t bdf, unsigned long base,
+    unsigned long length, bool iw, bool ir)
 {
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
-    unsigned long old_top, new_top;
+    struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map;
 
-    /* need to extend unity-mapped range? */
-    if ( ivrs_mappings[bdf].unity_map_enable )
+    /* Check for overlaps. */
+    for ( ; unity_map; unity_map = unity_map->next )
     {
-        old_top = ivrs_mappings[bdf].addr_range_start +
-            ivrs_mappings[bdf].addr_range_length;
-        new_top = base + length;
-        if ( old_top > new_top )
-            new_top = old_top;
-        if ( ivrs_mappings[bdf].addr_range_start < base )
-            base = ivrs_mappings[bdf].addr_range_start;
-        length = new_top - base;
-    }
-
-    /* extend r/w permissioms and keep aggregate */
-    ivrs_mappings[bdf].write_permission = iw;
-    ivrs_mappings[bdf].read_permission = ir;
-    ivrs_mappings[bdf].unity_map_enable = true;
-    ivrs_mappings[bdf].addr_range_start = base;
-    ivrs_mappings[bdf].addr_range_length = length;
+        /*
+         * Exact matches are okay. This can in particular happen when
+         * register_exclusion_range_for_device() calls here twice for the
+         * same (s,b,d,f).
+         */
+        if ( base == unity_map->addr && length == unity_map->length &&
+             ir == unity_map->read && iw == unity_map->write )
+            return 0;
+
+        if ( unity_map->addr + unity_map->length > base &&
+             base + length > unity_map->addr )
+        {
+            AMD_IOMMU_DEBUG("IVMD Error: overlap [%lx,%lx) vs [%lx,%lx)\n",
+                            base, base + length, unity_map->addr,
+                            unity_map->addr + unity_map->length);
+            return -EPERM;
+        }
+    }
+
+    /* Populate and insert a new unity map. */
+    unity_map = xmalloc(struct ivrs_unity_map);
+    if ( !unity_map )
+        return -ENOMEM;
+
+    unity_map->read = ir;
+    unity_map->write = iw;
+    unity_map->addr = base;
+    unity_map->length = length;
+    unity_map->next = ivrs_mappings[bdf].unity_map;
+    ivrs_mappings[bdf].unity_map = unity_map;
+
+    return 0;
 }
 
 static int __init register_exclusion_range_for_all_devices(
@@ -194,13 +210,13 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve r/w unity-mapped page entries for devices */
         /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
-            reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
+        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
         /* push 'base' just outside of virtual address space */
         base = iommu_top;
     }
     /* register IOMMU exclusion range settings */
-    if ( limit >= iommu_top )
+    if ( !rc && limit >= iommu_top )
     {
         for_each_amd_iommu( iommu )
         {
@@ -242,15 +258,15 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve unity-mapped page entries for device */
         /* note: these entries are part of the exclusion range */
-        reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
-        reserve_unity_map_for_device(seg, req, base, length, iw, ir);
+        rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?:
+             reserve_unity_map_for_device(seg, req, base, length, iw, ir);
 
         /* push 'base' just outside of virtual address space */
         base = iommu_top;
     }
 
     /* register IOMMU exclusion range settings for device */
-    if ( limit >= iommu_top  )
+    if ( !rc && limit >= iommu_top  )
     {
         rc = reserve_iommu_exclusion_range(iommu, base, limit,
                                            false /* all */, iw, ir);
@@ -281,15 +297,15 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve r/w unity-mapped page entries for devices */
         /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
+        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
         {
             if ( iommu == find_iommu_for_device(iommu->seg, bdf) )
             {
-                reserve_unity_map_for_device(iommu->seg, bdf, base, length,
-                                             iw, ir);
                 req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
-                reserve_unity_map_for_device(iommu->seg, req, base, length,
-                                             iw, ir);
+                rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
+                                                  iw, ir) ?:
+                     reserve_unity_map_for_device(iommu->seg, req, base, length,
+                                                  iw, ir);
             }
         }
 
@@ -298,7 +314,7 @@ static int __init register_exclusion_ran
     }
 
     /* register IOMMU exclusion range settings */
-    if ( limit >= iommu_top )
+    if ( !rc && limit >= iommu_top )
         rc = reserve_iommu_exclusion_range(iommu, base, limit,
                                            true /* all */, iw, ir);
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -369,15 +369,17 @@ static int amd_iommu_assign_device(struc
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int bdf = PCI_BDF2(pdev->bus, devfn);
     int req_id = get_dma_requestor_id(pdev->seg, bdf);
+    const struct ivrs_unity_map *unity_map;
 
-    if ( ivrs_mappings[req_id].unity_map_enable )
+    for ( unity_map = ivrs_mappings[req_id].unity_map; unity_map;
+          unity_map = unity_map->next )
     {
-        amd_iommu_reserve_domain_unity_map(
-            d,
-            ivrs_mappings[req_id].addr_range_start,
-            ivrs_mappings[req_id].addr_range_length,
-            ivrs_mappings[req_id].write_permission,
-            ivrs_mappings[req_id].read_permission);
+        int rc = amd_iommu_reserve_domain_unity_map(
+                     d, unity_map->addr, unity_map->length,
+                     unity_map->write, unity_map->read);
+
+        if ( rc )
+            return rc;
     }
 
     return reassign_device(pdev->domain, d, devfn, pdev);
--- a/xen/include/asm-x86/amd-iommu.h
+++ b/xen/include/asm-x86/amd-iommu.h
@@ -105,20 +105,24 @@ struct amd_iommu {
     struct list_head ats_devices;
 };
 
+struct ivrs_unity_map {
+    bool read:1;
+    bool write:1;
+    paddr_t addr;
+    unsigned long length;
+    struct ivrs_unity_map *next;
+};
+
 struct ivrs_mappings {
     uint16_t dte_requestor_id;
     bool valid:1;
     bool dte_allow_exclusion:1;
-    bool unity_map_enable:1;
-    bool write_permission:1;
-    bool read_permission:1;
 
     /* ivhd device data settings */
     uint8_t device_flags;
 
-    unsigned long addr_range_start;
-    unsigned long addr_range_length;
     struct amd_iommu *iommu;
+    struct ivrs_unity_map *unity_map;
 
     /* per device interrupt remapping table */
     void *intremap_table;

[-- Attachment #33: xsa378/xsa378-4.13-3.patch --]
[-- Type: application/octet-stream, Size: 4192 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU: also pass p2m_access_t to p2m_get_iommu_flags()

A subsequent change will want to customize the IOMMU permissions based
on this.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
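
The helper introduced below reconstructs an access type from PTE flags
as follows (sketch; _PAGE_NX is deliberately not considered for now):

    /* PTE flags                       -> reconstructed p2m_access_t
     * (_PAGE_PRESENT clear)           -> p2m_access_n
     * _PAGE_PRESENT | _PAGE_RW        -> p2m_access_rw
     * _PAGE_PRESENT, _PAGE_RW clear   -> p2m_access_r */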

--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -678,7 +678,7 @@ ept_set_entry(struct p2m_domain *p2m, gf
     uint8_t ipat = 0;
     bool_t need_modify_vtd_table = 1;
     bool_t vtd_pte_present = 0;
-    unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, mfn);
+    unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, p2ma, mfn);
     bool_t needs_sync = 1;
     ept_entry_t old_entry = { .epte = 0 };
     ept_entry_t new_entry = { .epte = 0 };
@@ -805,8 +805,8 @@ ept_set_entry(struct p2m_domain *p2m, gf
 
         /* Safe to read-then-write because we hold the p2m lock */
         if ( ept_entry->mfn == new_entry.mfn &&
-             p2m_get_iommu_flags(ept_entry->sa_p2mt, _mfn(ept_entry->mfn)) ==
-             iommu_flags )
+             p2m_get_iommu_flags(ept_entry->sa_p2mt, ept_entry->access,
+                                 _mfn(ept_entry->mfn)) == iommu_flags )
             need_modify_vtd_table = 0;
 
         ept_p2m_type_to_flags(p2m, &new_entry, p2mt, p2ma);
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -480,6 +480,16 @@ int p2m_pt_handle_deferred_changes(uint6
     return rc;
 }
 
+/* Reconstruct a fake p2m_access_t from stored PTE flags. */
+static p2m_access_t p2m_flags_to_access(unsigned int flags)
+{
+    if ( !(flags & _PAGE_PRESENT) )
+        return p2m_access_n;
+
+    /* No need to look at _PAGE_NX for now. */
+    return flags & _PAGE_RW ? p2m_access_rw : p2m_access_r;
+}
+
 /* Checks only applicable to entries with order > PAGE_ORDER_4K */
 static void check_entry(mfn_t mfn, p2m_type_t new, p2m_type_t old,
                         unsigned int order)
@@ -514,7 +524,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
     l2_pgentry_t l2e_content;
     l3_pgentry_t l3e_content;
     int rc;
-    unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, mfn);
+    unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, p2ma, mfn);
     /*
      * old_mfn and iommu_old_flags control possible flush/update needs on the
      * IOMMU: We need to flush when MFN or flags (i.e. permissions) change.
@@ -577,6 +587,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                 old_mfn = l1e_get_pfn(*p2m_entry);
                 iommu_old_flags =
                     p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                        p2m_flags_to_access(flags),
                                         _mfn(old_mfn));
             }
             else
@@ -619,9 +630,10 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                                    0, L1_PAGETABLE_ENTRIES);
         ASSERT(p2m_entry);
         old_mfn = l1e_get_pfn(*p2m_entry);
+        flags = l1e_get_flags(*p2m_entry);
         iommu_old_flags =
-            p2m_get_iommu_flags(p2m_flags_to_type(l1e_get_flags(*p2m_entry)),
-                                _mfn(old_mfn));
+            p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                p2m_flags_to_access(flags), _mfn(old_mfn));
 
         if ( mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt) )
             entry_content = p2m_l1e_from_pfn(mfn_x(mfn),
@@ -649,6 +661,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                 old_mfn = l1e_get_pfn(*p2m_entry);
                 iommu_old_flags =
                     p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                        p2m_flags_to_access(flags),
                                         _mfn(old_mfn));
             }
             else
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -908,7 +908,8 @@ static inline void p2m_altp2m_check(stru
 /*
  * p2m type to IOMMU flags
  */
-static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
+static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt,
+                                               p2m_access_t p2ma, mfn_t mfn)
 {
     unsigned int flags;
 

[-- Attachment #34: xsa378/xsa378-4.13-4.patch --]
[-- Type: application/octet-stream, Size: 12063 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU: generalize VT-d's tracking of mapped RMRR regions

In order to re-use it elsewhere, move the logic to vendor-independent
code and strip it of RMRR specifics.

Note that the prior "map" parameter gets folded into the new "p2ma" one
(which AMD IOMMU code will want to make use of), assigning alternative
meaning ("unmap") to p2m_access_x. Prepare set_identity_p2m_entry() and
p2m_get_iommu_flags() for getting passed access types other than
p2m_access_rw (in the latter case just for p2m_mmio_direct requests).

Note also that, to be on the safe side, an overlap check gets added to
the main loop of iommu_identity_mapping().

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
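
With the boolean folded away, call sites distinguish map and unmap
purely via the access type; the calls below reduce to:

    /* Establish (or take another reference on) an r/w identity
     * mapping of [base, end]: */
    rc = iommu_identity_mapping(d, p2m_access_rw, base, end, flag);

    /* Drop a reference again: p2m_access_x is meaningless for an
     * identity mapping, so it can safely carry the "unmap" meaning. */
    rc = iommu_identity_mapping(d, p2m_access_x, base, end, 0);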

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1351,7 +1351,7 @@ int set_identity_p2m_entry(struct domain
         if ( !is_iommu_enabled(d) )
             return 0;
         return iommu_legacy_map(d, _dfn(gfn_l), _mfn(gfn_l), PAGE_ORDER_4K,
-                                IOMMUF_readable | IOMMUF_writable);
+                                p2m_access_to_iommu_flags(p2ma));
     }
 
     gfn_lock(p2m, gfn, 0);
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -42,12 +42,6 @@
 #include "vtd.h"
 #include "../ats.h"
 
-struct mapped_rmrr {
-    struct list_head list;
-    u64 base, end;
-    unsigned int count;
-};
-
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
 
@@ -1799,17 +1793,12 @@ out:
 static void iommu_domain_teardown(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    struct mapped_rmrr *mrmrr, *tmp;
     const struct acpi_drhd_unit *drhd;
 
     if ( list_empty(&acpi_drhd_units) )
         return;
 
-    list_for_each_entry_safe ( mrmrr, tmp, &hd->arch.mapped_rmrrs, list )
-    {
-        list_del(&mrmrr->list);
-        xfree(mrmrr);
-    }
+    iommu_identity_map_teardown(d);
 
     ASSERT(is_iommu_enabled(d));
 
@@ -1963,74 +1952,6 @@ static void iommu_set_pgd(struct domain
         pagetable_get_paddr(pagetable_from_mfn(pgd_mfn));
 }
 
-static int rmrr_identity_mapping(struct domain *d, bool_t map,
-                                 const struct acpi_rmrr_unit *rmrr,
-                                 u32 flag)
-{
-    unsigned long base_pfn = rmrr->base_address >> PAGE_SHIFT_4K;
-    unsigned long end_pfn = PAGE_ALIGN_4K(rmrr->end_address) >> PAGE_SHIFT_4K;
-    struct mapped_rmrr *mrmrr;
-    struct domain_iommu *hd = dom_iommu(d);
-
-    ASSERT(pcidevs_locked());
-    ASSERT(rmrr->base_address < rmrr->end_address);
-
-    /*
-     * No need to acquire hd->arch.mapping_lock: Both insertion and removal
-     * get done while holding pcidevs_lock.
-     */
-    list_for_each_entry( mrmrr, &hd->arch.mapped_rmrrs, list )
-    {
-        if ( mrmrr->base == rmrr->base_address &&
-             mrmrr->end == rmrr->end_address )
-        {
-            int ret = 0;
-
-            if ( map )
-            {
-                ++mrmrr->count;
-                return 0;
-            }
-
-            if ( --mrmrr->count )
-                return 0;
-
-            while ( base_pfn < end_pfn )
-            {
-                if ( clear_identity_p2m_entry(d, base_pfn) )
-                    ret = -ENXIO;
-                base_pfn++;
-            }
-
-            list_del(&mrmrr->list);
-            xfree(mrmrr);
-            return ret;
-        }
-    }
-
-    if ( !map )
-        return -ENOENT;
-
-    while ( base_pfn < end_pfn )
-    {
-        int err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw, flag);
-
-        if ( err )
-            return err;
-        base_pfn++;
-    }
-
-    mrmrr = xmalloc(struct mapped_rmrr);
-    if ( !mrmrr )
-        return -ENOMEM;
-    mrmrr->base = rmrr->base_address;
-    mrmrr->end = rmrr->end_address;
-    mrmrr->count = 1;
-    list_add_tail(&mrmrr->list, &hd->arch.mapped_rmrrs);
-
-    return 0;
-}
-
 static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
 {
     struct acpi_rmrr_unit *rmrr;
@@ -2062,7 +1983,9 @@ static int intel_iommu_add_device(u8 dev
              * Since RMRRs are always reserved in the e820 map for the hardware
              * domain, there shouldn't be a conflict.
              */
-            ret = rmrr_identity_mapping(pdev->domain, 1, rmrr, 0);
+            ret = iommu_identity_mapping(pdev->domain, p2m_access_rw,
+                                         rmrr->base_address, rmrr->end_address,
+                                         0);
             if ( ret )
                 dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
                         pdev->domain->domain_id);
@@ -2107,7 +2030,8 @@ static int intel_iommu_remove_device(u8
          * Any flag is nothing to clear these mappings but here
          * its always safe and strict to set 0.
          */
-        rmrr_identity_mapping(pdev->domain, 0, rmrr, 0);
+        iommu_identity_mapping(pdev->domain, p2m_access_x, rmrr->base_address,
+                               rmrr->end_address, 0);
     }
 
     return domain_context_unmap(pdev->domain, devfn, pdev);
@@ -2306,7 +2230,8 @@ static void __hwdom_init setup_hwdom_rmr
          * domain, there shouldn't be a conflict. So its always safe and
          * strict to set 0.
          */
-        ret = rmrr_identity_mapping(d, 1, rmrr, 0);
+        ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address,
+                                     rmrr->end_address, 0);
         if ( ret )
             dprintk(XENLOG_ERR VTDPREFIX,
                      "IOMMU: mapping reserved region failed\n");
@@ -2465,7 +2390,9 @@ static int reassign_device_ownership(
                  * Any RMRR flag is always ignored when remove a device,
                  * but its always safe and strict to set 0.
                  */
-                ret = rmrr_identity_mapping(source, 0, rmrr, 0);
+                ret = iommu_identity_mapping(source, p2m_access_x,
+                                             rmrr->base_address,
+                                             rmrr->end_address, 0);
                 if ( ret != -ENOENT )
                     return ret;
             }
@@ -2562,7 +2489,8 @@ static int intel_iommu_assign_device(
              PCI_BUS(bdf) == bus &&
              PCI_DEVFN2(bdf) == devfn )
         {
-            ret = rmrr_identity_mapping(d, 1, rmrr, flag);
+            ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address,
+                                         rmrr->end_address, flag);
             if ( ret )
             {
                 int rc;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -127,7 +127,7 @@ int arch_iommu_domain_init(struct domain
     struct domain_iommu *hd = dom_iommu(d);
 
     spin_lock_init(&hd->arch.mapping_lock);
-    INIT_LIST_HEAD(&hd->arch.mapped_rmrrs);
+    INIT_LIST_HEAD(&hd->arch.identity_maps);
 
     return 0;
 }
@@ -136,6 +136,99 @@ void arch_iommu_domain_destroy(struct do
 {
 }
 
+struct identity_map {
+    struct list_head list;
+    paddr_t base, end;
+    p2m_access_t access;
+    unsigned int count;
+};
+
+int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
+                           paddr_t base, paddr_t end,
+                           unsigned int flag)
+{
+    unsigned long base_pfn = base >> PAGE_SHIFT_4K;
+    unsigned long end_pfn = PAGE_ALIGN_4K(end) >> PAGE_SHIFT_4K;
+    struct identity_map *map;
+    struct domain_iommu *hd = dom_iommu(d);
+
+    ASSERT(pcidevs_locked());
+    ASSERT(base < end);
+
+    /*
+     * No need to acquire hd->arch.mapping_lock: Both insertion and removal
+     * get done while holding pcidevs_lock.
+     */
+    list_for_each_entry( map, &hd->arch.identity_maps, list )
+    {
+        if ( map->base == base && map->end == end )
+        {
+            int ret = 0;
+
+            if ( p2ma != p2m_access_x )
+            {
+                if ( map->access != p2ma )
+                    return -EADDRINUSE;
+                ++map->count;
+                return 0;
+            }
+
+            if ( --map->count )
+                return 0;
+
+            while ( base_pfn < end_pfn )
+            {
+                if ( clear_identity_p2m_entry(d, base_pfn) )
+                    ret = -ENXIO;
+                base_pfn++;
+            }
+
+            list_del(&map->list);
+            xfree(map);
+
+            return ret;
+        }
+
+        if ( end >= map->base && map->end >= base )
+            return -EADDRINUSE;
+    }
+
+    if ( p2ma == p2m_access_x )
+        return -ENOENT;
+
+    while ( base_pfn < end_pfn )
+    {
+        int err = set_identity_p2m_entry(d, base_pfn, p2ma, flag);
+
+        if ( err )
+            return err;
+        base_pfn++;
+    }
+
+    map = xmalloc(struct identity_map);
+    if ( !map )
+        return -ENOMEM;
+    map->base = base;
+    map->end = end;
+    map->access = p2ma;
+    map->count = 1;
+    list_add_tail(&map->list, &hd->arch.identity_maps);
+
+    return 0;
+}
+
+void iommu_identity_map_teardown(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    struct identity_map *map, *tmp;
+
+    list_for_each_entry_safe ( map, tmp, &hd->arch.identity_maps, list )
+    {
+        list_del(&map->list);
+        xfree(map);
+    }
+}
+
 static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
                                          unsigned long pfn,
                                          unsigned long max_pfn)
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -16,6 +16,7 @@
 
 #include <xen/errno.h>
 #include <xen/list.h>
+#include <xen/mem_access.h>
 #include <xen/spinlock.h>
 #include <asm/apicdef.h>
 #include <asm/processor.h>
@@ -49,7 +50,7 @@ struct arch_iommu
     spinlock_t mapping_lock;            /* io page table lock */
     int agaw;     /* adjusted guest address width, 0 is level 2 30-bit */
     u64 iommu_bitmap;              /* bitmap of iommu(s) that the domain uses */
-    struct list_head mapped_rmrrs;
+    struct list_head identity_maps;
 
     /* amd iommu support */
     int paging_mode;
@@ -112,6 +113,11 @@ static inline void iommu_disable_x2apic(
         iommu_ops.disable_x2apic();
 }
 
+int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
+                           paddr_t base, paddr_t end,
+                           unsigned int flag);
+void iommu_identity_map_teardown(struct domain *d);
+
 extern bool untrusted_msi;
 
 int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -905,6 +905,34 @@ struct p2m_domain *p2m_get_altp2m(struct
 static inline void p2m_altp2m_check(struct vcpu *v, uint16_t idx) {}
 #endif
 
+/* p2m access to IOMMU flags */
+static inline unsigned int p2m_access_to_iommu_flags(p2m_access_t p2ma)
+{
+    switch ( p2ma )
+    {
+    case p2m_access_rw:
+    case p2m_access_rwx:
+        return IOMMUF_readable | IOMMUF_writable;
+
+    case p2m_access_r:
+    case p2m_access_rx:
+    case p2m_access_rx2rw:
+        return IOMMUF_readable;
+
+    case p2m_access_w:
+    case p2m_access_wx:
+        return IOMMUF_writable;
+
+    case p2m_access_n:
+    case p2m_access_x:
+    case p2m_access_n2rwx:
+        return 0;
+    }
+
+    ASSERT_UNREACHABLE();
+    return 0;
+}
+
 /*
  * p2m type to IOMMU flags
  */
@@ -926,9 +954,10 @@ static inline unsigned int p2m_get_iommu
         flags = IOMMUF_readable;
         break;
     case p2m_mmio_direct:
-        flags = IOMMUF_readable;
-        if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
-            flags |= IOMMUF_writable;
+        flags = p2m_access_to_iommu_flags(p2ma);
+        if ( (flags & IOMMUF_writable) &&
+             rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
+            flags &= ~IOMMUF_writable;
         break;
     default:
         flags = 0;

[-- Attachment #35: xsa378/xsa378-4.13-5.patch --]
[-- Type: application/octet-stream, Size: 7475 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-arrange/complete re-assignment handling

Prior to the assignment step having completed successfully, devices
should not get associated with their new owner. Hand the device to DomIO
(perhaps temporarily), until after the de-assignment step has completed.

De-assignment of a device (from other than Dom0) as well as failure of
reassign_device() during assignment should result in unity mappings
getting torn down. This in turn requires switching to a refcounted
mapping approach, as was already used by VT-d for its RMRRs, to prevent
unmapping a region used by multiple devices.

This is CVE-2021-28696 / part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
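
The resulting (re-)assignment flow, sketched (simplified, error paths
omitted):

    /* 1. Unless the source is the hardware domain (whose firmware may
     *    still reference the mappings), tear down its unity maps.
     * 2. Park the device on DomIO: pdev->domain = dom_io.
     * 3. Allocate the target's resources and program the DTE.
     * 4. Only after success: pdev->domain = target.
     * Since iommu_identity_mapping() refcounts identical ranges, a
     * region shared by several devices is unmapped only when the last
     * reference goes away. */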

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -430,38 +430,49 @@ int amd_iommu_flush_iotlb_all(struct dom
     return 0;
 }
 
-int amd_iommu_reserve_domain_unity_map(struct domain *domain,
-                                       paddr_t phys_addr,
-                                       unsigned long size, int iw, int ir)
+int amd_iommu_reserve_domain_unity_map(struct domain *d,
+                                       const struct ivrs_unity_map *map,
+                                       unsigned int flag)
 {
-    unsigned long npages, i;
-    unsigned long gfn;
-    unsigned int flags = !!ir;
-    unsigned int flush_flags = 0;
-    int rt = 0;
-
-    if ( iw )
-        flags |= IOMMUF_writable;
-
-    npages = region_to_pages(phys_addr, size);
-    gfn = phys_addr >> PAGE_SHIFT;
-    for ( i = 0; i < npages; i++ )
+    int rc;
+
+    if ( d == dom_io )
+        return 0;
+
+    for ( rc = 0; !rc && map; map = map->next )
     {
-        unsigned long frame = gfn + i;
+        p2m_access_t p2ma = p2m_access_n;
+
+        if ( map->read )
+            p2ma |= p2m_access_r;
+        if ( map->write )
+            p2ma |= p2m_access_w;
 
-        rt = amd_iommu_map_page(domain, _dfn(frame), _mfn(frame), flags,
-                                &flush_flags);
-        if ( rt != 0 )
-            break;
+        rc = iommu_identity_mapping(d, p2ma, map->addr,
+                                    map->addr + map->length - 1, flag);
     }
 
-    /* Use while-break to avoid compiler warning */
-    while ( flush_flags &&
-            amd_iommu_flush_iotlb_pages(domain, _dfn(gfn),
-                                        npages, flush_flags) )
-        break;
+    return rc;
+}
+
+int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
+                                         const struct ivrs_unity_map *map)
+{
+    int rc;
+
+    if ( d == dom_io )
+        return 0;
+
+    for ( rc = 0; map; map = map->next )
+    {
+        int ret = iommu_identity_mapping(d, p2m_access_x, map->addr,
+                                         map->addr + map->length - 1, 0);
+
+        if ( ret && ret != -ENOENT && !rc )
+            rc = ret;
+    }
 
-    return rt;
+    return rc;
 }
 
 int __init amd_iommu_quarantine_init(struct domain *d)
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -330,6 +330,7 @@ static int reassign_device(struct domain
     struct amd_iommu *iommu;
     int bdf, rc;
     struct domain_iommu *t = dom_iommu(target);
+    const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
 
     bdf = PCI_BDF2(pdev->bus, pdev->devfn);
     iommu = find_iommu_for_device(pdev->seg, bdf);
@@ -344,10 +345,24 @@ static int reassign_device(struct domain
 
     amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
 
-    if ( devfn == pdev->devfn )
+    /*
+     * If the device belongs to the hardware domain, and it has a unity mapping,
+     * don't remove it from the hardware domain, because BIOS may reference that
+     * mapping.
+     */
+    if ( !is_hardware_domain(source) )
+    {
+        rc = amd_iommu_reserve_domain_unity_unmap(
+                 source,
+                 ivrs_mappings[get_dma_requestor_id(pdev->seg, bdf)].unity_map);
+        if ( rc )
+            return rc;
+    }
+
+    if ( devfn == pdev->devfn && pdev->domain != dom_io )
     {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
+        list_move(&pdev->domain_list, &dom_io->pdev_list);
+        pdev->domain = dom_io;
     }
 
     rc = allocate_domain_resources(t);
@@ -359,6 +374,12 @@ static int reassign_device(struct domain
                     pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                     source->domain_id, target->domain_id);
 
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
+
     return 0;
 }
 
@@ -369,20 +390,28 @@ static int amd_iommu_assign_device(struc
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int bdf = PCI_BDF2(pdev->bus, devfn);
     int req_id = get_dma_requestor_id(pdev->seg, bdf);
-    const struct ivrs_unity_map *unity_map;
+    int rc = amd_iommu_reserve_domain_unity_map(
+                 d, ivrs_mappings[req_id].unity_map, flag);
+
+    if ( !rc )
+        rc = reassign_device(pdev->domain, d, devfn, pdev);
 
-    for ( unity_map = ivrs_mappings[req_id].unity_map; unity_map;
-          unity_map = unity_map->next )
+    if ( rc && !is_hardware_domain(d) )
     {
-        int rc = amd_iommu_reserve_domain_unity_map(
-                     d, unity_map->addr, unity_map->length,
-                     unity_map->write, unity_map->read);
+        int ret = amd_iommu_reserve_domain_unity_unmap(
+                      d, ivrs_mappings[req_id].unity_map);
 
-        if ( rc )
-            return rc;
+        if ( ret )
+        {
+            printk(XENLOG_ERR "AMD-Vi: "
+                   "unity-unmap for %pd/%04x:%02x:%02x.%u failed (%d)\n",
+                   d, pdev->seg, pdev->bus,
+                   PCI_SLOT(devfn), PCI_FUNC(devfn), ret);
+            domain_crash(d);
+        }
     }
 
-    return reassign_device(pdev->domain, d, devfn, pdev);
+    return rc;
 }
 
 static void deallocate_next_page_table(struct page_info *pg, int level)
@@ -441,6 +470,7 @@ static void deallocate_iommu_page_tables
 
 static void amd_iommu_domain_destroy(struct domain *d)
 {
+    iommu_identity_map_teardown(d);
     deallocate_iommu_page_tables(d);
     amd_iommu_flush_all_pages(d);
 }
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -64,8 +64,10 @@ int __must_check amd_iommu_unmap_page(st
                                       unsigned int *flush_flags);
 int __must_check amd_iommu_alloc_root(struct domain_iommu *hd);
 int amd_iommu_reserve_domain_unity_map(struct domain *domain,
-                                       paddr_t phys_addr, unsigned long size,
-                                       int iw, int ir);
+                                       const struct ivrs_unity_map *map,
+                                       unsigned int flag);
+int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
+                                         const struct ivrs_unity_map *map);
 int __must_check amd_iommu_flush_iotlb_pages(struct domain *d, dfn_t dfn,
                                              unsigned int page_count,
                                              unsigned int flush_flags);

[-- Attachment #36: xsa378/xsa378-4.13-6.patch --]
[-- Type: application/octet-stream, Size: 15596 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-arrange exclusion range and unity map recording

The spec makes no provisions for OS behavior here to depend on the
amount of RAM found on the system. While the spec may not sufficiently
clearly distinguish both kinds of regions, they are surely meant to be
separate things: Only regions with ACPI_IVMD_EXCLUSION_RANGE set should
be candidates for putting in the exclusion range registers. (As there's
only a single such pair of registers per IOMMU, secondary non-adjacent
regions with the flag set already get converted to unity mapped
regions.)

First of all, drop the dependency on max_page. With commit b4f042236ae0
("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables") the
use of it here was stale anyway; it was bogus already before, as it
didn't account for max_page getting increased later on. Simply try an
exclusion range registration first, and if it fails (for being
unsuitable or non-mergeable), register a unity mapping range.

With this various local variables become unnecessary and hence get
dropped at the same time.

With the max_page boundary dropped for using unity maps, the minimum
page table tree height now needs both recording and enforcing in
amd_iommu_domain_init(). Since we can't predict which devices may get
assigned to a domain, our only option is to uniformly force at least
that height for all domains, now that the height isn't dynamic anymore.

Further don't make use of the exclusion range unless ACPI data says so.

Note that exclusion range registration in
register_range_for_all_devices() is done on a best-effort basis. Hence
the unity map entries which get registered as well are redundant when
the former succeeds, but they do no harm. Improvements in this area can
be done later imo.

Also adjust types where suitable without touching extra lines.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
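
The registration strategy then becomes, in sketch form (length
computed as in the patch; "limit" is the inclusive, page-aligned top):

    int ret = reserve_iommu_exclusion_range(iommu, base, limit,
                                            true /* all */);

    if ( ret )  /* unsuitable or non-mergeable: fall back to unity */
        ret = reserve_unity_map_for_device(seg, bdf, base,
                                           limit + PAGE_SIZE - base,
                                           iw, ir);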

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -118,12 +118,8 @@ static struct amd_iommu * __init find_io
 }
 
 static int __init reserve_iommu_exclusion_range(
-    struct amd_iommu *iommu, uint64_t base, uint64_t limit,
-    bool all, bool iw, bool ir)
+    struct amd_iommu *iommu, paddr_t base, paddr_t limit, bool all)
 {
-    if ( !ir || !iw )
-        return -EPERM;
-
     /* need to extend exclusion range? */
     if ( iommu->exclusion_enable )
     {
@@ -152,14 +148,18 @@ static int __init reserve_unity_map_for_
 {
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
     struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map;
+    int paging_mode = amd_iommu_get_paging_mode(PFN_UP(base + length));
+
+    if ( paging_mode < 0 )
+        return paging_mode;
 
     /* Check for overlaps. */
     for ( ; unity_map; unity_map = unity_map->next )
     {
         /*
          * Exact matches are okay. This can in particular happen when
-         * register_exclusion_range_for_device() calls here twice for the
-         * same (s,b,d,f).
+         * register_range_for_device() calls here twice for the same
+         * (s,b,d,f).
          */
         if ( base == unity_map->addr && length == unity_map->length &&
              ir == unity_map->read && iw == unity_map->write )
@@ -187,55 +187,52 @@ static int __init reserve_unity_map_for_
     unity_map->next = ivrs_mappings[bdf].unity_map;
     ivrs_mappings[bdf].unity_map = unity_map;
 
+    if ( paging_mode > amd_iommu_min_paging_mode )
+        amd_iommu_min_paging_mode = paging_mode;
+
     return 0;
 }
 
-static int __init register_exclusion_range_for_all_devices(
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_all_devices(
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
-    unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
-    unsigned int bdf;
     int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
-        /* reserve r/w unity-mapped page entries for devices */
-        /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
-            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
-    }
-    /* register IOMMU exclusion range settings */
-    if ( !rc && limit >= iommu_top )
+    if ( exclusion )
     {
         for_each_amd_iommu( iommu )
         {
-            rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                               true /* all */, iw, ir);
-            if ( rc )
-                break;
+            int ret = reserve_iommu_exclusion_range(iommu, base, limit,
+                                                    true /* all */);
+
+            if ( ret && !rc )
+                rc = ret;
         }
     }
 
+    if ( !exclusion || rc )
+    {
+        paddr_t length = limit + PAGE_SIZE - base;
+        unsigned int bdf;
+
+        /* reserve r/w unity-mapped page entries for devices */
+        for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
+    }
+
     return rc;
 }
 
-static int __init register_exclusion_range_for_device(
-    u16 bdf, unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_device(
+    unsigned int bdf, paddr_t base, paddr_t limit,
+    bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
-    unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     u16 req;
     int rc = 0;
@@ -249,27 +246,19 @@ static int __init register_exclusion_ran
     req = ivrs_mappings[bdf].dte_requestor_id;
 
     /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
+    if ( exclusion )
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           false /* all */);
+    if ( !exclusion || rc )
+    {
+        paddr_t length = limit + PAGE_SIZE - base;
+
         /* reserve unity-mapped page entries for device */
-        /* note: these entries are part of the exclusion range */
         rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?:
              reserve_unity_map_for_device(seg, req, base, length, iw, ir);
-
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
     }
-
-    /* register IOMMU exclusion range settings for device */
-    if ( !rc && limit >= iommu_top  )
+    else
     {
-        rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                           false /* all */, iw, ir);
         ivrs_mappings[bdf].dte_allow_exclusion = true;
         ivrs_mappings[req].dte_allow_exclusion = true;
     }
@@ -277,53 +266,42 @@ static int __init register_exclusion_ran
     return rc;
 }
 
-static int __init register_exclusion_range_for_iommu_devices(
-    struct amd_iommu *iommu,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_iommu_devices(
+    struct amd_iommu *iommu, paddr_t base, paddr_t limit,
+    bool iw, bool ir, bool exclusion)
 {
-    unsigned long range_top, iommu_top, length;
+    /* note: 'limit' parameter is assumed to be page-aligned */
+    paddr_t length = limit + PAGE_SIZE - base;
     unsigned int bdf;
     u16 req;
-    int rc = 0;
+    int rc;
 
-    /* is part of exclusion range inside of IOMMU virtual address space? */
-    /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
-        /* reserve r/w unity-mapped page entries for devices */
-        /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
-        {
-            if ( iommu == find_iommu_for_device(iommu->seg, bdf) )
-            {
-                req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
-                rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
-                                                  iw, ir) ?:
-                     reserve_unity_map_for_device(iommu->seg, req, base, length,
-                                                  iw, ir);
-            }
-        }
-
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
+    if ( exclusion )
+    {
+        rc = reserve_iommu_exclusion_range(iommu, base, limit, true /* all */);
+        if ( !rc )
+            return 0;
     }
 
-    /* register IOMMU exclusion range settings */
-    if ( !rc && limit >= iommu_top )
-        rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                           true /* all */, iw, ir);
+    /* reserve unity-mapped page entries for devices */
+    for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+    {
+        if ( iommu != find_iommu_for_device(iommu->seg, bdf) )
+            continue;
+
+        req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
+        rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
+                                          iw, ir) ?:
+             reserve_unity_map_for_device(iommu->seg, req, base, length,
+                                          iw, ir);
+    }
 
     return rc;
 }
 
 static int __init parse_ivmd_device_select(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     u16 bdf;
 
@@ -334,12 +312,12 @@ static int __init parse_ivmd_device_sele
         return -ENODEV;
     }
 
-    return register_exclusion_range_for_device(bdf, base, limit, iw, ir);
+    return register_range_for_device(bdf, base, limit, iw, ir, exclusion);
 }
 
 static int __init parse_ivmd_device_range(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     unsigned int first_bdf, last_bdf, bdf;
     int error;
@@ -361,15 +339,15 @@ static int __init parse_ivmd_device_rang
     }
 
     for ( bdf = first_bdf, error = 0; (bdf <= last_bdf) && !error; bdf++ )
-        error = register_exclusion_range_for_device(
-            bdf, base, limit, iw, ir);
+        error = register_range_for_device(
+            bdf, base, limit, iw, ir, exclusion);
 
     return error;
 }
 
 static int __init parse_ivmd_device_iommu(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
     struct amd_iommu *iommu;
@@ -384,14 +362,14 @@ static int __init parse_ivmd_device_iomm
         return -ENODEV;
     }
 
-    return register_exclusion_range_for_iommu_devices(
-        iommu, base, limit, iw, ir);
+    return register_range_for_iommu_devices(
+        iommu, base, limit, iw, ir, exclusion);
 }
 
 static int __init parse_ivmd_block(const struct acpi_ivrs_memory *ivmd_block)
 {
     unsigned long start_addr, mem_length, base, limit;
-    u8 iw, ir;
+    bool iw = true, ir = true, exclusion = false;
 
     if ( ivmd_block->header.length < sizeof(*ivmd_block) )
     {
@@ -408,13 +386,11 @@ static int __init parse_ivmd_block(const
                     ivmd_block->header.type, start_addr, mem_length);
 
     if ( ivmd_block->header.flags & ACPI_IVMD_EXCLUSION_RANGE )
-        iw = ir = IOMMU_CONTROL_ENABLED;
+        exclusion = true;
     else if ( ivmd_block->header.flags & ACPI_IVMD_UNITY )
     {
-        iw = ivmd_block->header.flags & ACPI_IVMD_READ ?
-            IOMMU_CONTROL_ENABLED : IOMMU_CONTROL_DISABLED;
-        ir = ivmd_block->header.flags & ACPI_IVMD_WRITE ?
-            IOMMU_CONTROL_ENABLED : IOMMU_CONTROL_DISABLED;
+        iw = ivmd_block->header.flags & ACPI_IVMD_READ;
+        ir = ivmd_block->header.flags & ACPI_IVMD_WRITE;
     }
     else
     {
@@ -425,20 +401,20 @@ static int __init parse_ivmd_block(const
     switch( ivmd_block->header.type )
     {
     case ACPI_IVRS_TYPE_MEMORY_ALL:
-        return register_exclusion_range_for_all_devices(
-            base, limit, iw, ir);
+        return register_range_for_all_devices(
+            base, limit, iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_ONE:
-        return parse_ivmd_device_select(ivmd_block,
-                                        base, limit, iw, ir);
+        return parse_ivmd_device_select(ivmd_block, base, limit,
+                                        iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_RANGE:
-        return parse_ivmd_device_range(ivmd_block,
-                                       base, limit, iw, ir);
+        return parse_ivmd_device_range(ivmd_block, base, limit,
+                                       iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_IOMMU:
-        return parse_ivmd_device_iommu(ivmd_block,
-                                       base, limit, iw, ir);
+        return parse_ivmd_device_iommu(ivmd_block, base, limit,
+                                       iw, ir, exclusion);
 
     default:
         AMD_IOMMU_DEBUG("IVMD Error: Invalid Block Type!\n");
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -234,6 +234,8 @@ static int __must_check allocate_domain_
     return rc;
 }
 
+int __read_mostly amd_iommu_min_paging_mode = 1;
+
 static int amd_iommu_domain_init(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
@@ -245,11 +247,13 @@ static int amd_iommu_domain_init(struct
      * - HVM could in principle use 3 or 4 depending on how much guest
      *   physical address space we give it, but this isn't known yet so use 4
      *   unilaterally.
+     * - Unity maps may require an even higher number.
      */
-    hd->arch.paging_mode = amd_iommu_get_paging_mode(
-        is_hvm_domain(d)
-        ? 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
-        : get_upper_mfn_bound() + 1);
+    hd->arch.paging_mode = max(amd_iommu_get_paging_mode(
+            is_hvm_domain(d)
+            ? 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
+            : get_upper_mfn_bound() + 1),
+        amd_iommu_min_paging_mode);
 
     return 0;
 }
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -140,6 +140,8 @@ extern struct hpet_sbdf {
     } init;
 } hpet_sbdf;
 
+extern int amd_iommu_min_paging_mode;
+
 extern void *shared_intremap_table;
 extern unsigned long *shared_intremap_inuse;
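
The control flow this patch introduces - try the IOMMU's single exclusion
range register pair first, and fall back to per-device unity maps when that
fails - can be shown with a minimal standalone sketch. This is plain C, not
Xen code; reserve_exclusion() and reserve_unity() are hypothetical stand-ins
for the real routines:

    #include <stdio.h>
    #include <errno.h>

    #define PAGE_SIZE 0x1000UL

    static int reserve_exclusion(unsigned long base, unsigned long limit,
                                 int mergeable)
    {
        return mergeable ? 0 : -EBUSY; /* hardware has just one range pair */
    }

    static int reserve_unity(unsigned int bdf, unsigned long base,
                             unsigned long length)
    {
        printf("unity map for bdf %#x: [%#lx, +%#lx)\n", bdf, base, length);
        return 0;
    }

    /* Mirrors register_range_for_all_devices(): exclusion first, unity
     * second.  'limit' is the page-aligned base of the last page, as in
     * the patch. */
    static int register_range(unsigned long base, unsigned long limit,
                              int exclusion, int mergeable)
    {
        int rc = 0;

        if ( exclusion )
            rc = reserve_exclusion(base, limit, mergeable);

        if ( !exclusion || rc ) /* fall back to (or directly use) unity maps */
        {
            unsigned long length = limit + PAGE_SIZE - base;
            unsigned int bdf;

            for ( bdf = rc = 0; !rc && bdf < 4; bdf++ )
                rc = reserve_unity(bdf, base, length);
        }

        return rc;
    }

    int main(void)
    {
        register_range(0x10000, 0x1f000, 1, 1); /* exclusion ok: no unity */
        register_range(0x10000, 0x1f000, 1, 0); /* busy: unity fallback */
        return 0;
    }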
 

[-- Attachment #37: xsa378/xsa378-4.13-7.patch --]
[-- Type: application/octet-stream, Size: 3355 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: introduce p2m_is_special()

Seeing the similarity of grant, foreign, and (subsequently) direct-MMIO
handling, introduce a new P2M type group named "special" (as in "needing
special accessors to create/destroy").

Also use -EPERM instead of other error codes on the two domain_crash()
paths touched.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -807,7 +807,7 @@ p2m_remove_page(struct p2m_domain *p2m,
         for ( i = 0; i < (1UL << page_order); i++ )
         {
             p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, NULL, NULL);
-            if ( !p2m_is_grant(t) && !p2m_is_shared(t) && !p2m_is_foreign(t) )
+            if ( !p2m_is_special(t) && !p2m_is_shared(t) )
                 set_gpfn_from_mfn(mfn+i, INVALID_M2P_ENTRY);
         }
     }
@@ -934,13 +934,13 @@ guest_physmap_add_entry(struct domain *d
                                   &ot, &a, 0, NULL, NULL);
             ASSERT(!p2m_is_shared(ot));
         }
-        if ( p2m_is_grant(ot) || p2m_is_foreign(ot) )
+        if ( p2m_is_special(ot) )
         {
-            /* Really shouldn't be unmapping grant/foreign maps this way */
+            /* Don't permit unmapping grant/foreign this way. */
             domain_crash(d);
             p2m_unlock(p2m);
             
-            return -EINVAL;
+            return -EPERM;
         }
         else if ( p2m_is_ram(ot) && !p2m_is_paged(ot) )
         {
@@ -1034,8 +1034,7 @@ int p2m_change_type_one(struct domain *d
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
     int rc;
 
-    BUG_ON(p2m_is_grant(ot) || p2m_is_grant(nt));
-    BUG_ON(p2m_is_foreign(ot) || p2m_is_foreign(nt));
+    BUG_ON(p2m_is_special(ot) || p2m_is_special(nt));
 
     gfn_lock(p2m, gfn, 0);
 
@@ -1282,11 +1281,11 @@ static int set_typed_p2m_entry(struct do
         gfn_unlock(p2m, gfn, order);
         return cur_order + 1;
     }
-    if ( p2m_is_grant(ot) || p2m_is_foreign(ot) )
+    if ( p2m_is_special(ot) )
     {
         gfn_unlock(p2m, gfn, order);
         domain_crash(d);
-        return -ENOENT;
+        return -EPERM;
     }
     else if ( p2m_is_ram(ot) )
     {
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -141,6 +141,10 @@ typedef unsigned int p2m_query_t;
                             | p2m_to_mask(p2m_ram_logdirty) )
 #define P2M_SHARED_TYPES   (p2m_to_mask(p2m_ram_shared))
 
+/* Types established/cleaned up via special accessors. */
+#define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \
+                           p2m_to_mask(p2m_map_foreign))
+
 /* Valid types not necessarily associated with a (valid) MFN. */
 #define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES                  \
                                | p2m_to_mask(p2m_mmio_direct) \
@@ -169,6 +173,7 @@ typedef unsigned int p2m_query_t;
 #define p2m_is_paged(_t)    (p2m_to_mask(_t) & P2M_PAGED_TYPES)
 #define p2m_is_sharable(_t) (p2m_to_mask(_t) & P2M_SHARABLE_TYPES)
 #define p2m_is_shared(_t)   (p2m_to_mask(_t) & P2M_SHARED_TYPES)
+#define p2m_is_special(_t)  (p2m_to_mask(_t) & P2M_SPECIAL_TYPES)
 #define p2m_is_broken(_t)   (p2m_to_mask(_t) & P2M_BROKEN_TYPES)
 #define p2m_is_foreign(_t)  (p2m_to_mask(_t) & p2m_to_mask(p2m_map_foreign))
 

[-- Attachment #38: xsa378/xsa378-4.13-8.patch --]
[-- Type: application/octet-stream, Size: 5916 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: guard (in particular) identity mapping entries

Such entries, created by set_identity_p2m_entry(), should only be
destroyed by clear_identity_p2m_entry(). However, similarly, entries
created by set_mmio_p2m_entry() should only be torn down by
clear_mmio_p2m_entry(), so the logic gets based upon p2m_mmio_direct as
the entry type (separation between "ordinary" and 1:1 mappings would
require a further indicator to tell apart the two).

As to the guest_remove_page() change, commit 48dfb297a20a ("x86/PVH:
allow guest_remove_page to remove p2m_mmio_direct pages"), which
introduced the call to clear_mmio_p2m_entry(), claimed this was done for
hwdom only without this actually having been the case. However, this
code shouldn't be there in the first place, as MMIO entries shouldn't be
dropped this way. Avoid triggering the warning again that 48dfb297a20a
silenced by an adjustment to xenmem_add_to_physmap_one() instead.

Note that guest_physmap_mark_populate_on_demand() gets tightened beyond
the immediate purpose of this change.

Note also that I didn't inspect code which isn't security supported,
e.g. sharing, paging, or altp2m.

This is CVE-2021-28694 / part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4770,7 +4770,9 @@ int xenmem_add_to_physmap_one(
 
     /* Remove previously mapped page if it was present. */
     prev_mfn = get_gfn(d, gfn_x(gpfn), &p2mt);
-    if ( mfn_valid(prev_mfn) )
+    if ( p2mt == p2m_mmio_direct )
+        rc = -EPERM;
+    else if ( mfn_valid(prev_mfn) )
     {
         if ( is_xen_heap_mfn(prev_mfn) )
             /* Xen heap frames are simply unhooked from this phys slot. */
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -796,7 +796,8 @@ p2m_remove_page(struct p2m_domain *p2m,
                                           &cur_order, NULL);
 
         if ( p2m_is_valid(t) &&
-             (!mfn_valid(_mfn(mfn)) || mfn + i != mfn_x(mfn_return)) )
+             (!mfn_valid(_mfn(mfn)) || t == p2m_mmio_direct ||
+              mfn + i != mfn_x(mfn_return)) )
             return -EILSEQ;
 
         i += (1UL << cur_order) - ((gfn_l + i) & ((1UL << cur_order) - 1));
@@ -890,7 +891,7 @@ guest_physmap_add_entry(struct domain *d
     if ( p2m_is_foreign(t) )
         return -EINVAL;
 
-    if ( !mfn_valid(mfn) )
+    if ( !mfn_valid(mfn) || t == p2m_mmio_direct )
     {
         ASSERT_UNREACHABLE();
         return -EINVAL;
@@ -936,7 +937,7 @@ guest_physmap_add_entry(struct domain *d
         }
         if ( p2m_is_special(ot) )
         {
-            /* Don't permit unmapping grant/foreign this way. */
+            /* Don't permit unmapping grant/foreign/direct-MMIO this way. */
             domain_crash(d);
             p2m_unlock(p2m);
             
@@ -1385,8 +1386,8 @@ int set_identity_p2m_entry(struct domain
  *    order+1  for caller to retry with order (guaranteed smaller than
  *             the order value passed in)
  */
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn_l, mfn_t mfn,
-                         unsigned int order)
+static int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn_l,
+                                mfn_t mfn, unsigned int order)
 {
     int rc = -EINVAL;
     gfn_t gfn = _gfn(gfn_l);
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -1295,17 +1295,17 @@ guest_physmap_mark_populate_on_demand(st
 
         p2m->get_entry(p2m, gfn_add(gfn, i), &ot, &a, 0, &cur_order, NULL);
         n = 1UL << min(order, cur_order);
-        if ( p2m_is_ram(ot) )
+        if ( ot == p2m_populate_on_demand )
+        {
+            /* Count how many PoD entries we'll be replacing if successful */
+            pod_count += n;
+        }
+        else if ( ot != p2m_invalid && ot != p2m_mmio_dm )
         {
             P2M_DEBUG("gfn_to_mfn returned type %d!\n", ot);
             rc = -EBUSY;
             goto out;
         }
-        else if ( ot == p2m_populate_on_demand )
-        {
-            /* Count how man PoD entries we'll be replacing if successful */
-            pod_count += n;
-        }
     }
 
     /* Now, actually do the two-way mapping */
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -328,7 +328,7 @@ int guest_remove_page(struct domain *d,
     }
     if ( p2mt == p2m_mmio_direct )
     {
-        rc = clear_mmio_p2m_entry(d, gmfn, mfn, PAGE_ORDER_4K);
+        rc = -EPERM;
         goto out_put_gfn;
     }
 #else
@@ -1720,6 +1720,15 @@ int check_get_page_from_gfn(struct domai
         return -EAGAIN;
     }
 #endif
+#ifdef CONFIG_X86
+    if ( p2mt == p2m_mmio_direct )
+    {
+        if ( page )
+            put_page(page);
+
+        return -EPERM;
+    }
+#endif
 
     if ( !page )
         return -EINVAL;
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -143,7 +143,8 @@ typedef unsigned int p2m_query_t;
 
 /* Types established/cleaned up via special accessors. */
 #define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \
-                           p2m_to_mask(p2m_map_foreign))
+                           p2m_to_mask(p2m_map_foreign) | \
+                           p2m_to_mask(p2m_mmio_direct))
 
 /* Valid types not necessarily associated with a (valid) MFN. */
 #define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES                  \
@@ -649,8 +650,6 @@ int set_foreign_p2m_entry(struct domain
 /* Set mmio addresses in the p2m table (for pass-through) */
 int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
                        unsigned int order, p2m_access_t access);
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
-                         unsigned int order);
 
 /* Set identity addresses in the p2m table (for pass-through) */
 int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
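
The guard pattern this patch installs - generic unmap/replace paths refuse
to touch entries that only their dedicated accessors may remove - in
standalone form.  Again plain C, not Xen code; generic_replace() is a
hypothetical stand-in for paths like guest_physmap_add_entry():

    #include <stdio.h>
    #include <errno.h>

    enum entry_type { TYPE_RAM, TYPE_MMIO_DIRECT, TYPE_FOREIGN };

    static int is_special(enum entry_type t)
    {
        return t == TYPE_MMIO_DIRECT || t == TYPE_FOREIGN;
    }

    static int generic_replace(enum entry_type old)
    {
        if ( is_special(old) )
            return -EPERM; /* only the dedicated clear_*() accessor may */
        return 0;          /* ordinary entries may be replaced in place */
    }

    int main(void)
    {
        printf("ram:  %d\n", generic_replace(TYPE_RAM));         /* 0 */
        printf("mmio: %d\n", generic_replace(TYPE_MMIO_DIRECT)); /* -EPERM */
        return 0;
    }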

[-- Attachment #39: xsa378/xsa378-4.14-1.patch --]
[-- Type: application/octet-stream, Size: 5080 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: correct global exclusion range extending

Besides unity mapping regions, the AMD IOMMU spec also provides for
exclusion ranges (areas of memory not to be subject to DMA translation)
to be specified by firmware in the ACPI tables. The spec does not put
any constraints on the number of such regions.

Blindly assuming all addresses between any two such ranges should also
be excluded can't be right. Since hardware has room for just a single
such range (comprised of the Exclusion Base Register and the Exclusion
Range Limit Register), combine only adjacent or overlapping regions (for
now; this may require further adjustment in case table entries aren't
sorted by address) with matching exclusion_allow_all settings. This
requires bubbling up error indicators, such that IOMMU init can be
failed when concatenation wasn't possible.

Furthermore, since the exclusion range specified in IOMMU registers
implies R/W access, reject requests asking for less permissions (this
will be brought closer to the spec by a subsequent change).

This is part of XSA-378 / CVE-2021-28695.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -117,12 +117,21 @@ static struct amd_iommu * __init find_io
     return NULL;
 }
 
-static void __init reserve_iommu_exclusion_range(
-    struct amd_iommu *iommu, uint64_t base, uint64_t limit)
+static int __init reserve_iommu_exclusion_range(
+    struct amd_iommu *iommu, uint64_t base, uint64_t limit,
+    bool all, bool iw, bool ir)
 {
+    if ( !ir || !iw )
+        return -EPERM;
+
     /* need to extend exclusion range? */
     if ( iommu->exclusion_enable )
     {
+        if ( iommu->exclusion_limit + PAGE_SIZE < base ||
+             limit + PAGE_SIZE < iommu->exclusion_base ||
+             iommu->exclusion_allow_all != all )
+            return -EBUSY;
+
         if ( iommu->exclusion_base < base )
             base = iommu->exclusion_base;
         if ( iommu->exclusion_limit > limit )
@@ -130,16 +139,11 @@ static void __init reserve_iommu_exclusi
     }
 
     iommu->exclusion_enable = IOMMU_CONTROL_ENABLED;
+    iommu->exclusion_allow_all = all;
     iommu->exclusion_base = base;
     iommu->exclusion_limit = limit;
-}
 
-static void __init reserve_iommu_exclusion_range_all(
-    struct amd_iommu *iommu,
-    unsigned long base, unsigned long limit)
-{
-    reserve_iommu_exclusion_range(iommu, base, limit);
-    iommu->exclusion_allow_all = IOMMU_CONTROL_ENABLED;
+    return 0;
 }
 
 static void __init reserve_unity_map_for_device(
@@ -177,6 +181,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     unsigned int bdf;
+    int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
@@ -198,10 +203,15 @@ static int __init register_exclusion_ran
     if ( limit >= iommu_top )
     {
         for_each_amd_iommu( iommu )
-            reserve_iommu_exclusion_range_all(iommu, base, limit);
+        {
+            rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                               true /* all */, iw, ir);
+            if ( rc )
+                break;
+        }
     }
 
-    return 0;
+    return rc;
 }
 
 static int __init register_exclusion_range_for_device(
@@ -212,6 +222,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     u16 req;
+    int rc = 0;
 
     iommu = find_iommu_for_device(seg, bdf);
     if ( !iommu )
@@ -241,12 +252,13 @@ static int __init register_exclusion_ran
     /* register IOMMU exclusion range settings for device */
     if ( limit >= iommu_top  )
     {
-        reserve_iommu_exclusion_range(iommu, base, limit);
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           false /* all */, iw, ir);
         ivrs_mappings[bdf].dte_allow_exclusion = true;
         ivrs_mappings[req].dte_allow_exclusion = true;
     }
 
-    return 0;
+    return rc;
 }
 
 static int __init register_exclusion_range_for_iommu_devices(
@@ -256,6 +268,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     unsigned int bdf;
     u16 req;
+    int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
@@ -286,8 +299,10 @@ static int __init register_exclusion_ran
 
     /* register IOMMU exclusion range settings */
     if ( limit >= iommu_top )
-        reserve_iommu_exclusion_range_all(iommu, base, limit);
-    return 0;
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           true /* all */, iw, ir);
+
+    return rc;
 }
 
 static int __init parse_ivmd_device_select(
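
The adjacency rule above - ranges may only be combined if they touch or
overlap, since hardware offers a single base/limit register pair - in a
standalone sketch (simplified model, not Xen code):

    #include <stdio.h>
    #include <errno.h>

    #define PAGE_SIZE 0x1000UL

    struct excl {
        int enabled;
        unsigned long base, limit; /* page-aligned; limit = last page's base */
    };

    static int reserve_exclusion(struct excl *e, unsigned long base,
                                 unsigned long limit)
    {
        if ( e->enabled )
        {
            /* Reject non-adjacent, non-overlapping ranges. */
            if ( e->limit + PAGE_SIZE < base || limit + PAGE_SIZE < e->base )
                return -EBUSY;

            if ( e->base < base )
                base = e->base;
            if ( e->limit > limit )
                limit = e->limit;
        }

        e->enabled = 1;
        e->base = base;
        e->limit = limit;
        return 0;
    }

    int main(void)
    {
        struct excl e = { 1, 0x10000, 0x1f000 };

        printf("adjacent: %d -> [%#lx,%#lx]\n",
               reserve_exclusion(&e, 0x20000, 0x2f000), e.base, e.limit);
        printf("disjoint: %d\n", reserve_exclusion(&e, 0x80000, 0x8f000));
        return 0;
    }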

[-- Attachment #40: xsa378/xsa378-4.14-2.patch --]
[-- Type: application/octet-stream, Size: 8506 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: correct device unity map handling

Blindly assuming all addresses between any two such ranges, specified by
firmware in the ACPI tables, should also be unity-mapped can't be right.
Nor can it be correct to merge ranges with differing permissions. Track
ranges individually; don't merge at all, but check for overlaps instead.
This requires bubbling up error indicators, such that IOMMU init can be
failed when allocation of a new tracking struct wasn't possible, or an
overlap was detected.

At this occasion also stop ignoring
amd_iommu_reserve_domain_unity_map()'s return value.

This is part of XSA-378 / CVE-2021-28695.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -107,20 +107,24 @@ struct amd_iommu {
     struct list_head ats_devices;
 };
 
+struct ivrs_unity_map {
+    bool read:1;
+    bool write:1;
+    paddr_t addr;
+    unsigned long length;
+    struct ivrs_unity_map *next;
+};
+
 struct ivrs_mappings {
     uint16_t dte_requestor_id;
     bool valid:1;
     bool dte_allow_exclusion:1;
-    bool unity_map_enable:1;
-    bool write_permission:1;
-    bool read_permission:1;
 
     /* ivhd device data settings */
     uint8_t device_flags;
 
-    unsigned long addr_range_start;
-    unsigned long addr_range_length;
     struct amd_iommu *iommu;
+    struct ivrs_unity_map *unity_map;
 
     /* per device interrupt remapping table */
     void *intremap_table;
--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -146,32 +146,48 @@ static int __init reserve_iommu_exclusio
     return 0;
 }
 
-static void __init reserve_unity_map_for_device(
-    u16 seg, u16 bdf, unsigned long base,
-    unsigned long length, u8 iw, u8 ir)
+static int __init reserve_unity_map_for_device(
+    uint16_t seg, uint16_t bdf, unsigned long base,
+    unsigned long length, bool iw, bool ir)
 {
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
-    unsigned long old_top, new_top;
+    struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map;
 
-    /* need to extend unity-mapped range? */
-    if ( ivrs_mappings[bdf].unity_map_enable )
+    /* Check for overlaps. */
+    for ( ; unity_map; unity_map = unity_map->next )
     {
-        old_top = ivrs_mappings[bdf].addr_range_start +
-            ivrs_mappings[bdf].addr_range_length;
-        new_top = base + length;
-        if ( old_top > new_top )
-            new_top = old_top;
-        if ( ivrs_mappings[bdf].addr_range_start < base )
-            base = ivrs_mappings[bdf].addr_range_start;
-        length = new_top - base;
-    }
-
-    /* extend r/w permissioms and keep aggregate */
-    ivrs_mappings[bdf].write_permission = iw;
-    ivrs_mappings[bdf].read_permission = ir;
-    ivrs_mappings[bdf].unity_map_enable = true;
-    ivrs_mappings[bdf].addr_range_start = base;
-    ivrs_mappings[bdf].addr_range_length = length;
+        /*
+         * Exact matches are okay. This can in particular happen when
+         * register_exclusion_range_for_device() calls here twice for the
+         * same (s,b,d,f).
+         */
+        if ( base == unity_map->addr && length == unity_map->length &&
+             ir == unity_map->read && iw == unity_map->write )
+            return 0;
+
+        if ( unity_map->addr + unity_map->length > base &&
+             base + length > unity_map->addr )
+        {
+            AMD_IOMMU_DEBUG("IVMD Error: overlap [%lx,%lx) vs [%lx,%lx)\n",
+                            base, base + length, unity_map->addr,
+                            unity_map->addr + unity_map->length);
+            return -EPERM;
+        }
+    }
+
+    /* Populate and insert a new unity map. */
+    unity_map = xmalloc(struct ivrs_unity_map);
+    if ( !unity_map )
+        return -ENOMEM;
+
+    unity_map->read = ir;
+    unity_map->write = iw;
+    unity_map->addr = base;
+    unity_map->length = length;
+    unity_map->next = ivrs_mappings[bdf].unity_map;
+    ivrs_mappings[bdf].unity_map = unity_map;
+
+    return 0;
 }
 
 static int __init register_exclusion_range_for_all_devices(
@@ -194,13 +210,13 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve r/w unity-mapped page entries for devices */
         /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
-            reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
+        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
         /* push 'base' just outside of virtual address space */
         base = iommu_top;
     }
     /* register IOMMU exclusion range settings */
-    if ( limit >= iommu_top )
+    if ( !rc && limit >= iommu_top )
     {
         for_each_amd_iommu( iommu )
         {
@@ -242,15 +258,15 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve unity-mapped page entries for device */
         /* note: these entries are part of the exclusion range */
-        reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
-        reserve_unity_map_for_device(seg, req, base, length, iw, ir);
+        rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?:
+             reserve_unity_map_for_device(seg, req, base, length, iw, ir);
 
         /* push 'base' just outside of virtual address space */
         base = iommu_top;
     }
 
     /* register IOMMU exclusion range settings for device */
-    if ( limit >= iommu_top  )
+    if ( !rc && limit >= iommu_top  )
     {
         rc = reserve_iommu_exclusion_range(iommu, base, limit,
                                            false /* all */, iw, ir);
@@ -281,15 +297,15 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve r/w unity-mapped page entries for devices */
         /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
+        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
         {
             if ( iommu == find_iommu_for_device(iommu->seg, bdf) )
             {
-                reserve_unity_map_for_device(iommu->seg, bdf, base, length,
-                                             iw, ir);
                 req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
-                reserve_unity_map_for_device(iommu->seg, req, base, length,
-                                             iw, ir);
+                rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
+                                                  iw, ir) ?:
+                     reserve_unity_map_for_device(iommu->seg, req, base, length,
+                                                  iw, ir);
             }
         }
 
@@ -298,7 +314,7 @@ static int __init register_exclusion_ran
     }
 
     /* register IOMMU exclusion range settings */
-    if ( limit >= iommu_top )
+    if ( !rc && limit >= iommu_top )
         rc = reserve_iommu_exclusion_range(iommu, base, limit,
                                            true /* all */, iw, ir);
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -366,15 +366,17 @@ static int amd_iommu_assign_device(struc
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int bdf = PCI_BDF2(pdev->bus, devfn);
     int req_id = get_dma_requestor_id(pdev->seg, bdf);
+    const struct ivrs_unity_map *unity_map;
 
-    if ( ivrs_mappings[req_id].unity_map_enable )
+    for ( unity_map = ivrs_mappings[req_id].unity_map; unity_map;
+          unity_map = unity_map->next )
     {
-        amd_iommu_reserve_domain_unity_map(
-            d,
-            ivrs_mappings[req_id].addr_range_start,
-            ivrs_mappings[req_id].addr_range_length,
-            ivrs_mappings[req_id].write_permission,
-            ivrs_mappings[req_id].read_permission);
+        int rc = amd_iommu_reserve_domain_unity_map(
+                     d, unity_map->addr, unity_map->length,
+                     unity_map->write, unity_map->read);
+
+        if ( rc )
+            return rc;
     }
 
     return reassign_device(pdev->domain, d, devfn, pdev);
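
The per-device list with overlap detection that replaces the old merge-all
tracking can be illustrated standalone.  A minimal sketch (read/write
permissions omitted for brevity; not Xen code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>

    struct unity_map {
        unsigned long addr, length;
        struct unity_map *next;
    };

    static int reserve(struct unity_map **head, unsigned long addr,
                       unsigned long length)
    {
        struct unity_map *m;

        for ( m = *head; m; m = m->next )
        {
            if ( addr == m->addr && length == m->length )
                return 0;                        /* exact match: okay */
            if ( m->addr + m->length > addr && addr + length > m->addr )
                return -EPERM;                   /* partial overlap: reject */
        }

        m = malloc(sizeof(*m));
        if ( !m )
            return -ENOMEM;
        m->addr = addr;
        m->length = length;
        m->next = *head;
        *head = m;
        return 0;
    }

    int main(void)
    {
        struct unity_map *head = NULL;

        printf("%d\n", reserve(&head, 0x1000, 0x2000)); /* 0: new entry */
        printf("%d\n", reserve(&head, 0x1000, 0x2000)); /* 0: duplicate */
        printf("%d\n", reserve(&head, 0x2000, 0x2000)); /* -EPERM: overlap */
        return 0;
    }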

[-- Attachment #41: xsa378/xsa378-4.14-3.patch --]
[-- Type: application/octet-stream, Size: 4180 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU: also pass p2m_access_t to p2m_get_iommu_flags()

A subsequent change will want to customize the IOMMU permissions based
on this.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -680,7 +680,7 @@ ept_set_entry(struct p2m_domain *p2m, gf
     uint8_t ipat = 0;
     bool_t need_modify_vtd_table = 1;
     bool_t vtd_pte_present = 0;
-    unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, mfn);
+    unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, p2ma, mfn);
     bool_t needs_sync = 1;
     ept_entry_t old_entry = { .epte = 0 };
     ept_entry_t new_entry = { .epte = 0 };
@@ -808,8 +808,8 @@ ept_set_entry(struct p2m_domain *p2m, gf
 
         /* Safe to read-then-write because we hold the p2m lock */
         if ( ept_entry->mfn == new_entry.mfn &&
-             p2m_get_iommu_flags(ept_entry->sa_p2mt, _mfn(ept_entry->mfn)) ==
-             iommu_flags )
+             p2m_get_iommu_flags(ept_entry->sa_p2mt, ept_entry->access,
+                                 _mfn(ept_entry->mfn)) == iommu_flags )
             need_modify_vtd_table = 0;
 
         ept_p2m_type_to_flags(p2m, &new_entry);
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -480,6 +480,16 @@ int p2m_pt_handle_deferred_changes(uint6
     return rc;
 }
 
+/* Reconstruct a fake p2m_access_t from stored PTE flags. */
+static p2m_access_t p2m_flags_to_access(unsigned int flags)
+{
+    if ( !(flags & _PAGE_PRESENT) )
+        return p2m_access_n;
+
+    /* No need to look at _PAGE_NX for now. */
+    return flags & _PAGE_RW ? p2m_access_rw : p2m_access_r;
+}
+
 /* Checks only applicable to entries with order > PAGE_ORDER_4K */
 static void check_entry(mfn_t mfn, p2m_type_t new, p2m_type_t old,
                         unsigned int order)
@@ -514,7 +524,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
     l2_pgentry_t l2e_content;
     l3_pgentry_t l3e_content;
     int rc;
-    unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, mfn);
+    unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, p2ma, mfn);
     /*
      * old_mfn and iommu_old_flags control possible flush/update needs on the
      * IOMMU: We need to flush when MFN or flags (i.e. permissions) change.
@@ -577,6 +587,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                 old_mfn = l1e_get_pfn(*p2m_entry);
                 iommu_old_flags =
                     p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                        p2m_flags_to_access(flags),
                                         _mfn(old_mfn));
             }
             else
@@ -619,9 +630,10 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                                    0, L1_PAGETABLE_ENTRIES);
         ASSERT(p2m_entry);
         old_mfn = l1e_get_pfn(*p2m_entry);
+        flags = l1e_get_flags(*p2m_entry);
         iommu_old_flags =
-            p2m_get_iommu_flags(p2m_flags_to_type(l1e_get_flags(*p2m_entry)),
-                                _mfn(old_mfn));
+            p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                p2m_flags_to_access(flags), _mfn(old_mfn));
 
         if ( mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt) )
             entry_content = p2m_l1e_from_pfn(mfn_x(mfn),
@@ -649,6 +661,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                 old_mfn = l1e_get_pfn(*p2m_entry);
                 iommu_old_flags =
                     p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                        p2m_flags_to_access(flags),
                                         _mfn(old_mfn));
             }
             else
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -903,7 +903,8 @@ static inline void p2m_altp2m_check(stru
 /*
  * p2m type to IOMMU flags
  */
-static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
+static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt,
+                                               p2m_access_t p2ma, mfn_t mfn)
 {
     unsigned int flags;
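
The PTE-flags to access reconstruction above hinges on the present bit: a
non-present entry must yield p2m_access_n, and only present entries degrade
to rw/r.  A standalone check (the flag values mirror x86's _PAGE_PRESENT and
_PAGE_RW, but treat them as stand-ins; not Xen code):

    #include <stdio.h>

    #define _PAGE_PRESENT 0x1u
    #define _PAGE_RW      0x2u

    typedef enum { p2m_access_n, p2m_access_r, p2m_access_rw } p2m_access_t;

    static p2m_access_t p2m_flags_to_access(unsigned int flags)
    {
        if ( !(flags & _PAGE_PRESENT) )
            return p2m_access_n;             /* not present: no access */
        return flags & _PAGE_RW ? p2m_access_rw : p2m_access_r;
    }

    int main(void)
    {
        printf("%d %d %d\n",
               p2m_flags_to_access(0),                         /* n  */
               p2m_flags_to_access(_PAGE_PRESENT),             /* r  */
               p2m_flags_to_access(_PAGE_PRESENT | _PAGE_RW)); /* rw */
        return 0;
    }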
 

[-- Attachment #42: xsa378/xsa378-4.14-4.patch --]
[-- Type: application/octet-stream, Size: 12063 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU: generalize VT-d's tracking of mapped RMRR regions

In order to re-use it elsewhere, move the logic to vendor independent
code and strip it of RMRR specifics.

Note that the prior "map" parameter gets folded into the new "p2ma" one
(which AMD IOMMU code will want to make use of), assigning alternative
meaning ("unmap") to p2m_access_x. Prepare set_identity_p2m_entry() and
p2m_get_iommu_flags() for getting passed access types other than
p2m_access_rw (in the latter case just for p2m_mmio_direct requests).

Note also that, to be on the safe side, an overlap check gets added to
the main loop of iommu_identity_mapping().

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1353,7 +1353,7 @@ int set_identity_p2m_entry(struct domain
         if ( !is_iommu_enabled(d) )
             return 0;
         return iommu_legacy_map(d, _dfn(gfn_l), _mfn(gfn_l), PAGE_ORDER_4K,
-                                IOMMUF_readable | IOMMUF_writable);
+                                p2m_access_to_iommu_flags(p2ma));
     }
 
     gfn_lock(p2m, gfn, 0);
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -42,12 +42,6 @@
 #include "vtd.h"
 #include "../ats.h"
 
-struct mapped_rmrr {
-    struct list_head list;
-    u64 base, end;
-    unsigned int count;
-};
-
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
 
@@ -1800,17 +1794,12 @@ out:
 static void iommu_domain_teardown(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    struct mapped_rmrr *mrmrr, *tmp;
     const struct acpi_drhd_unit *drhd;
 
     if ( list_empty(&acpi_drhd_units) )
         return;
 
-    list_for_each_entry_safe ( mrmrr, tmp, &hd->arch.mapped_rmrrs, list )
-    {
-        list_del(&mrmrr->list);
-        xfree(mrmrr);
-    }
+    iommu_identity_map_teardown(d);
 
     ASSERT(is_iommu_enabled(d));
 
@@ -1966,74 +1955,6 @@ static void iommu_set_pgd(struct domain
         pagetable_get_paddr(pagetable_from_mfn(pgd_mfn));
 }
 
-static int rmrr_identity_mapping(struct domain *d, bool_t map,
-                                 const struct acpi_rmrr_unit *rmrr,
-                                 u32 flag)
-{
-    unsigned long base_pfn = rmrr->base_address >> PAGE_SHIFT_4K;
-    unsigned long end_pfn = PAGE_ALIGN_4K(rmrr->end_address) >> PAGE_SHIFT_4K;
-    struct mapped_rmrr *mrmrr;
-    struct domain_iommu *hd = dom_iommu(d);
-
-    ASSERT(pcidevs_locked());
-    ASSERT(rmrr->base_address < rmrr->end_address);
-
-    /*
-     * No need to acquire hd->arch.mapping_lock: Both insertion and removal
-     * get done while holding pcidevs_lock.
-     */
-    list_for_each_entry( mrmrr, &hd->arch.mapped_rmrrs, list )
-    {
-        if ( mrmrr->base == rmrr->base_address &&
-             mrmrr->end == rmrr->end_address )
-        {
-            int ret = 0;
-
-            if ( map )
-            {
-                ++mrmrr->count;
-                return 0;
-            }
-
-            if ( --mrmrr->count )
-                return 0;
-
-            while ( base_pfn < end_pfn )
-            {
-                if ( clear_identity_p2m_entry(d, base_pfn) )
-                    ret = -ENXIO;
-                base_pfn++;
-            }
-
-            list_del(&mrmrr->list);
-            xfree(mrmrr);
-            return ret;
-        }
-    }
-
-    if ( !map )
-        return -ENOENT;
-
-    while ( base_pfn < end_pfn )
-    {
-        int err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw, flag);
-
-        if ( err )
-            return err;
-        base_pfn++;
-    }
-
-    mrmrr = xmalloc(struct mapped_rmrr);
-    if ( !mrmrr )
-        return -ENOMEM;
-    mrmrr->base = rmrr->base_address;
-    mrmrr->end = rmrr->end_address;
-    mrmrr->count = 1;
-    list_add_tail(&mrmrr->list, &hd->arch.mapped_rmrrs);
-
-    return 0;
-}
-
 static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
 {
     struct acpi_rmrr_unit *rmrr;
@@ -2065,7 +1986,9 @@ static int intel_iommu_add_device(u8 dev
              * Since RMRRs are always reserved in the e820 map for the hardware
              * domain, there shouldn't be a conflict.
              */
-            ret = rmrr_identity_mapping(pdev->domain, 1, rmrr, 0);
+            ret = iommu_identity_mapping(pdev->domain, p2m_access_rw,
+                                         rmrr->base_address, rmrr->end_address,
+                                         0);
             if ( ret )
                 dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
                         pdev->domain->domain_id);
@@ -2110,7 +2033,8 @@ static int intel_iommu_remove_device(u8
          * Any flag is nothing to clear these mappings but here
          * its always safe and strict to set 0.
          */
-        rmrr_identity_mapping(pdev->domain, 0, rmrr, 0);
+        iommu_identity_mapping(pdev->domain, p2m_access_x, rmrr->base_address,
+                               rmrr->end_address, 0);
     }
 
     return domain_context_unmap(pdev->domain, devfn, pdev);
@@ -2309,7 +2233,8 @@ static void __hwdom_init setup_hwdom_rmr
          * domain, there shouldn't be a conflict. So its always safe and
          * strict to set 0.
          */
-        ret = rmrr_identity_mapping(d, 1, rmrr, 0);
+        ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address,
+                                     rmrr->end_address, 0);
         if ( ret )
             dprintk(XENLOG_ERR VTDPREFIX,
                      "IOMMU: mapping reserved region failed\n");
@@ -2480,7 +2405,9 @@ static int reassign_device_ownership(
                  * Any RMRR flag is always ignored when remove a device,
                  * but its always safe and strict to set 0.
                  */
-                ret = rmrr_identity_mapping(source, 0, rmrr, 0);
+                ret = iommu_identity_mapping(source, p2m_access_x,
+                                             rmrr->base_address,
+                                             rmrr->end_address, 0);
                 if ( ret != -ENOENT )
                     return ret;
             }
@@ -2577,7 +2504,8 @@ static int intel_iommu_assign_device(
              PCI_BUS(bdf) == bus &&
              PCI_DEVFN2(bdf) == devfn )
         {
-            ret = rmrr_identity_mapping(d, 1, rmrr, flag);
+            ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address,
+                                         rmrr->end_address, flag);
             if ( ret )
             {
                 int rc;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -139,7 +139,7 @@ int arch_iommu_domain_init(struct domain
     struct domain_iommu *hd = dom_iommu(d);
 
     spin_lock_init(&hd->arch.mapping_lock);
-    INIT_LIST_HEAD(&hd->arch.mapped_rmrrs);
+    INIT_LIST_HEAD(&hd->arch.identity_maps);
 
     return 0;
 }
@@ -148,6 +148,99 @@ void arch_iommu_domain_destroy(struct do
 {
 }
 
+struct identity_map {
+    struct list_head list;
+    paddr_t base, end;
+    p2m_access_t access;
+    unsigned int count;
+};
+
+int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
+                           paddr_t base, paddr_t end,
+                           unsigned int flag)
+{
+    unsigned long base_pfn = base >> PAGE_SHIFT_4K;
+    unsigned long end_pfn = PAGE_ALIGN_4K(end) >> PAGE_SHIFT_4K;
+    struct identity_map *map;
+    struct domain_iommu *hd = dom_iommu(d);
+
+    ASSERT(pcidevs_locked());
+    ASSERT(base < end);
+
+    /*
+     * No need to acquire hd->arch.mapping_lock: Both insertion and removal
+     * get done while holding pcidevs_lock.
+     */
+    list_for_each_entry( map, &hd->arch.identity_maps, list )
+    {
+        if ( map->base == base && map->end == end )
+        {
+            int ret = 0;
+
+            if ( p2ma != p2m_access_x )
+            {
+                if ( map->access != p2ma )
+                    return -EADDRINUSE;
+                ++map->count;
+                return 0;
+            }
+
+            if ( --map->count )
+                return 0;
+
+            while ( base_pfn < end_pfn )
+            {
+                if ( clear_identity_p2m_entry(d, base_pfn) )
+                    ret = -ENXIO;
+                base_pfn++;
+            }
+
+            list_del(&map->list);
+            xfree(map);
+
+            return ret;
+        }
+
+        if ( end >= map->base && map->end >= base )
+            return -EADDRINUSE;
+    }
+
+    if ( p2ma == p2m_access_x )
+        return -ENOENT;
+
+    while ( base_pfn < end_pfn )
+    {
+        int err = set_identity_p2m_entry(d, base_pfn, p2ma, flag);
+
+        if ( err )
+            return err;
+        base_pfn++;
+    }
+
+    map = xmalloc(struct identity_map);
+    if ( !map )
+        return -ENOMEM;
+    map->base = base;
+    map->end = end;
+    map->access = p2ma;
+    map->count = 1;
+    list_add_tail(&map->list, &hd->arch.identity_maps);
+
+    return 0;
+}
+
+void iommu_identity_map_teardown(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    struct identity_map *map, *tmp;
+
+    list_for_each_entry_safe ( map, tmp, &hd->arch.identity_maps, list )
+    {
+        list_del(&map->list);
+        xfree(map);
+    }
+}
+
 static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
                                          unsigned long pfn,
                                          unsigned long max_pfn)
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -16,6 +16,7 @@
 
 #include <xen/errno.h>
 #include <xen/list.h>
+#include <xen/mem_access.h>
 #include <xen/spinlock.h>
 #include <asm/apicdef.h>
 #include <asm/processor.h>
@@ -49,7 +50,7 @@ struct arch_iommu
     spinlock_t mapping_lock;            /* io page table lock */
     int agaw;     /* adjusted guest address width, 0 is level 2 30-bit */
     u64 iommu_bitmap;              /* bitmap of iommu(s) that the domain uses */
-    struct list_head mapped_rmrrs;
+    struct list_head identity_maps;
 
     /* amd iommu support */
     int paging_mode;
@@ -112,6 +113,11 @@ static inline void iommu_disable_x2apic(
         iommu_ops.disable_x2apic();
 }
 
+int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
+                           paddr_t base, paddr_t end,
+                           unsigned int flag);
+void iommu_identity_map_teardown(struct domain *d);
+
 extern bool untrusted_msi;
 
 int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -900,6 +900,34 @@ struct p2m_domain *p2m_get_altp2m(struct
 static inline void p2m_altp2m_check(struct vcpu *v, uint16_t idx) {}
 #endif
 
+/* p2m access to IOMMU flags */
+static inline unsigned int p2m_access_to_iommu_flags(p2m_access_t p2ma)
+{
+    switch ( p2ma )
+    {
+    case p2m_access_rw:
+    case p2m_access_rwx:
+        return IOMMUF_readable | IOMMUF_writable;
+
+    case p2m_access_r:
+    case p2m_access_rx:
+    case p2m_access_rx2rw:
+        return IOMMUF_readable;
+
+    case p2m_access_w:
+    case p2m_access_wx:
+        return IOMMUF_writable;
+
+    case p2m_access_n:
+    case p2m_access_x:
+    case p2m_access_n2rwx:
+        return 0;
+    }
+
+    ASSERT_UNREACHABLE();
+    return 0;
+}
+
 /*
  * p2m type to IOMMU flags
  */
@@ -921,9 +949,10 @@ static inline unsigned int p2m_get_iommu
         flags = IOMMUF_readable;
         break;
     case p2m_mmio_direct:
-        flags = IOMMUF_readable;
-        if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
-            flags |= IOMMUF_writable;
+        flags = p2m_access_to_iommu_flags(p2ma);
+        if ( (flags & IOMMUF_writable) &&
+             rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
+            flags &= ~IOMMUF_writable;
         break;
     default:
         flags = 0;
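
The refcounting scheme iommu_identity_mapping() generalizes from VT-d's RMRR
handling - repeated registrations of the same range bump a count, and the
p2m entries are torn down only when the last user drops it - sketched
standalone with a single slot instead of a list (not Xen code):

    #include <stdio.h>
    #include <errno.h>

    struct identity_map {
        unsigned long base, end;
        unsigned int count;
        int live;                  /* stands in for list membership */
    };

    static int identity_mapping(struct identity_map *map, unsigned long base,
                                unsigned long end, int unmap)
    {
        if ( map->live && map->base == base && map->end == end )
        {
            if ( !unmap )
            {
                ++map->count;
                return 0;
            }
            if ( --map->count )
                return 0;
            map->live = 0;         /* last reference: really unmap */
            return 0;
        }

        if ( unmap )
            return -ENOENT;

        map->base = base;
        map->end = end;
        map->count = 1;
        map->live = 1;
        return 0;
    }

    int main(void)
    {
        struct identity_map m = { 0 };

        identity_mapping(&m, 0xa0000, 0xbffff, 0);  /* device 1 maps */
        identity_mapping(&m, 0xa0000, 0xbffff, 0);  /* device 2 maps */
        identity_mapping(&m, 0xa0000, 0xbffff, 1);  /* device 1 unmaps */
        printf("live: %d count: %u\n", m.live, m.count); /* 1 1 */
        identity_mapping(&m, 0xa0000, 0xbffff, 1);  /* device 2 unmaps */
        printf("live: %d\n", m.live);                    /* 0 */
        return 0;
    }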

[-- Attachment #43: xsa378/xsa378-4.14-5.patch --]
[-- Type: application/octet-stream, Size: 7457 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-arrange/complete re-assignment handling

Prior to the assignment step having completed successfully, devices
should not get associated with their new owner. Hand the device to DomIO
(perhaps temporarily), until after the de-assignment step has completed.

De-assignment of a device (from other than Dom0) as well as failure of
reassign_device() during assignment should result in unity mappings
getting torn down. This in turn requires switching to a refcounted
mapping approach, as was already used by VT-d for its RMRRs, to prevent
unmapping a region used by multiple devices.

This is CVE-2021-28696 / part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -232,8 +232,10 @@ int __must_check amd_iommu_unmap_page(st
                                       unsigned int *flush_flags);
 int __must_check amd_iommu_alloc_root(struct domain_iommu *hd);
 int amd_iommu_reserve_domain_unity_map(struct domain *domain,
-                                       paddr_t phys_addr, unsigned long size,
-                                       int iw, int ir);
+                                       const struct ivrs_unity_map *map,
+                                       unsigned int flag);
+int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
+                                         const struct ivrs_unity_map *map);
 int __must_check amd_iommu_flush_iotlb_pages(struct domain *d, dfn_t dfn,
                                              unsigned int page_count,
                                              unsigned int flush_flags);
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -420,38 +420,49 @@ int amd_iommu_flush_iotlb_all(struct dom
     return 0;
 }
 
-int amd_iommu_reserve_domain_unity_map(struct domain *domain,
-                                       paddr_t phys_addr,
-                                       unsigned long size, int iw, int ir)
+int amd_iommu_reserve_domain_unity_map(struct domain *d,
+                                       const struct ivrs_unity_map *map,
+                                       unsigned int flag)
 {
-    unsigned long npages, i;
-    unsigned long gfn;
-    unsigned int flags = !!ir;
-    unsigned int flush_flags = 0;
-    int rt = 0;
-
-    if ( iw )
-        flags |= IOMMUF_writable;
-
-    npages = region_to_pages(phys_addr, size);
-    gfn = phys_addr >> PAGE_SHIFT;
-    for ( i = 0; i < npages; i++ )
+    int rc;
+
+    if ( d == dom_io )
+        return 0;
+
+    for ( rc = 0; !rc && map; map = map->next )
     {
-        unsigned long frame = gfn + i;
+        p2m_access_t p2ma = p2m_access_n;
+
+        if ( map->read )
+            p2ma |= p2m_access_r;
+        if ( map->write )
+            p2ma |= p2m_access_w;
 
-        rt = amd_iommu_map_page(domain, _dfn(frame), _mfn(frame), flags,
-                                &flush_flags);
-        if ( rt != 0 )
-            break;
+        rc = iommu_identity_mapping(d, p2ma, map->addr,
+                                    map->addr + map->length - 1, flag);
     }
 
-    /* Use while-break to avoid compiler warning */
-    while ( flush_flags &&
-            amd_iommu_flush_iotlb_pages(domain, _dfn(gfn),
-                                        npages, flush_flags) )
-        break;
+    return rc;
+}
+
+int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
+                                         const struct ivrs_unity_map *map)
+{
+    int rc;
+
+    if ( d == dom_io )
+        return 0;
+
+    for ( rc = 0; map; map = map->next )
+    {
+        int ret = iommu_identity_mapping(d, p2m_access_x, map->addr,
+                                         map->addr + map->length - 1, 0);
+
+        if ( ret && ret != -ENOENT && !rc )
+            rc = ret;
+    }
 
-    return rt;
+    return rc;
 }
 
 int __init amd_iommu_quarantine_init(struct domain *d)
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -327,6 +327,7 @@ static int reassign_device(struct domain
     struct amd_iommu *iommu;
     int bdf, rc;
     struct domain_iommu *t = dom_iommu(target);
+    const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
 
     bdf = PCI_BDF2(pdev->bus, pdev->devfn);
     iommu = find_iommu_for_device(pdev->seg, bdf);
@@ -341,10 +342,24 @@ static int reassign_device(struct domain
 
     amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
 
-    if ( devfn == pdev->devfn )
+    /*
+     * If the device belongs to the hardware domain, and it has a unity mapping,
+     * don't remove it from the hardware domain, because BIOS may reference that
+     * mapping.
+     */
+    if ( !is_hardware_domain(source) )
+    {
+        rc = amd_iommu_reserve_domain_unity_unmap(
+                 source,
+                 ivrs_mappings[get_dma_requestor_id(pdev->seg, bdf)].unity_map);
+        if ( rc )
+            return rc;
+    }
+
+    if ( devfn == pdev->devfn && pdev->domain != dom_io )
     {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
+        list_move(&pdev->domain_list, &dom_io->pdev_list);
+        pdev->domain = dom_io;
     }
 
     rc = allocate_domain_resources(t);
@@ -356,6 +371,12 @@ static int reassign_device(struct domain
                     pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                     source->domain_id, target->domain_id);
 
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
+
     return 0;
 }
 
@@ -366,20 +387,28 @@ static int amd_iommu_assign_device(struc
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int bdf = PCI_BDF2(pdev->bus, devfn);
     int req_id = get_dma_requestor_id(pdev->seg, bdf);
-    const struct ivrs_unity_map *unity_map;
+    int rc = amd_iommu_reserve_domain_unity_map(
+                 d, ivrs_mappings[req_id].unity_map, flag);
+
+    if ( !rc )
+        rc = reassign_device(pdev->domain, d, devfn, pdev);
 
-    for ( unity_map = ivrs_mappings[req_id].unity_map; unity_map;
-          unity_map = unity_map->next )
+    if ( rc && !is_hardware_domain(d) )
     {
-        int rc = amd_iommu_reserve_domain_unity_map(
-                     d, unity_map->addr, unity_map->length,
-                     unity_map->write, unity_map->read);
+        int ret = amd_iommu_reserve_domain_unity_unmap(
+                      d, ivrs_mappings[req_id].unity_map);
 
-        if ( rc )
-            return rc;
+        if ( ret )
+        {
+            printk(XENLOG_ERR "AMD-Vi: "
+                   "unity-unmap for %pd/%04x:%02x:%02x.%u failed (%d)\n",
+                   d, pdev->seg, pdev->bus,
+                   PCI_SLOT(devfn), PCI_FUNC(devfn), ret);
+            domain_crash(d);
+        }
     }
 
-    return reassign_device(pdev->domain, d, devfn, pdev);
+    return rc;
 }
 
 static void deallocate_next_page_table(struct page_info *pg, int level)
@@ -438,6 +467,7 @@ static void deallocate_iommu_page_tables
 
 static void amd_iommu_domain_destroy(struct domain *d)
 {
+    iommu_identity_map_teardown(d);
     deallocate_iommu_page_tables(d);
     amd_iommu_flush_all_pages(d);
 }
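
The ordering this patch establishes - park the device with a quarantine
owner while the old owner's mappings are torn down, and associate it with
the new owner only once setup has succeeded - in a rough standalone sketch
(the helpers are hypothetical stand-ins; not Xen code):

    #include <stdio.h>

    enum owner { DOM_A, DOM_IO, DOM_B };
    static const char *name[] = { "domA", "domIO", "domB" };

    static int teardown_unity(enum owner o) { (void)o; return 0; }
    static int setup_target(enum owner o)   { (void)o; return 0; }

    static int reassign(enum owner *dev, enum owner target)
    {
        int rc = teardown_unity(*dev);

        if ( rc )
            return rc;

        *dev = DOM_IO;             /* park: not yet owned by target */
        printf("parked with %s\n", name[*dev]);

        rc = setup_target(target);
        if ( rc )
            return rc;             /* failure leaves the device in DOM_IO */

        *dev = target;             /* association happens last */
        return 0;
    }

    int main(void)
    {
        enum owner dev = DOM_A;

        reassign(&dev, DOM_B);
        printf("owner: %s\n", name[dev]);
        return 0;
    }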

[-- Attachment #44: xsa378/xsa378-4.14-6.patch --]
[-- Type: application/octet-stream, Size: 15576 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-arrange exclusion range and unity map recording

The spec makes no provisions for OS behavior here to depend on the
amount of RAM found on the system. While the spec may not sufficiently
clearly distinguish both kinds of regions, they are surely meant to be
separate things: Only regions with ACPI_IVMD_EXCLUSION_RANGE set should
be candidates for putting in the exclusion range registers. (As there's
only a single such pair of registers per IOMMU, secondary non-adjacent
regions with the flag set already get converted to unity mapped
regions.)

First of all, drop the dependency on max_page. With commit b4f042236ae0
("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables") the
use of it here was stale anyway; it was bogus already before, as it
didn't account for max_page getting increased later on. Simply try an
exclusion range registration first, and if it fails (for being
unsuitable or non-mergeable), register a unity mapping range.

With this various local variables become unnecessary and hence get
dropped at the same time.

With the max_page boundary dropped for using unity maps, the minimum
page table tree height now needs both recording and enforcing in
amd_iommu_domain_init(). Since we can't predict which devices may get
assigned to a domain, our only option is to uniformly force at least
that height for all domains, now that the height isn't dynamic anymore.

Further don't make use of the exclusion range unless ACPI data says so.

Note that exclusion range registration in
register_range_for_all_devices() is on a best effort basis. Hence unity
map entries also registered are redundant when the former succeeded, but
they also do no harm. Improvements in this area can be done later imo.

Also adjust types where suitable without touching extra lines.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -308,6 +308,8 @@ extern struct hpet_sbdf {
     } init;
 } hpet_sbdf;
 
+extern int amd_iommu_min_paging_mode;
+
 extern void *shared_intremap_table;
 extern unsigned long *shared_intremap_inuse;
 
--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -118,12 +118,8 @@ static struct amd_iommu * __init find_io
 }
 
 static int __init reserve_iommu_exclusion_range(
-    struct amd_iommu *iommu, uint64_t base, uint64_t limit,
-    bool all, bool iw, bool ir)
+    struct amd_iommu *iommu, paddr_t base, paddr_t limit, bool all)
 {
-    if ( !ir || !iw )
-        return -EPERM;
-
     /* need to extend exclusion range? */
     if ( iommu->exclusion_enable )
     {
@@ -152,14 +148,18 @@ static int __init reserve_unity_map_for_
 {
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
     struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map;
+    int paging_mode = amd_iommu_get_paging_mode(PFN_UP(base + length));
+
+    if ( paging_mode < 0 )
+        return paging_mode;
 
     /* Check for overlaps. */
     for ( ; unity_map; unity_map = unity_map->next )
     {
         /*
          * Exact matches are okay. This can in particular happen when
-         * register_exclusion_range_for_device() calls here twice for the
-         * same (s,b,d,f).
+         * register_range_for_device() calls here twice for the same
+         * (s,b,d,f).
          */
         if ( base == unity_map->addr && length == unity_map->length &&
              ir == unity_map->read && iw == unity_map->write )
@@ -187,55 +187,52 @@ static int __init reserve_unity_map_for_
     unity_map->next = ivrs_mappings[bdf].unity_map;
     ivrs_mappings[bdf].unity_map = unity_map;
 
+    if ( paging_mode > amd_iommu_min_paging_mode )
+        amd_iommu_min_paging_mode = paging_mode;
+
     return 0;
 }
 
-static int __init register_exclusion_range_for_all_devices(
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_all_devices(
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
-    unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
-    unsigned int bdf;
     int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
-        /* reserve r/w unity-mapped page entries for devices */
-        /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
-            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
-    }
-    /* register IOMMU exclusion range settings */
-    if ( !rc && limit >= iommu_top )
+    if ( exclusion )
     {
         for_each_amd_iommu( iommu )
         {
-            rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                               true /* all */, iw, ir);
-            if ( rc )
-                break;
+            int ret = reserve_iommu_exclusion_range(iommu, base, limit,
+                                                    true /* all */);
+
+            if ( ret && !rc )
+                rc = ret;
         }
     }
 
+    if ( !exclusion || rc )
+    {
+        paddr_t length = limit + PAGE_SIZE - base;
+        unsigned int bdf;
+
+        /* reserve r/w unity-mapped page entries for devices */
+        for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
+    }
+
     return rc;
 }
 
-static int __init register_exclusion_range_for_device(
-    u16 bdf, unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_device(
+    unsigned int bdf, paddr_t base, paddr_t limit,
+    bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
-    unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     u16 req;
     int rc = 0;
@@ -249,27 +246,19 @@ static int __init register_exclusion_ran
     req = ivrs_mappings[bdf].dte_requestor_id;
 
     /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
+    if ( exclusion )
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           false /* all */);
+    if ( !exclusion || rc )
+    {
+        paddr_t length = limit + PAGE_SIZE - base;
+
         /* reserve unity-mapped page entries for device */
-        /* note: these entries are part of the exclusion range */
         rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?:
              reserve_unity_map_for_device(seg, req, base, length, iw, ir);
-
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
     }
-
-    /* register IOMMU exclusion range settings for device */
-    if ( !rc && limit >= iommu_top  )
+    else
     {
-        rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                           false /* all */, iw, ir);
         ivrs_mappings[bdf].dte_allow_exclusion = true;
         ivrs_mappings[req].dte_allow_exclusion = true;
     }
@@ -277,53 +266,42 @@ static int __init register_exclusion_ran
     return rc;
 }
 
-static int __init register_exclusion_range_for_iommu_devices(
-    struct amd_iommu *iommu,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_iommu_devices(
+    struct amd_iommu *iommu, paddr_t base, paddr_t limit,
+    bool iw, bool ir, bool exclusion)
 {
-    unsigned long range_top, iommu_top, length;
+    /* note: 'limit' parameter is assumed to be page-aligned */
+    paddr_t length = limit + PAGE_SIZE - base;
     unsigned int bdf;
     u16 req;
-    int rc = 0;
+    int rc;
 
-    /* is part of exclusion range inside of IOMMU virtual address space? */
-    /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
-        /* reserve r/w unity-mapped page entries for devices */
-        /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
-        {
-            if ( iommu == find_iommu_for_device(iommu->seg, bdf) )
-            {
-                req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
-                rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
-                                                  iw, ir) ?:
-                     reserve_unity_map_for_device(iommu->seg, req, base, length,
-                                                  iw, ir);
-            }
-        }
-
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
+    if ( exclusion )
+    {
+        rc = reserve_iommu_exclusion_range(iommu, base, limit, true /* all */);
+        if ( !rc )
+            return 0;
     }
 
-    /* register IOMMU exclusion range settings */
-    if ( !rc && limit >= iommu_top )
-        rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                           true /* all */, iw, ir);
+    /* reserve unity-mapped page entries for devices */
+    for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+    {
+        if ( iommu != find_iommu_for_device(iommu->seg, bdf) )
+            continue;
+
+        req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
+        rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
+                                          iw, ir) ?:
+             reserve_unity_map_for_device(iommu->seg, req, base, length,
+                                          iw, ir);
+    }
 
     return rc;
 }
 
 static int __init parse_ivmd_device_select(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     u16 bdf;
 
@@ -334,12 +312,12 @@ static int __init parse_ivmd_device_sele
         return -ENODEV;
     }
 
-    return register_exclusion_range_for_device(bdf, base, limit, iw, ir);
+    return register_range_for_device(bdf, base, limit, iw, ir, exclusion);
 }
 
 static int __init parse_ivmd_device_range(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     unsigned int first_bdf, last_bdf, bdf;
     int error;
@@ -361,15 +339,15 @@ static int __init parse_ivmd_device_rang
     }
 
     for ( bdf = first_bdf, error = 0; (bdf <= last_bdf) && !error; bdf++ )
-        error = register_exclusion_range_for_device(
-            bdf, base, limit, iw, ir);
+        error = register_range_for_device(
+            bdf, base, limit, iw, ir, exclusion);
 
     return error;
 }
 
 static int __init parse_ivmd_device_iommu(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
     struct amd_iommu *iommu;
@@ -384,14 +362,14 @@ static int __init parse_ivmd_device_iomm
         return -ENODEV;
     }
 
-    return register_exclusion_range_for_iommu_devices(
-        iommu, base, limit, iw, ir);
+    return register_range_for_iommu_devices(
+        iommu, base, limit, iw, ir, exclusion);
 }
 
 static int __init parse_ivmd_block(const struct acpi_ivrs_memory *ivmd_block)
 {
     unsigned long start_addr, mem_length, base, limit;
-    u8 iw, ir;
+    bool iw = true, ir = true, exclusion = false;
 
     if ( ivmd_block->header.length < sizeof(*ivmd_block) )
     {
@@ -408,13 +386,11 @@ static int __init parse_ivmd_block(const
                     ivmd_block->header.type, start_addr, mem_length);
 
     if ( ivmd_block->header.flags & ACPI_IVMD_EXCLUSION_RANGE )
-        iw = ir = IOMMU_CONTROL_ENABLED;
+        exclusion = true;
     else if ( ivmd_block->header.flags & ACPI_IVMD_UNITY )
     {
-        iw = ivmd_block->header.flags & ACPI_IVMD_READ ?
-            IOMMU_CONTROL_ENABLED : IOMMU_CONTROL_DISABLED;
-        ir = ivmd_block->header.flags & ACPI_IVMD_WRITE ?
-            IOMMU_CONTROL_ENABLED : IOMMU_CONTROL_DISABLED;
+        iw = ivmd_block->header.flags & ACPI_IVMD_READ;
+        ir = ivmd_block->header.flags & ACPI_IVMD_WRITE;
     }
     else
     {
@@ -425,20 +401,20 @@ static int __init parse_ivmd_block(const
     switch( ivmd_block->header.type )
     {
     case ACPI_IVRS_TYPE_MEMORY_ALL:
-        return register_exclusion_range_for_all_devices(
-            base, limit, iw, ir);
+        return register_range_for_all_devices(
+            base, limit, iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_ONE:
-        return parse_ivmd_device_select(ivmd_block,
-                                        base, limit, iw, ir);
+        return parse_ivmd_device_select(ivmd_block, base, limit,
+                                        iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_RANGE:
-        return parse_ivmd_device_range(ivmd_block,
-                                       base, limit, iw, ir);
+        return parse_ivmd_device_range(ivmd_block, base, limit,
+                                       iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_IOMMU:
-        return parse_ivmd_device_iommu(ivmd_block,
-                                       base, limit, iw, ir);
+        return parse_ivmd_device_iommu(ivmd_block, base, limit,
+                                       iw, ir, exclusion);
 
     default:
         AMD_IOMMU_DEBUG("IVMD Error: Invalid Block Type!\n");
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -231,6 +231,8 @@ static int __must_check allocate_domain_
     return rc;
 }
 
+int __read_mostly amd_iommu_min_paging_mode = 1;
+
 static int amd_iommu_domain_init(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
@@ -242,11 +244,13 @@ static int amd_iommu_domain_init(struct
      * - HVM could in principle use 3 or 4 depending on how much guest
      *   physical address space we give it, but this isn't known yet so use 4
      *   unilaterally.
+     * - Unity maps may require an even higher number.
      */
-    hd->arch.paging_mode = amd_iommu_get_paging_mode(
-        is_hvm_domain(d)
-        ? 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
-        : get_upper_mfn_bound() + 1);
+    hd->arch.paging_mode = max(amd_iommu_get_paging_mode(
+            is_hvm_domain(d)
+            ? 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
+            : get_upper_mfn_bound() + 1),
+        amd_iommu_min_paging_mode);
 
     return 0;
 }

[-- Attachment #45: xsa378/xsa378-4.14-7.patch --]
[-- Type: application/octet-stream, Size: 3364 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: introduce p2m_is_special()

Seeing the similarity of grant, foreign, and (subsequently) direct-MMIO
handling, introduce a new P2M type group named "special" (as in "needing
special accessors to create/destroy").

Also use -EPERM instead of other error codes on the two domain_crash()
paths touched.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
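
The type-group predicates are plain bitmask tests. A self-contained
sketch (enumerator values are made up for illustration; the
shift-into-mask shape of p2m_to_mask() matches Xen's) of how
p2m_is_special() composes from existing groups:

#include <assert.h>

typedef enum {
    p2m_ram_rw,
    p2m_grant_map_rw,
    p2m_grant_map_ro,
    p2m_map_foreign,
} p2m_type_t;

#define p2m_to_mask(t)      (1UL << (t))
#define P2M_GRANT_TYPES     (p2m_to_mask(p2m_grant_map_rw) | \
                             p2m_to_mask(p2m_grant_map_ro))
#define P2M_SPECIAL_TYPES   (P2M_GRANT_TYPES | p2m_to_mask(p2m_map_foreign))

#define p2m_is_special(t)   (p2m_to_mask(t) & P2M_SPECIAL_TYPES)

int main(void)
{
    assert(p2m_is_special(p2m_grant_map_ro));
    assert(p2m_is_special(p2m_map_foreign));
    assert(!p2m_is_special(p2m_ram_rw));
    return 0;
}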

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -807,7 +807,7 @@ p2m_remove_page(struct p2m_domain *p2m,
         for ( i = 0; i < (1UL << page_order); i++ )
         {
             p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, NULL, NULL);
-            if ( !p2m_is_grant(t) && !p2m_is_shared(t) && !p2m_is_foreign(t) )
+            if ( !p2m_is_special(t) && !p2m_is_shared(t) )
                 set_gpfn_from_mfn(mfn_x(mfn) + i, INVALID_M2P_ENTRY);
         }
     }
@@ -935,13 +935,13 @@ guest_physmap_add_entry(struct domain *d
                                   &ot, &a, 0, NULL, NULL);
             ASSERT(!p2m_is_shared(ot));
         }
-        if ( p2m_is_grant(ot) || p2m_is_foreign(ot) )
+        if ( p2m_is_special(ot) )
         {
-            /* Really shouldn't be unmapping grant/foreign maps this way */
+            /* Don't permit unmapping grant/foreign this way. */
             domain_crash(d);
             p2m_unlock(p2m);
             
-            return -EINVAL;
+            return -EPERM;
         }
         else if ( p2m_is_ram(ot) && !p2m_is_paged(ot) )
         {
@@ -1035,8 +1035,7 @@ int p2m_change_type_one(struct domain *d
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
     int rc;
 
-    BUG_ON(p2m_is_grant(ot) || p2m_is_grant(nt));
-    BUG_ON(p2m_is_foreign(ot) || p2m_is_foreign(nt));
+    BUG_ON(p2m_is_special(ot) || p2m_is_special(nt));
 
     gfn_lock(p2m, gfn, 0);
 
@@ -1283,11 +1282,11 @@ static int set_typed_p2m_entry(struct do
         gfn_unlock(p2m, gfn, order);
         return cur_order + 1;
     }
-    if ( p2m_is_grant(ot) || p2m_is_foreign(ot) )
+    if ( p2m_is_special(ot) )
     {
         gfn_unlock(p2m, gfn, order);
         domain_crash(d);
-        return -ENOENT;
+        return -EPERM;
     }
     else if ( p2m_is_ram(ot) )
     {
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -141,6 +141,10 @@ typedef unsigned int p2m_query_t;
                             | p2m_to_mask(p2m_ram_logdirty) )
 #define P2M_SHARED_TYPES   (p2m_to_mask(p2m_ram_shared))
 
+/* Types established/cleaned up via special accessors. */
+#define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \
+                           p2m_to_mask(p2m_map_foreign))
+
 /* Valid types not necessarily associated with a (valid) MFN. */
 #define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES                  \
                                | p2m_to_mask(p2m_mmio_direct) \
@@ -169,6 +173,7 @@ typedef unsigned int p2m_query_t;
 #define p2m_is_paged(_t)    (p2m_to_mask(_t) & P2M_PAGED_TYPES)
 #define p2m_is_sharable(_t) (p2m_to_mask(_t) & P2M_SHARABLE_TYPES)
 #define p2m_is_shared(_t)   (p2m_to_mask(_t) & P2M_SHARED_TYPES)
+#define p2m_is_special(_t)  (p2m_to_mask(_t) & P2M_SPECIAL_TYPES)
 #define p2m_is_broken(_t)   (p2m_to_mask(_t) & P2M_BROKEN_TYPES)
 #define p2m_is_foreign(_t)  (p2m_to_mask(_t) & p2m_to_mask(p2m_map_foreign))
 

[-- Attachment #46: xsa378/xsa378-4.14-8.patch --]
[-- Type: application/octet-stream, Size: 5860 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: guard (in particular) identity mapping entries

Such entries, created by set_identity_p2m_entry(), should only be
destroyed by clear_identity_p2m_entry(). However, similarly, entries
created by set_mmio_p2m_entry() should only be torn down by
clear_mmio_p2m_entry(), so the logic gets based upon p2m_mmio_direct as
the entry type (separation between "ordinary" and 1:1 mappings would
require a further indicator to tell apart the two).

As to the guest_remove_page() change, commit 48dfb297a20a ("x86/PVH:
allow guest_remove_page to remove p2m_mmio_direct pages"), which
introduced the call to clear_mmio_p2m_entry(), claimed this was done for
hwdom only without this actually having been the case. However, this
code shouldn't be there in the first place, as MMIO entries shouldn't be
dropped this way. Avoid triggering the warning again that 48dfb297a20a
silenced by an adjustment to xenmem_add_to_physmap_one() instead.

Note that guest_physmap_mark_populate_on_demand() gets tightened beyond
the immediate purpose of this change.

Note also that I didn't inspect code which isn't security supported,
e.g. sharing, paging, or altp2m.

This is CVE-2021-28694 / part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
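
The guard itself is a simple asymmetric-accessor pattern: generic
removal paths refuse the guarded type, so only the dedicated clear
accessor can tear such entries down. A toy sketch (all names
hypothetical, far simpler than Xen's real p2m code):

#include <assert.h>
#include <errno.h>

typedef enum { t_invalid, t_ram, t_mmio_direct } ptype;

static ptype table[16];                  /* toy "p2m" */

static int generic_remove(unsigned int gfn)
{
    if ( table[gfn] == t_mmio_direct )
        return -EPERM;                   /* must use clear_mmio() */
    table[gfn] = t_invalid;
    return 0;
}

static int clear_mmio(unsigned int gfn)
{
    if ( table[gfn] != t_mmio_direct )
        return -ENOENT;
    table[gfn] = t_invalid;
    return 0;
}

int main(void)
{
    table[3] = t_mmio_direct;
    assert(generic_remove(3) == -EPERM); /* identity/MMIO entry survives */
    assert(clear_mmio(3) == 0);          /* dedicated path succeeds */
    return 0;
}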

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4652,7 +4652,9 @@ int xenmem_add_to_physmap_one(
 
     /* Remove previously mapped page if it was present. */
     prev_mfn = get_gfn(d, gfn_x(gpfn), &p2mt);
-    if ( mfn_valid(prev_mfn) )
+    if ( p2mt == p2m_mmio_direct )
+        rc = -EPERM;
+    else if ( mfn_valid(prev_mfn) )
     {
         if ( is_special_page(mfn_to_page(prev_mfn)) )
             /* Special pages are simply unhooked from this phys slot. */
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -795,7 +795,8 @@ p2m_remove_page(struct p2m_domain *p2m,
                                           &cur_order, NULL);
 
         if ( p2m_is_valid(t) &&
-             (!mfn_valid(mfn) || !mfn_eq(mfn_add(mfn, i), mfn_return)) )
+             (!mfn_valid(mfn) || t == p2m_mmio_direct ||
+              !mfn_eq(mfn_add(mfn, i), mfn_return)) )
             return -EILSEQ;
 
         i += (1UL << cur_order) -
@@ -893,7 +894,7 @@ guest_physmap_add_entry(struct domain *d
     if ( p2m_is_foreign(t) )
         return -EINVAL;
 
-    if ( !mfn_valid(mfn) )
+    if ( !mfn_valid(mfn) || t == p2m_mmio_direct )
     {
         ASSERT_UNREACHABLE();
         return -EINVAL;
@@ -937,7 +938,7 @@ guest_physmap_add_entry(struct domain *d
         }
         if ( p2m_is_special(ot) )
         {
-            /* Don't permit unmapping grant/foreign this way. */
+            /* Don't permit unmapping grant/foreign/direct-MMIO this way. */
             domain_crash(d);
             p2m_unlock(p2m);
             
@@ -1387,8 +1388,8 @@ int set_identity_p2m_entry(struct domain
  *    order+1  for caller to retry with order (guaranteed smaller than
  *             the order value passed in)
  */
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn_l, mfn_t mfn,
-                         unsigned int order)
+static int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn_l,
+                                mfn_t mfn, unsigned int order)
 {
     int rc = -EINVAL;
     gfn_t gfn = _gfn(gfn_l);
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -1292,17 +1292,17 @@ guest_physmap_mark_populate_on_demand(st
 
         p2m->get_entry(p2m, gfn_add(gfn, i), &ot, &a, 0, &cur_order, NULL);
         n = 1UL << min(order, cur_order);
-        if ( p2m_is_ram(ot) )
+        if ( ot == p2m_populate_on_demand )
+        {
+            /* Count how many PoD entries we'll be replacing if successful */
+            pod_count += n;
+        }
+        else if ( ot != p2m_invalid && ot != p2m_mmio_dm )
         {
             P2M_DEBUG("gfn_to_mfn returned type %d!\n", ot);
             rc = -EBUSY;
             goto out;
         }
-        else if ( ot == p2m_populate_on_demand )
-        {
-            /* Count how man PoD entries we'll be replacing if successful */
-            pod_count += n;
-        }
     }
 
     /* Now, actually do the two-way mapping */
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -329,7 +329,7 @@ int guest_remove_page(struct domain *d,
     }
     if ( p2mt == p2m_mmio_direct )
     {
-        rc = clear_mmio_p2m_entry(d, gmfn, mfn, PAGE_ORDER_4K);
+        rc = -EPERM;
         goto out_put_gfn;
     }
 #else
@@ -1721,6 +1721,15 @@ int check_get_page_from_gfn(struct domai
         return -EAGAIN;
     }
 #endif
+#ifdef CONFIG_X86
+    if ( p2mt == p2m_mmio_direct )
+    {
+        if ( page )
+            put_page(page);
+
+        return -EPERM;
+    }
+#endif
 
     if ( !page )
         return -EINVAL;
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -143,7 +143,8 @@ typedef unsigned int p2m_query_t;
 
 /* Types established/cleaned up via special accessors. */
 #define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \
-                           p2m_to_mask(p2m_map_foreign))
+                           p2m_to_mask(p2m_map_foreign) | \
+                           p2m_to_mask(p2m_mmio_direct))
 
 /* Valid types not necessarily associated with a (valid) MFN. */
 #define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES                  \
@@ -645,8 +646,6 @@ int set_foreign_p2m_entry(struct domain
 /* Set mmio addresses in the p2m table (for pass-through) */
 int set_mmio_p2m_entry(struct domain *d, gfn_t gfn, mfn_t mfn,
                        unsigned int order);
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
-                         unsigned int order);
 
 /* Set identity addresses in the p2m table (for pass-through) */
 int set_identity_p2m_entry(struct domain *d, unsigned long gfn,

[-- Attachment #47: xsa378/xsa378-4.15-1.patch --]
[-- Type: application/octet-stream, Size: 5080 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: correct global exclusion range extending

Besides unity mapping regions, the AMD IOMMU spec also provides for
exclusion ranges (areas of memory not to be subject to DMA translation)
to be specified by firmware in the ACPI tables. The spec does not put
any constraints on the number of such regions.

Blindly assuming all addresses between any two such ranges should also
be excluded can't be right. Since hardware has room for just a single
such range (comprised of the Exclusion Base Register and the Exclusion
Range Limit Register), combine only adjacent or overlapping regions (for
now; this may require further adjustment in case table entries aren't
sorted by address) with matching exclusion_allow_all settings. This
requires bubbling up error indicators, such that IOMMU init can be
failed when concatenation wasn't possible.

Furthermore, since the exclusion range specified in IOMMU registers
implies R/W access, reject requests asking for fewer permissions (this
will be brought closer to the spec by a subsequent change).

This is part of XSA-378 / CVE-2021-28695.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
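
The merge rule is easiest to see in isolation. This standalone sketch
(struct and function names invented; the PAGE_SIZE-granular adjacency
test is the same as in the patch) folds a new range into the single
programmable exclusion window only when doing so covers no unrelated
addresses:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

struct excl { bool enabled, allow_all; uint64_t base, limit; };

/* Returns false when [base, limit] can't be folded into the already
 * programmed window without swallowing unrelated addresses. */
static bool extend_exclusion(struct excl *e, uint64_t base, uint64_t limit,
                             bool all)
{
    if ( e->enabled )
    {
        /* Reject disjoint ranges and mismatched allow_all settings. */
        if ( e->limit + PAGE_SIZE < base || limit + PAGE_SIZE < e->base ||
             e->allow_all != all )
            return false;
        if ( e->base < base )
            base = e->base;
        if ( e->limit > limit )
            limit = e->limit;
    }

    e->enabled = true;
    e->allow_all = all;
    e->base = base;
    e->limit = limit;
    return true;
}

int main(void)
{
    struct excl e = { 0 };

    assert(extend_exclusion(&e, 0x1000, 0x1fff, true));     /* first range */
    assert(extend_exclusion(&e, 0x2000, 0x2fff, true));     /* adjacent: ok */
    assert(!extend_exclusion(&e, 0x10000, 0x10fff, true));  /* disjoint */
    return 0;
}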

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -116,12 +116,21 @@ static struct amd_iommu * __init find_io
     return NULL;
 }
 
-static void __init reserve_iommu_exclusion_range(
-    struct amd_iommu *iommu, uint64_t base, uint64_t limit)
+static int __init reserve_iommu_exclusion_range(
+    struct amd_iommu *iommu, uint64_t base, uint64_t limit,
+    bool all, bool iw, bool ir)
 {
+    if ( !ir || !iw )
+        return -EPERM;
+
     /* need to extend exclusion range? */
     if ( iommu->exclusion_enable )
     {
+        if ( iommu->exclusion_limit + PAGE_SIZE < base ||
+             limit + PAGE_SIZE < iommu->exclusion_base ||
+             iommu->exclusion_allow_all != all )
+            return -EBUSY;
+
         if ( iommu->exclusion_base < base )
             base = iommu->exclusion_base;
         if ( iommu->exclusion_limit > limit )
@@ -129,16 +138,11 @@ static void __init reserve_iommu_exclusi
     }
 
     iommu->exclusion_enable = IOMMU_CONTROL_ENABLED;
+    iommu->exclusion_allow_all = all;
     iommu->exclusion_base = base;
     iommu->exclusion_limit = limit;
-}
 
-static void __init reserve_iommu_exclusion_range_all(
-    struct amd_iommu *iommu,
-    unsigned long base, unsigned long limit)
-{
-    reserve_iommu_exclusion_range(iommu, base, limit);
-    iommu->exclusion_allow_all = IOMMU_CONTROL_ENABLED;
+    return 0;
 }
 
 static void __init reserve_unity_map_for_device(
@@ -176,6 +180,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     unsigned int bdf;
+    int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
@@ -197,10 +202,15 @@ static int __init register_exclusion_ran
     if ( limit >= iommu_top )
     {
         for_each_amd_iommu( iommu )
-            reserve_iommu_exclusion_range_all(iommu, base, limit);
+        {
+            rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                               true /* all */, iw, ir);
+            if ( rc )
+                break;
+        }
     }
 
-    return 0;
+    return rc;
 }
 
 static int __init register_exclusion_range_for_device(
@@ -211,6 +221,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     u16 req;
+    int rc = 0;
 
     iommu = find_iommu_for_device(seg, bdf);
     if ( !iommu )
@@ -240,12 +251,13 @@ static int __init register_exclusion_ran
     /* register IOMMU exclusion range settings for device */
     if ( limit >= iommu_top  )
     {
-        reserve_iommu_exclusion_range(iommu, base, limit);
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           false /* all */, iw, ir);
         ivrs_mappings[bdf].dte_allow_exclusion = true;
         ivrs_mappings[req].dte_allow_exclusion = true;
     }
 
-    return 0;
+    return rc;
 }
 
 static int __init register_exclusion_range_for_iommu_devices(
@@ -255,6 +267,7 @@ static int __init register_exclusion_ran
     unsigned long range_top, iommu_top, length;
     unsigned int bdf;
     u16 req;
+    int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
@@ -285,8 +298,10 @@ static int __init register_exclusion_ran
 
     /* register IOMMU exclusion range settings */
     if ( limit >= iommu_top )
-        reserve_iommu_exclusion_range_all(iommu, base, limit);
-    return 0;
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           true /* all */, iw, ir);
+
+    return rc;
 }
 
 static int __init parse_ivmd_device_select(

[-- Attachment #48: xsa378/xsa378-4.15-2.patch --]
[-- Type: application/octet-stream, Size: 8506 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: correct device unity map handling

Blindly assuming all addresses between any two such ranges, specified by
firmware in the ACPI tables, should also be unity-mapped can't be right.
Nor can it be correct to merge ranges with differing permissions. Track
ranges individually; don't merge at all, but check for overlaps instead.
This requires bubbling up error indicators, such that IOMMU init can be
failed when allocation of a new tracking struct wasn't possible, or an
overlap was detected.

On this occasion, also stop ignoring
amd_iommu_reserve_domain_unity_map()'s return value.

This is part of XSA-378 / CVE-2021-28695.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
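
The per-device overlap test is the standard check for half-open
intervals: two ranges collide exactly when each starts before the other
ends. A standalone illustration (names invented):

#include <assert.h>
#include <stdbool.h>

static bool ranges_overlap(unsigned long a, unsigned long alen,
                           unsigned long b, unsigned long blen)
{
    return a + alen > b && b + blen > a;
}

int main(void)
{
    assert(ranges_overlap(0x1000, 0x2000, 0x2000, 0x1000));  /* share a page */
    assert(!ranges_overlap(0x1000, 0x1000, 0x2000, 0x1000)); /* only adjacent */
    return 0;
}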

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -107,20 +107,24 @@ struct amd_iommu {
     struct list_head ats_devices;
 };
 
+struct ivrs_unity_map {
+    bool read:1;
+    bool write:1;
+    paddr_t addr;
+    unsigned long length;
+    struct ivrs_unity_map *next;
+};
+
 struct ivrs_mappings {
     uint16_t dte_requestor_id;
     bool valid:1;
     bool dte_allow_exclusion:1;
-    bool unity_map_enable:1;
-    bool write_permission:1;
-    bool read_permission:1;
 
     /* ivhd device data settings */
     uint8_t device_flags;
 
-    unsigned long addr_range_start;
-    unsigned long addr_range_length;
     struct amd_iommu *iommu;
+    struct ivrs_unity_map *unity_map;
 
     /* per device interrupt remapping table */
     void *intremap_table;
--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -145,32 +145,48 @@ static int __init reserve_iommu_exclusio
     return 0;
 }
 
-static void __init reserve_unity_map_for_device(
-    u16 seg, u16 bdf, unsigned long base,
-    unsigned long length, u8 iw, u8 ir)
+static int __init reserve_unity_map_for_device(
+    uint16_t seg, uint16_t bdf, unsigned long base,
+    unsigned long length, bool iw, bool ir)
 {
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
-    unsigned long old_top, new_top;
+    struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map;
 
-    /* need to extend unity-mapped range? */
-    if ( ivrs_mappings[bdf].unity_map_enable )
+    /* Check for overlaps. */
+    for ( ; unity_map; unity_map = unity_map->next )
     {
-        old_top = ivrs_mappings[bdf].addr_range_start +
-            ivrs_mappings[bdf].addr_range_length;
-        new_top = base + length;
-        if ( old_top > new_top )
-            new_top = old_top;
-        if ( ivrs_mappings[bdf].addr_range_start < base )
-            base = ivrs_mappings[bdf].addr_range_start;
-        length = new_top - base;
-    }
-
-    /* extend r/w permissioms and keep aggregate */
-    ivrs_mappings[bdf].write_permission = iw;
-    ivrs_mappings[bdf].read_permission = ir;
-    ivrs_mappings[bdf].unity_map_enable = true;
-    ivrs_mappings[bdf].addr_range_start = base;
-    ivrs_mappings[bdf].addr_range_length = length;
+        /*
+         * Exact matches are okay. This can in particular happen when
+         * register_exclusion_range_for_device() calls here twice for the
+         * same (s,b,d,f).
+         */
+        if ( base == unity_map->addr && length == unity_map->length &&
+             ir == unity_map->read && iw == unity_map->write )
+            return 0;
+
+        if ( unity_map->addr + unity_map->length > base &&
+             base + length > unity_map->addr )
+        {
+            AMD_IOMMU_DEBUG("IVMD Error: overlap [%lx,%lx) vs [%lx,%lx)\n",
+                            base, base + length, unity_map->addr,
+                            unity_map->addr + unity_map->length);
+            return -EPERM;
+        }
+    }
+
+    /* Populate and insert a new unity map. */
+    unity_map = xmalloc(struct ivrs_unity_map);
+    if ( !unity_map )
+        return -ENOMEM;
+
+    unity_map->read = ir;
+    unity_map->write = iw;
+    unity_map->addr = base;
+    unity_map->length = length;
+    unity_map->next = ivrs_mappings[bdf].unity_map;
+    ivrs_mappings[bdf].unity_map = unity_map;
+
+    return 0;
 }
 
 static int __init register_exclusion_range_for_all_devices(
@@ -193,13 +209,13 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve r/w unity-mapped page entries for devices */
         /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
-            reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
+        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
         /* push 'base' just outside of virtual address space */
         base = iommu_top;
     }
     /* register IOMMU exclusion range settings */
-    if ( limit >= iommu_top )
+    if ( !rc && limit >= iommu_top )
     {
         for_each_amd_iommu( iommu )
         {
@@ -241,15 +257,15 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve unity-mapped page entries for device */
         /* note: these entries are part of the exclusion range */
-        reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
-        reserve_unity_map_for_device(seg, req, base, length, iw, ir);
+        rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?:
+             reserve_unity_map_for_device(seg, req, base, length, iw, ir);
 
         /* push 'base' just outside of virtual address space */
         base = iommu_top;
     }
 
     /* register IOMMU exclusion range settings for device */
-    if ( limit >= iommu_top  )
+    if ( !rc && limit >= iommu_top  )
     {
         rc = reserve_iommu_exclusion_range(iommu, base, limit,
                                            false /* all */, iw, ir);
@@ -280,15 +296,15 @@ static int __init register_exclusion_ran
         length = range_top - base;
         /* reserve r/w unity-mapped page entries for devices */
         /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
+        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
         {
             if ( iommu == find_iommu_for_device(iommu->seg, bdf) )
             {
-                reserve_unity_map_for_device(iommu->seg, bdf, base, length,
-                                             iw, ir);
                 req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
-                reserve_unity_map_for_device(iommu->seg, req, base, length,
-                                             iw, ir);
+                rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
+                                                  iw, ir) ?:
+                     reserve_unity_map_for_device(iommu->seg, req, base, length,
+                                                  iw, ir);
             }
         }
 
@@ -297,7 +313,7 @@ static int __init register_exclusion_ran
     }
 
     /* register IOMMU exclusion range settings */
-    if ( limit >= iommu_top )
+    if ( !rc && limit >= iommu_top )
         rc = reserve_iommu_exclusion_range(iommu, base, limit,
                                            true /* all */, iw, ir);
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -367,15 +367,17 @@ static int amd_iommu_assign_device(struc
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int bdf = PCI_BDF2(pdev->bus, devfn);
     int req_id = get_dma_requestor_id(pdev->seg, bdf);
+    const struct ivrs_unity_map *unity_map;
 
-    if ( ivrs_mappings[req_id].unity_map_enable )
+    for ( unity_map = ivrs_mappings[req_id].unity_map; unity_map;
+          unity_map = unity_map->next )
     {
-        amd_iommu_reserve_domain_unity_map(
-            d,
-            ivrs_mappings[req_id].addr_range_start,
-            ivrs_mappings[req_id].addr_range_length,
-            ivrs_mappings[req_id].write_permission,
-            ivrs_mappings[req_id].read_permission);
+        int rc = amd_iommu_reserve_domain_unity_map(
+                     d, unity_map->addr, unity_map->length,
+                     unity_map->write, unity_map->read);
+
+        if ( rc )
+            return rc;
     }
 
     return reassign_device(pdev->domain, d, devfn, pdev);

[-- Attachment #49: xsa378/xsa378-4.15-3.patch --]
[-- Type: application/octet-stream, Size: 4180 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU: also pass p2m_access_t to p2m_get_iommu_flags()

A subsequent change will want to customize the IOMMU permissions based
on this.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -681,7 +681,7 @@ ept_set_entry(struct p2m_domain *p2m, gf
     uint8_t ipat = 0;
     bool_t need_modify_vtd_table = 1;
     bool_t vtd_pte_present = 0;
-    unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, mfn);
+    unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, p2ma, mfn);
     bool_t needs_sync = 1;
     ept_entry_t old_entry = { .epte = 0 };
     ept_entry_t new_entry = { .epte = 0 };
@@ -809,8 +809,8 @@ ept_set_entry(struct p2m_domain *p2m, gf
 
         /* Safe to read-then-write because we hold the p2m lock */
         if ( ept_entry->mfn == new_entry.mfn &&
-             p2m_get_iommu_flags(ept_entry->sa_p2mt, _mfn(ept_entry->mfn)) ==
-             iommu_flags )
+             p2m_get_iommu_flags(ept_entry->sa_p2mt, ept_entry->access,
+                                 _mfn(ept_entry->mfn)) == iommu_flags )
             need_modify_vtd_table = 0;
 
         ept_p2m_type_to_flags(p2m, &new_entry);
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -545,6 +545,16 @@ int p2m_pt_handle_deferred_changes(uint6
     return rc;
 }
 
+/* Reconstruct a fake p2m_access_t from stored PTE flags. */
+static p2m_access_t p2m_flags_to_access(unsigned int flags)
+{
+    if ( !(flags & _PAGE_PRESENT) )
+        return p2m_access_n;
+
+    /* No need to look at _PAGE_NX for now. */
+    return flags & _PAGE_RW ? p2m_access_rw : p2m_access_r;
+}
+
 /* Checks only applicable to entries with order > PAGE_ORDER_4K */
 static void check_entry(mfn_t mfn, p2m_type_t new, p2m_type_t old,
                         unsigned int order)
@@ -579,7 +589,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
     l2_pgentry_t l2e_content;
     l3_pgentry_t l3e_content;
     int rc;
-    unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, mfn);
+    unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, p2ma, mfn);
     /*
      * old_mfn and iommu_old_flags control possible flush/update needs on the
      * IOMMU: We need to flush when MFN or flags (i.e. permissions) change.
@@ -642,6 +652,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                 old_mfn = l1e_get_pfn(*p2m_entry);
                 iommu_old_flags =
                     p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                        p2m_flags_to_access(flags),
                                         _mfn(old_mfn));
             }
             else
@@ -684,9 +695,10 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                                    0, L1_PAGETABLE_ENTRIES);
         ASSERT(p2m_entry);
         old_mfn = l1e_get_pfn(*p2m_entry);
+        flags = l1e_get_flags(*p2m_entry);
         iommu_old_flags =
-            p2m_get_iommu_flags(p2m_flags_to_type(l1e_get_flags(*p2m_entry)),
-                                _mfn(old_mfn));
+            p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                p2m_flags_to_access(flags), _mfn(old_mfn));
 
         if ( mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt) )
             entry_content = p2m_l1e_from_pfn(mfn_x(mfn),
@@ -714,6 +726,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
                 old_mfn = l1e_get_pfn(*p2m_entry);
                 iommu_old_flags =
                     p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                        p2m_flags_to_access(flags),
                                         _mfn(old_mfn));
             }
             else
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -915,7 +915,8 @@ static inline void p2m_altp2m_check(stru
 /*
  * p2m type to IOMMU flags
  */
-static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
+static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt,
+                                               p2m_access_t p2ma, mfn_t mfn)
 {
     unsigned int flags;
 

[-- Attachment #50: xsa378/xsa378-4.15-4.patch --]
[-- Type: application/octet-stream, Size: 12548 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU: generalize VT-d's tracking of mapped RMRR regions

In order to re-use it elsewhere, move the logic to vendor independent
code and strip it of RMRR specifics.

Note that the prior "map" parameter gets folded into the new "p2ma" one
(which AMD IOMMU code will want to make use of), assigning alternative
meaning ("unmap") to p2m_access_x. Prepare set_identity_p2m_entry() and
p2m_get_iommu_flags() for getting passed access types other than
p2m_access_rw (in the latter case just for p2m_mmio_direct requests).

Note also that, to be on the safe side, an overlap check gets added to
the main loop of iommu_identity_mapping().

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
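
Stripped of p2m details, the refcounted tracking fits in a few lines.
A standalone sketch (simplified list handling and invented names; the
error codes mirror the patch: -EADDRINUSE for a partial overlap, -ENOENT
for unmapping an untracked region):

#include <errno.h>
#include <stdlib.h>

struct identity_map {
    struct identity_map *next;
    unsigned long base, end;            /* inclusive range */
    unsigned int count;
};

static int identity_map(struct identity_map **head,
                        unsigned long base, unsigned long end)
{
    struct identity_map *m;

    for ( m = *head; m; m = m->next )
    {
        if ( m->base == base && m->end == end )
        {
            ++m->count;                 /* already mapped: just refcount */
            return 0;
        }
        if ( end >= m->base && m->end >= base )
            return -EADDRINUSE;         /* partial overlap: refuse */
    }

    if ( (m = malloc(sizeof(*m))) == NULL )
        return -ENOMEM;
    m->base = base;
    m->end = end;
    m->count = 1;
    m->next = *head;
    *head = m;
    /* ... establish the p2m identity entries here ... */
    return 0;
}

static int identity_unmap(struct identity_map **head,
                          unsigned long base, unsigned long end)
{
    struct identity_map **pm, *m;

    for ( pm = head; (m = *pm) != NULL; pm = &m->next )
        if ( m->base == base && m->end == end )
        {
            if ( --m->count )
                return 0;               /* still referenced elsewhere */
            /* ... clear the p2m identity entries here ... */
            *pm = m->next;
            free(m);
            return 0;
        }

    return -ENOENT;
}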

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1365,7 +1365,7 @@ int set_identity_p2m_entry(struct domain
             return 0;
         return iommu_legacy_map(d, _dfn(gfn_l), _mfn(gfn_l),
                                 1ul << PAGE_ORDER_4K,
-                                IOMMUF_readable | IOMMUF_writable);
+                                p2m_access_to_iommu_flags(p2ma));
     }
 
     gfn_lock(p2m, gfn, 0);
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -42,12 +42,6 @@
 #include "vtd.h"
 #include "../ats.h"
 
-struct mapped_rmrr {
-    struct list_head list;
-    u64 base, end;
-    unsigned int count;
-};
-
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
 
@@ -1311,7 +1305,6 @@ static int intel_iommu_domain_init(struc
     struct domain_iommu *hd = dom_iommu(d);
 
     hd->arch.vtd.agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
-    INIT_LIST_HEAD(&hd->arch.vtd.mapped_rmrrs);
 
     return 0;
 }
@@ -1788,17 +1781,12 @@ static void iommu_clear_root_pgtable(str
 static void iommu_domain_teardown(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    struct mapped_rmrr *mrmrr, *tmp;
     const struct acpi_drhd_unit *drhd;
 
     if ( list_empty(&acpi_drhd_units) )
         return;
 
-    list_for_each_entry_safe ( mrmrr, tmp, &hd->arch.vtd.mapped_rmrrs, list )
-    {
-        list_del(&mrmrr->list);
-        xfree(mrmrr);
-    }
+    iommu_identity_map_teardown(d);
 
     ASSERT(!hd->arch.vtd.pgd_maddr);
 
@@ -1946,74 +1934,6 @@ static int __init vtd_ept_page_compatibl
            (ept_has_1gb(ept_cap) && opt_hap_1gb) <= cap_sps_1gb(vtd_cap);
 }
 
-static int rmrr_identity_mapping(struct domain *d, bool_t map,
-                                 const struct acpi_rmrr_unit *rmrr,
-                                 u32 flag)
-{
-    unsigned long base_pfn = rmrr->base_address >> PAGE_SHIFT_4K;
-    unsigned long end_pfn = PAGE_ALIGN_4K(rmrr->end_address) >> PAGE_SHIFT_4K;
-    struct mapped_rmrr *mrmrr;
-    struct domain_iommu *hd = dom_iommu(d);
-
-    ASSERT(pcidevs_locked());
-    ASSERT(rmrr->base_address < rmrr->end_address);
-
-    /*
-     * No need to acquire hd->arch.mapping_lock: Both insertion and removal
-     * get done while holding pcidevs_lock.
-     */
-    list_for_each_entry( mrmrr, &hd->arch.vtd.mapped_rmrrs, list )
-    {
-        if ( mrmrr->base == rmrr->base_address &&
-             mrmrr->end == rmrr->end_address )
-        {
-            int ret = 0;
-
-            if ( map )
-            {
-                ++mrmrr->count;
-                return 0;
-            }
-
-            if ( --mrmrr->count )
-                return 0;
-
-            while ( base_pfn < end_pfn )
-            {
-                if ( clear_identity_p2m_entry(d, base_pfn) )
-                    ret = -ENXIO;
-                base_pfn++;
-            }
-
-            list_del(&mrmrr->list);
-            xfree(mrmrr);
-            return ret;
-        }
-    }
-
-    if ( !map )
-        return -ENOENT;
-
-    while ( base_pfn < end_pfn )
-    {
-        int err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw, flag);
-
-        if ( err )
-            return err;
-        base_pfn++;
-    }
-
-    mrmrr = xmalloc(struct mapped_rmrr);
-    if ( !mrmrr )
-        return -ENOMEM;
-    mrmrr->base = rmrr->base_address;
-    mrmrr->end = rmrr->end_address;
-    mrmrr->count = 1;
-    list_add_tail(&mrmrr->list, &hd->arch.vtd.mapped_rmrrs);
-
-    return 0;
-}
-
 static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
 {
     struct acpi_rmrr_unit *rmrr;
@@ -2045,7 +1965,9 @@ static int intel_iommu_add_device(u8 dev
              * Since RMRRs are always reserved in the e820 map for the hardware
              * domain, there shouldn't be a conflict.
              */
-            ret = rmrr_identity_mapping(pdev->domain, 1, rmrr, 0);
+            ret = iommu_identity_mapping(pdev->domain, p2m_access_rw,
+                                         rmrr->base_address, rmrr->end_address,
+                                         0);
             if ( ret )
                 dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
                         pdev->domain->domain_id);
@@ -2090,7 +2012,8 @@ static int intel_iommu_remove_device(u8
          * Any flag is nothing to clear these mappings but here
          * its always safe and strict to set 0.
          */
-        rmrr_identity_mapping(pdev->domain, 0, rmrr, 0);
+        iommu_identity_mapping(pdev->domain, p2m_access_x, rmrr->base_address,
+                               rmrr->end_address, 0);
     }
 
     return domain_context_unmap(pdev->domain, devfn, pdev);
@@ -2289,7 +2212,8 @@ static void __hwdom_init setup_hwdom_rmr
          * domain, there shouldn't be a conflict. So its always safe and
          * strict to set 0.
          */
-        ret = rmrr_identity_mapping(d, 1, rmrr, 0);
+        ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address,
+                                     rmrr->end_address, 0);
         if ( ret )
             dprintk(XENLOG_ERR VTDPREFIX,
                      "IOMMU: mapping reserved region failed\n");
@@ -2460,7 +2384,9 @@ static int reassign_device_ownership(
                  * Any RMRR flag is always ignored when remove a device,
                  * but its always safe and strict to set 0.
                  */
-                ret = rmrr_identity_mapping(source, 0, rmrr, 0);
+                ret = iommu_identity_mapping(source, p2m_access_x,
+                                             rmrr->base_address,
+                                             rmrr->end_address, 0);
                 if ( ret != -ENOENT )
                     return ret;
             }
@@ -2556,7 +2482,8 @@ static int intel_iommu_assign_device(
              PCI_BUS(bdf) == bus &&
              PCI_DEVFN2(bdf) == devfn )
         {
-            ret = rmrr_identity_mapping(d, 1, rmrr, flag);
+            ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address,
+                                         rmrr->end_address, flag);
             if ( ret )
             {
                 int rc;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -143,6 +143,7 @@ int arch_iommu_domain_init(struct domain
 
     INIT_PAGE_LIST_HEAD(&hd->arch.pgtables.list);
     spin_lock_init(&hd->arch.pgtables.lock);
+    INIT_LIST_HEAD(&hd->arch.identity_maps);
 
     return 0;
 }
@@ -158,6 +159,99 @@ void arch_iommu_domain_destroy(struct do
            page_list_empty(&dom_iommu(d)->arch.pgtables.list));
 }
 
+struct identity_map {
+    struct list_head list;
+    paddr_t base, end;
+    p2m_access_t access;
+    unsigned int count;
+};
+
+int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
+                           paddr_t base, paddr_t end,
+                           unsigned int flag)
+{
+    unsigned long base_pfn = base >> PAGE_SHIFT_4K;
+    unsigned long end_pfn = PAGE_ALIGN_4K(end) >> PAGE_SHIFT_4K;
+    struct identity_map *map;
+    struct domain_iommu *hd = dom_iommu(d);
+
+    ASSERT(pcidevs_locked());
+    ASSERT(base < end);
+
+    /*
+     * No need to acquire hd->arch.mapping_lock: Both insertion and removal
+     * get done while holding pcidevs_lock.
+     */
+    list_for_each_entry( map, &hd->arch.identity_maps, list )
+    {
+        if ( map->base == base && map->end == end )
+        {
+            int ret = 0;
+
+            if ( p2ma != p2m_access_x )
+            {
+                if ( map->access != p2ma )
+                    return -EADDRINUSE;
+                ++map->count;
+                return 0;
+            }
+
+            if ( --map->count )
+                return 0;
+
+            while ( base_pfn < end_pfn )
+            {
+                if ( clear_identity_p2m_entry(d, base_pfn) )
+                    ret = -ENXIO;
+                base_pfn++;
+            }
+
+            list_del(&map->list);
+            xfree(map);
+
+            return ret;
+        }
+
+        if ( end >= map->base && map->end >= base )
+            return -EADDRINUSE;
+    }
+
+    if ( p2ma == p2m_access_x )
+        return -ENOENT;
+
+    while ( base_pfn < end_pfn )
+    {
+        int err = set_identity_p2m_entry(d, base_pfn, p2ma, flag);
+
+        if ( err )
+            return err;
+        base_pfn++;
+    }
+
+    map = xmalloc(struct identity_map);
+    if ( !map )
+        return -ENOMEM;
+    map->base = base;
+    map->end = end;
+    map->access = p2ma;
+    map->count = 1;
+    list_add_tail(&map->list, &hd->arch.identity_maps);
+
+    return 0;
+}
+
+void iommu_identity_map_teardown(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    struct identity_map *map, *tmp;
+
+    list_for_each_entry_safe ( map, tmp, &hd->arch.identity_maps, list )
+    {
+        list_del(&map->list);
+        xfree(map);
+    }
+}
+
 static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
                                          unsigned long pfn,
                                          unsigned long max_pfn)
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -16,6 +16,7 @@
 
 #include <xen/errno.h>
 #include <xen/list.h>
+#include <xen/mem_access.h>
 #include <xen/spinlock.h>
 #include <asm/apicdef.h>
 #include <asm/processor.h>
@@ -51,13 +52,14 @@ struct arch_iommu
         spinlock_t lock;
     } pgtables;
 
+    struct list_head identity_maps;
+
     union {
         /* Intel VT-d */
         struct {
             uint64_t pgd_maddr; /* io page directory machine address */
             unsigned int agaw; /* adjusted guest address width, 0 is level 2 30-bit */
             uint64_t iommu_bitmap; /* bitmap of iommu(s) that the domain uses */
-            struct list_head mapped_rmrrs;
         } vtd;
         /* AMD IOMMU */
         struct {
@@ -123,6 +125,11 @@ static inline void iommu_disable_x2apic(
         iommu_ops.disable_x2apic();
 }
 
+int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
+                           paddr_t base, paddr_t end,
+                           unsigned int flag);
+void iommu_identity_map_teardown(struct domain *d);
+
 extern bool untrusted_msi;
 
 int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -912,6 +912,34 @@ struct p2m_domain *p2m_get_altp2m(struct
 static inline void p2m_altp2m_check(struct vcpu *v, uint16_t idx) {}
 #endif
 
+/* p2m access to IOMMU flags */
+static inline unsigned int p2m_access_to_iommu_flags(p2m_access_t p2ma)
+{
+    switch ( p2ma )
+    {
+    case p2m_access_rw:
+    case p2m_access_rwx:
+        return IOMMUF_readable | IOMMUF_writable;
+
+    case p2m_access_r:
+    case p2m_access_rx:
+    case p2m_access_rx2rw:
+        return IOMMUF_readable;
+
+    case p2m_access_w:
+    case p2m_access_wx:
+        return IOMMUF_writable;
+
+    case p2m_access_n:
+    case p2m_access_x:
+    case p2m_access_n2rwx:
+        return 0;
+    }
+
+    ASSERT_UNREACHABLE();
+    return 0;
+}
+
 /*
  * p2m type to IOMMU flags
  */
@@ -933,9 +961,10 @@ static inline unsigned int p2m_get_iommu
         flags = IOMMUF_readable;
         break;
     case p2m_mmio_direct:
-        flags = IOMMUF_readable;
-        if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
-            flags |= IOMMUF_writable;
+        flags = p2m_access_to_iommu_flags(p2ma);
+        if ( (flags & IOMMUF_writable) &&
+             rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
+            flags &= ~IOMMUF_writable;
         break;
     default:
         flags = 0;

[-- Attachment #51: xsa378/xsa378-4.15-5.patch --]
[-- Type: application/octet-stream, Size: 7371 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-arrange/complete re-assignment handling

Until the assignment step has completed successfully, devices should
not be associated with their new owner. Hand the device to DomIO
(perhaps temporarily), until after the de-assignment step has completed.

De-assignment of a device (from other than Dom0) as well as failure of
reassign_device() during assignment should result in unity mappings
getting torn down. This in turn requires switching to a refcounted
mapping approach, as was already used by VT-d for its RMRRs, to prevent
unmapping a region used by multiple devices.

This is CVE-2021-28696 / part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
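
The ownership handling amounts to a two-phase handover. A simplified
standalone sketch (types and helper names invented; error paths reduced
to return codes): the device is parked on the neutral dom_io before the
old owner's state is torn down, and only moves to the target once every
fallible step has succeeded, so a failed assignment leaves it parked
rather than half-moved:

#include <stdio.h>

struct domain { const char *name; };
struct pci_dev { struct domain *owner; };

static struct domain dom_io = { "dom_io" };

static int unmap_source_unity(struct domain *d) { (void)d; return 0; }
static int setup_target(struct domain *d)       { (void)d; return 0; }

static int reassign(struct pci_dev *pdev, struct domain *target)
{
    struct domain *source = pdev->owner;
    int rc;

    pdev->owner = &dom_io;             /* park: neither domain owns it */

    rc = unmap_source_unity(source);   /* may fail */
    if ( !rc )
        rc = setup_target(target);     /* may fail */
    if ( rc )
        return rc;                     /* device stays parked on dom_io */

    pdev->owner = target;              /* commit only on full success */
    return 0;
}

int main(void)
{
    struct domain d0 = { "d0" }, d1 = { "d1" };
    struct pci_dev dev = { &d0 };

    printf("rc=%d owner=%s\n", reassign(&dev, &d1), dev.owner->name);
    return 0;
}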

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -232,8 +232,10 @@ int __must_check amd_iommu_unmap_page(st
                                       unsigned int *flush_flags);
 int __must_check amd_iommu_alloc_root(struct domain *d);
 int amd_iommu_reserve_domain_unity_map(struct domain *domain,
-                                       paddr_t phys_addr, unsigned long size,
-                                       int iw, int ir);
+                                       const struct ivrs_unity_map *map,
+                                       unsigned int flag);
+int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
+                                         const struct ivrs_unity_map *map);
 int __must_check amd_iommu_flush_iotlb_pages(struct domain *d, dfn_t dfn,
                                              unsigned long page_count,
                                              unsigned int flush_flags);
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -419,38 +419,49 @@ int amd_iommu_flush_iotlb_all(struct dom
     return 0;
 }
 
-int amd_iommu_reserve_domain_unity_map(struct domain *domain,
-                                       paddr_t phys_addr,
-                                       unsigned long size, int iw, int ir)
+int amd_iommu_reserve_domain_unity_map(struct domain *d,
+                                       const struct ivrs_unity_map *map,
+                                       unsigned int flag)
 {
-    unsigned long npages, i;
-    unsigned long gfn;
-    unsigned int flags = !!ir;
-    unsigned int flush_flags = 0;
-    int rt = 0;
-
-    if ( iw )
-        flags |= IOMMUF_writable;
-
-    npages = region_to_pages(phys_addr, size);
-    gfn = phys_addr >> PAGE_SHIFT;
-    for ( i = 0; i < npages; i++ )
+    int rc;
+
+    if ( d == dom_io )
+        return 0;
+
+    for ( rc = 0; !rc && map; map = map->next )
     {
-        unsigned long frame = gfn + i;
+        p2m_access_t p2ma = p2m_access_n;
+
+        if ( map->read )
+            p2ma |= p2m_access_r;
+        if ( map->write )
+            p2ma |= p2m_access_w;
 
-        rt = amd_iommu_map_page(domain, _dfn(frame), _mfn(frame), flags,
-                                &flush_flags);
-        if ( rt != 0 )
-            break;
+        rc = iommu_identity_mapping(d, p2ma, map->addr,
+                                    map->addr + map->length - 1, flag);
     }
 
-    /* Use while-break to avoid compiler warning */
-    while ( flush_flags &&
-            amd_iommu_flush_iotlb_pages(domain, _dfn(gfn),
-                                        npages, flush_flags) )
-        break;
+    return rc;
+}
+
+int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
+                                         const struct ivrs_unity_map *map)
+{
+    int rc;
+
+    if ( d == dom_io )
+        return 0;
+
+    for ( rc = 0; map; map = map->next )
+    {
+        int ret = iommu_identity_mapping(d, p2m_access_x, map->addr,
+                                         map->addr + map->length - 1, 0);
+
+        if ( ret && ret != -ENOENT && !rc )
+            rc = ret;
+    }
 
-    return rt;
+    return rc;
 }
 
 int __init amd_iommu_quarantine_init(struct domain *d)
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -329,6 +329,7 @@ static int reassign_device(struct domain
 {
     struct amd_iommu *iommu;
     int bdf, rc;
+    const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
 
     bdf = PCI_BDF2(pdev->bus, pdev->devfn);
     iommu = find_iommu_for_device(pdev->seg, bdf);
@@ -343,10 +344,24 @@ static int reassign_device(struct domain
 
     amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
 
-    if ( devfn == pdev->devfn )
+    /*
+     * If the device belongs to the hardware domain, and it has a unity mapping,
+     * don't remove it from the hardware domain, because BIOS may reference that
+     * mapping.
+     */
+    if ( !is_hardware_domain(source) )
     {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
+        rc = amd_iommu_reserve_domain_unity_unmap(
+                 source,
+                 ivrs_mappings[get_dma_requestor_id(pdev->seg, bdf)].unity_map);
+        if ( rc )
+            return rc;
+    }
+
+    if ( devfn == pdev->devfn && pdev->domain != dom_io )
+    {
+        list_move(&pdev->domain_list, &dom_io->pdev_list);
+        pdev->domain = dom_io;
     }
 
     rc = allocate_domain_resources(target);
@@ -357,6 +372,12 @@ static int reassign_device(struct domain
     AMD_IOMMU_DEBUG("Re-assign %pp from dom%d to dom%d\n",
                     &pdev->sbdf, source->domain_id, target->domain_id);
 
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
+
     return 0;
 }
 
@@ -367,20 +388,28 @@ static int amd_iommu_assign_device(struc
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int bdf = PCI_BDF2(pdev->bus, devfn);
     int req_id = get_dma_requestor_id(pdev->seg, bdf);
-    const struct ivrs_unity_map *unity_map;
+    int rc = amd_iommu_reserve_domain_unity_map(
+                 d, ivrs_mappings[req_id].unity_map, flag);
 
-    for ( unity_map = ivrs_mappings[req_id].unity_map; unity_map;
-          unity_map = unity_map->next )
+    if ( !rc )
+        rc = reassign_device(pdev->domain, d, devfn, pdev);
+
+    if ( rc && !is_hardware_domain(d) )
     {
-        int rc = amd_iommu_reserve_domain_unity_map(
-                     d, unity_map->addr, unity_map->length,
-                     unity_map->write, unity_map->read);
+        int ret = amd_iommu_reserve_domain_unity_unmap(
+                      d, ivrs_mappings[req_id].unity_map);
 
-        if ( rc )
-            return rc;
+        if ( ret )
+        {
+            printk(XENLOG_ERR "AMD-Vi: "
+                   "unity-unmap for %pd/%04x:%02x:%02x.%u failed (%d)\n",
+                   d, pdev->seg, pdev->bus,
+                   PCI_SLOT(devfn), PCI_FUNC(devfn), ret);
+            domain_crash(d);
+        }
     }
 
-    return reassign_device(pdev->domain, d, devfn, pdev);
+    return rc;
 }
 
 static void amd_iommu_clear_root_pgtable(struct domain *d)
@@ -394,6 +423,7 @@ static void amd_iommu_clear_root_pgtable
 
 static void amd_iommu_domain_destroy(struct domain *d)
 {
+    iommu_identity_map_teardown(d);
     ASSERT(!dom_iommu(d)->arch.amd.root_table);
 }
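
The refcounted mapping approach the description above refers to can be
reduced to a self-contained C sketch.  All names here are illustrative
stand-ins; the real implementation is iommu_identity_mapping() in
xsa378-4.patch further down in this mail.

    #include <stdlib.h>

    typedef unsigned long long paddr_t;   /* stand-in for Xen's paddr_t */

    struct ident_map {
        struct ident_map *next;
        paddr_t base, end;
        unsigned int count;        /* number of devices sharing the range */
    };

    /* Take a reference on an already-tracked range, or record a new one. */
    static int ident_map_get(struct ident_map **head, paddr_t base, paddr_t end)
    {
        struct ident_map *m;

        for ( m = *head; m; m = m->next )
            if ( m->base == base && m->end == end )
            {
                ++m->count;
                return 0;
            }

        /* A real implementation would establish the IOMMU mappings here. */
        if ( !(m = malloc(sizeof(*m))) )
            return -1;
        m->base = base;
        m->end = end;
        m->count = 1;
        m->next = *head;
        *head = m;

        return 0;
    }

    /* Drop a reference; only the last user tears the range down. */
    static void ident_map_put(struct ident_map **head, paddr_t base, paddr_t end)
    {
        struct ident_map **pm, *m;

        for ( pm = head; (m = *pm) != NULL; pm = &m->next )
            if ( m->base == base && m->end == end )
            {
                if ( --m->count )
                    return;
                /* Last user gone: unmap the range, then drop the tracker. */
                *pm = m->next;
                free(m);
                return;
            }
    }

    int main(void)
    {
        struct ident_map *maps = NULL;

        ident_map_get(&maps, 0x1000, 0x1fff);   /* first device: map */
        ident_map_get(&maps, 0x1000, 0x1fff);   /* second device: count=2 */
        ident_map_put(&maps, 0x1000, 0x1fff);   /* range stays mapped */
        ident_map_put(&maps, 0x1000, 0x1fff);   /* last user: teardown */

        return maps != NULL;   /* 0: the list is empty again */
    }

This is why de-assigning one of several devices sharing a unity-mapped
region must not unmap it: the earlier puts merely decrement the count.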
 

[-- Attachment #52: xsa378/xsa378-4.15-6.patch --]
[-- Type: application/octet-stream, Size: 15584 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-arrange exclusion range and unity map recording

The spec makes no provisions for OS behavior here to depend on the
amount of RAM found on the system. While the spec may not sufficiently
clearly distinguish both kinds of regions, they are surely meant to be
separate things: Only regions with ACPI_IVMD_EXCLUSION_RANGE set should
be candidates for putting in the exclusion range registers. (As there's
only a single such pair of registers per IOMMU, secondary non-adjacent
regions with the flag set already get converted to unity mapped
regions.)

First of all, drop the dependency on max_page. With commit b4f042236ae0
("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables") the
use of it here was stale anyway; it was bogus already before, as it
didn't account for max_page getting increased later on. Simply try an
exclusion range registration first, and if it fails (for being
unsuitable or non-mergeable), register a unity mapping range.

With this, various local variables become unnecessary and hence get
dropped at the same time.

With the max_page boundary dropped for using unity maps, the minimum
page table tree height now needs both recording and enforcing in
amd_iommu_domain_init(). Since we can't predict which devices may get
assigned to a domain, our only option is to uniformly force at least
that height for all domains, now that the height isn't dynamic anymore.

Further don't make use of the exclusion range unless ACPI data says so.

Note that exclusion range registration in
register_range_for_all_devices() is on a best-effort basis. Hence unity
map entries also registered are redundant when the former succeeded, but
they also do no harm. Improvements in this area can be made later, in my
opinion.

Also adjust types where suitable without touching extra lines.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -304,6 +304,8 @@ extern struct hpet_sbdf {
     } init;
 } hpet_sbdf;
 
+extern int amd_iommu_min_paging_mode;
+
 extern void *shared_intremap_table;
 extern unsigned long *shared_intremap_inuse;
 
--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -117,12 +117,8 @@ static struct amd_iommu * __init find_io
 }
 
 static int __init reserve_iommu_exclusion_range(
-    struct amd_iommu *iommu, uint64_t base, uint64_t limit,
-    bool all, bool iw, bool ir)
+    struct amd_iommu *iommu, paddr_t base, paddr_t limit, bool all)
 {
-    if ( !ir || !iw )
-        return -EPERM;
-
     /* need to extend exclusion range? */
     if ( iommu->exclusion_enable )
     {
@@ -151,14 +147,18 @@ static int __init reserve_unity_map_for_
 {
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
     struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map;
+    int paging_mode = amd_iommu_get_paging_mode(PFN_UP(base + length));
+
+    if ( paging_mode < 0 )
+        return paging_mode;
 
     /* Check for overlaps. */
     for ( ; unity_map; unity_map = unity_map->next )
     {
         /*
          * Exact matches are okay. This can in particular happen when
-         * register_exclusion_range_for_device() calls here twice for the
-         * same (s,b,d,f).
+         * register_range_for_device() calls here twice for the same
+         * (s,b,d,f).
          */
         if ( base == unity_map->addr && length == unity_map->length &&
              ir == unity_map->read && iw == unity_map->write )
@@ -186,55 +186,52 @@ static int __init reserve_unity_map_for_
     unity_map->next = ivrs_mappings[bdf].unity_map;
     ivrs_mappings[bdf].unity_map = unity_map;
 
+    if ( paging_mode > amd_iommu_min_paging_mode )
+        amd_iommu_min_paging_mode = paging_mode;
+
     return 0;
 }
 
-static int __init register_exclusion_range_for_all_devices(
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_all_devices(
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
-    unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
-    unsigned int bdf;
     int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
-        /* reserve r/w unity-mapped page entries for devices */
-        /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
-            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
-    }
-    /* register IOMMU exclusion range settings */
-    if ( !rc && limit >= iommu_top )
+    if ( exclusion )
     {
         for_each_amd_iommu( iommu )
         {
-            rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                               true /* all */, iw, ir);
-            if ( rc )
-                break;
+            int ret = reserve_iommu_exclusion_range(iommu, base, limit,
+                                                    true /* all */);
+
+            if ( ret && !rc )
+                rc = ret;
         }
     }
 
+    if ( !exclusion || rc )
+    {
+        paddr_t length = limit + PAGE_SIZE - base;
+        unsigned int bdf;
+
+        /* reserve r/w unity-mapped page entries for devices */
+        for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
+    }
+
     return rc;
 }
 
-static int __init register_exclusion_range_for_device(
-    u16 bdf, unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_device(
+    unsigned int bdf, paddr_t base, paddr_t limit,
+    bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
-    unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     u16 req;
     int rc = 0;
@@ -248,27 +245,19 @@ static int __init register_exclusion_ran
     req = ivrs_mappings[bdf].dte_requestor_id;
 
     /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
+    if ( exclusion )
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           false /* all */);
+    if ( !exclusion || rc )
+    {
+        paddr_t length = limit + PAGE_SIZE - base;
+
         /* reserve unity-mapped page entries for device */
-        /* note: these entries are part of the exclusion range */
         rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?:
              reserve_unity_map_for_device(seg, req, base, length, iw, ir);
-
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
     }
-
-    /* register IOMMU exclusion range settings for device */
-    if ( !rc && limit >= iommu_top  )
+    else
     {
-        rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                           false /* all */, iw, ir);
         ivrs_mappings[bdf].dte_allow_exclusion = true;
         ivrs_mappings[req].dte_allow_exclusion = true;
     }
@@ -276,53 +265,42 @@ static int __init register_exclusion_ran
     return rc;
 }
 
-static int __init register_exclusion_range_for_iommu_devices(
-    struct amd_iommu *iommu,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_iommu_devices(
+    struct amd_iommu *iommu, paddr_t base, paddr_t limit,
+    bool iw, bool ir, bool exclusion)
 {
-    unsigned long range_top, iommu_top, length;
+    /* note: 'limit' parameter is assumed to be page-aligned */
+    paddr_t length = limit + PAGE_SIZE - base;
     unsigned int bdf;
     u16 req;
-    int rc = 0;
+    int rc;
 
-    /* is part of exclusion range inside of IOMMU virtual address space? */
-    /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
-        /* reserve r/w unity-mapped page entries for devices */
-        /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
-        {
-            if ( iommu == find_iommu_for_device(iommu->seg, bdf) )
-            {
-                req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
-                rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
-                                                  iw, ir) ?:
-                     reserve_unity_map_for_device(iommu->seg, req, base, length,
-                                                  iw, ir);
-            }
-        }
-
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
+    if ( exclusion )
+    {
+        rc = reserve_iommu_exclusion_range(iommu, base, limit, true /* all */);
+        if ( !rc )
+            return 0;
     }
 
-    /* register IOMMU exclusion range settings */
-    if ( !rc && limit >= iommu_top )
-        rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                           true /* all */, iw, ir);
+    /* reserve unity-mapped page entries for devices */
+    for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+    {
+        if ( iommu != find_iommu_for_device(iommu->seg, bdf) )
+            continue;
+
+        req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
+        rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
+                                          iw, ir) ?:
+             reserve_unity_map_for_device(iommu->seg, req, base, length,
+                                          iw, ir);
+    }
 
     return rc;
 }
 
 static int __init parse_ivmd_device_select(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     u16 bdf;
 
@@ -333,12 +311,12 @@ static int __init parse_ivmd_device_sele
         return -ENODEV;
     }
 
-    return register_exclusion_range_for_device(bdf, base, limit, iw, ir);
+    return register_range_for_device(bdf, base, limit, iw, ir, exclusion);
 }
 
 static int __init parse_ivmd_device_range(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     unsigned int first_bdf, last_bdf, bdf;
     int error;
@@ -360,15 +338,15 @@ static int __init parse_ivmd_device_rang
     }
 
     for ( bdf = first_bdf, error = 0; (bdf <= last_bdf) && !error; bdf++ )
-        error = register_exclusion_range_for_device(
-            bdf, base, limit, iw, ir);
+        error = register_range_for_device(
+            bdf, base, limit, iw, ir, exclusion);
 
     return error;
 }
 
 static int __init parse_ivmd_device_iommu(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
     struct amd_iommu *iommu;
@@ -383,14 +361,14 @@ static int __init parse_ivmd_device_iomm
         return -ENODEV;
     }
 
-    return register_exclusion_range_for_iommu_devices(
-        iommu, base, limit, iw, ir);
+    return register_range_for_iommu_devices(
+        iommu, base, limit, iw, ir, exclusion);
 }
 
 static int __init parse_ivmd_block(const struct acpi_ivrs_memory *ivmd_block)
 {
     unsigned long start_addr, mem_length, base, limit;
-    u8 iw, ir;
+    bool iw = true, ir = true, exclusion = false;
 
     if ( ivmd_block->header.length < sizeof(*ivmd_block) )
     {
@@ -407,13 +385,11 @@ static int __init parse_ivmd_block(const
                     ivmd_block->header.type, start_addr, mem_length);
 
     if ( ivmd_block->header.flags & ACPI_IVMD_EXCLUSION_RANGE )
-        iw = ir = IOMMU_CONTROL_ENABLED;
+        exclusion = true;
     else if ( ivmd_block->header.flags & ACPI_IVMD_UNITY )
     {
-        iw = ivmd_block->header.flags & ACPI_IVMD_READ ?
-            IOMMU_CONTROL_ENABLED : IOMMU_CONTROL_DISABLED;
-        ir = ivmd_block->header.flags & ACPI_IVMD_WRITE ?
-            IOMMU_CONTROL_ENABLED : IOMMU_CONTROL_DISABLED;
+        iw = ivmd_block->header.flags & ACPI_IVMD_READ;
+        ir = ivmd_block->header.flags & ACPI_IVMD_WRITE;
     }
     else
     {
@@ -424,20 +400,20 @@ static int __init parse_ivmd_block(const
     switch( ivmd_block->header.type )
     {
     case ACPI_IVRS_TYPE_MEMORY_ALL:
-        return register_exclusion_range_for_all_devices(
-            base, limit, iw, ir);
+        return register_range_for_all_devices(
+            base, limit, iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_ONE:
-        return parse_ivmd_device_select(ivmd_block,
-                                        base, limit, iw, ir);
+        return parse_ivmd_device_select(ivmd_block, base, limit,
+                                        iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_RANGE:
-        return parse_ivmd_device_range(ivmd_block,
-                                       base, limit, iw, ir);
+        return parse_ivmd_device_range(ivmd_block, base, limit,
+                                       iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_IOMMU:
-        return parse_ivmd_device_iommu(ivmd_block,
-                                       base, limit, iw, ir);
+        return parse_ivmd_device_iommu(ivmd_block, base, limit,
+                                       iw, ir, exclusion);
 
     default:
         AMD_IOMMU_DEBUG("IVMD Error: Invalid Block Type!\n");
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -234,6 +234,8 @@ static int __must_check allocate_domain_
     return rc;
 }
 
+int __read_mostly amd_iommu_min_paging_mode = 1;
+
 static int amd_iommu_domain_init(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
@@ -245,11 +247,13 @@ static int amd_iommu_domain_init(struct
      * - HVM could in principle use 3 or 4 depending on how much guest
      *   physical address space we give it, but this isn't known yet so use 4
      *   unilaterally.
+     * - Unity maps may require an even higher number.
      */
-    hd->arch.amd.paging_mode = amd_iommu_get_paging_mode(
-        is_hvm_domain(d)
-        ? 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
-        : get_upper_mfn_bound() + 1);
+    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
+            is_hvm_domain(d)
+            ? 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
+            : get_upper_mfn_bound() + 1),
+        amd_iommu_min_paging_mode);
 
     return 0;
 }
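
The shape shared by the reworked register_range_*() helpers is worth a
standalone sketch: try the (single, per-IOMMU) exclusion range register
pair first, and fall back to unity mappings when that is not requested
or fails.  The stub functions below are hypothetical; the exclusion stub
fails so the fallback path is exercised.

    #include <stdbool.h>

    static int reserve_exclusion_range(void) { return -1; }  /* pair taken */
    static int reserve_unity_maps(void)      { return 0; }

    static int register_range(bool exclusion)
    {
        int rc = 0;

        /* Best effort: only one exclusion register pair exists per IOMMU. */
        if ( exclusion )
            rc = reserve_exclusion_range();

        /* Unity mappings are the always-available fallback. */
        if ( !exclusion || rc )
            rc = reserve_unity_maps();

        return rc;
    }

    int main(void)
    {
        return register_range(true);   /* exclusion fails, fallback succeeds */
    }

Note how the rc from the failed exclusion attempt is deliberately
overwritten: per the description, that registration is best effort, and
only the fallback's result matters.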

[-- Attachment #53: xsa378/xsa378-4.15-7.patch --]
[-- Type: application/octet-stream, Size: 3364 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: introduce p2m_is_special()

Seeing the similarity of grant, foreign, and (subsequently) direct-MMIO
handling, introduce a new P2M type group named "special" (as in "needing
special accessors to create/destroy").

Also use -EPERM instead of other error codes on the two domain_crash()
paths touched.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -811,7 +811,7 @@ p2m_remove_page(struct p2m_domain *p2m,
         for ( i = 0; i < (1UL << page_order); i++ )
         {
             p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, NULL, NULL);
-            if ( !p2m_is_grant(t) && !p2m_is_shared(t) && !p2m_is_foreign(t) )
+            if ( !p2m_is_special(t) && !p2m_is_shared(t) )
                 set_gpfn_from_mfn(mfn_x(mfn) + i, INVALID_M2P_ENTRY);
         }
     }
@@ -941,13 +941,13 @@ guest_physmap_add_entry(struct domain *d
                                   &ot, &a, 0, NULL, NULL);
             ASSERT(!p2m_is_shared(ot));
         }
-        if ( p2m_is_grant(ot) || p2m_is_foreign(ot) )
+        if ( p2m_is_special(ot) )
         {
-            /* Really shouldn't be unmapping grant/foreign maps this way */
+            /* Don't permit unmapping grant/foreign this way. */
             domain_crash(d);
             p2m_unlock(p2m);
             
-            return -EINVAL;
+            return -EPERM;
         }
         else if ( p2m_is_ram(ot) && !p2m_is_paged(ot) )
         {
@@ -1041,8 +1041,7 @@ int p2m_change_type_one(struct domain *d
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
     int rc;
 
-    BUG_ON(p2m_is_grant(ot) || p2m_is_grant(nt));
-    BUG_ON(p2m_is_foreign(ot) || p2m_is_foreign(nt));
+    BUG_ON(p2m_is_special(ot) || p2m_is_special(nt));
 
     gfn_lock(p2m, gfn, 0);
 
@@ -1289,11 +1288,11 @@ static int set_typed_p2m_entry(struct do
         gfn_unlock(p2m, gfn, order);
         return cur_order + 1;
     }
-    if ( p2m_is_grant(ot) || p2m_is_foreign(ot) )
+    if ( p2m_is_special(ot) )
     {
         gfn_unlock(p2m, gfn, order);
         domain_crash(d);
-        return -ENOENT;
+        return -EPERM;
     }
     else if ( p2m_is_ram(ot) )
     {
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -149,6 +149,10 @@ typedef unsigned int p2m_query_t;
                             | p2m_to_mask(p2m_ram_logdirty) )
 #define P2M_SHARED_TYPES   (p2m_to_mask(p2m_ram_shared))
 
+/* Types established/cleaned up via special accessors. */
+#define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \
+                           p2m_to_mask(p2m_map_foreign))
+
 /* Valid types not necessarily associated with a (valid) MFN. */
 #define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES                  \
                                | p2m_to_mask(p2m_mmio_direct) \
@@ -177,6 +181,7 @@ typedef unsigned int p2m_query_t;
 #define p2m_is_paged(_t)    (p2m_to_mask(_t) & P2M_PAGED_TYPES)
 #define p2m_is_sharable(_t) (p2m_to_mask(_t) & P2M_SHARABLE_TYPES)
 #define p2m_is_shared(_t)   (p2m_to_mask(_t) & P2M_SHARED_TYPES)
+#define p2m_is_special(_t)  (p2m_to_mask(_t) & P2M_SPECIAL_TYPES)
 #define p2m_is_broken(_t)   (p2m_to_mask(_t) & P2M_BROKEN_TYPES)
 #define p2m_is_foreign(_t)  (p2m_to_mask(_t) & p2m_to_mask(p2m_map_foreign))
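
The grouping pattern behind p2m_is_special() deserves spelling out: each
p2m type occupies one bit of a mask, so a group predicate is a single
AND.  A minimal sketch with illustrative type names (the real enum is
far longer):

    enum p2m_type { p2m_ram_rw, p2m_grant_map_ro, p2m_grant_map_rw,
                    p2m_map_foreign };

    #define p2m_to_mask(t)    (1U << (t))

    #define P2M_GRANT_TYPES   (p2m_to_mask(p2m_grant_map_ro) | \
                               p2m_to_mask(p2m_grant_map_rw))
    #define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \
                               p2m_to_mask(p2m_map_foreign))

    #define p2m_is_special(t) (p2m_to_mask(t) & P2M_SPECIAL_TYPES)

    int main(void)
    {
        return p2m_is_special(p2m_map_foreign) ? 0 : 1;
    }

Extending the group later is then a one-line change to
P2M_SPECIAL_TYPES, which is exactly what the next patch does for
p2m_mmio_direct.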
 

[-- Attachment #54: xsa378/xsa378-4.15-8.patch --]
[-- Type: application/octet-stream, Size: 5813 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: guard (in particular) identity mapping entries

Such entries, created by set_identity_p2m_entry(), should only be
destroyed by clear_identity_p2m_entry(). However, similarly, entries
created by set_mmio_p2m_entry() should only be torn down by
clear_mmio_p2m_entry(), so the logic gets based upon p2m_mmio_direct as
the entry type (separation between "ordinary" and 1:1 mappings would
require a further indicator to tell apart the two).

As to the guest_remove_page() change, commit 48dfb297a20a ("x86/PVH:
allow guest_remove_page to remove p2m_mmio_direct pages"), which
introduced the call to clear_mmio_p2m_entry(), claimed this was done for
hwdom only without this actually having been the case. However, this
code shouldn't be there in the first place, as MMIO entries shouldn't be
dropped this way. Avoid triggering the warning again that 48dfb297a20a
silenced by an adjustment to xenmem_add_to_physmap_one() instead.

Note that guest_physmap_mark_populate_on_demand() gets tightened beyond
the immediate purpose of this change.

Note also that I didn't inspect code which isn't security supported,
e.g. sharing, paging, or altp2m.

This is CVE-2021-28694 / part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -799,7 +799,8 @@ p2m_remove_page(struct p2m_domain *p2m,
                                           &cur_order, NULL);
 
         if ( p2m_is_valid(t) &&
-             (!mfn_valid(mfn) || !mfn_eq(mfn_add(mfn, i), mfn_return)) )
+             (!mfn_valid(mfn) || t == p2m_mmio_direct ||
+              !mfn_eq(mfn_add(mfn, i), mfn_return)) )
             return -EILSEQ;
 
         i += (1UL << cur_order) -
@@ -899,7 +900,7 @@ guest_physmap_add_entry(struct domain *d
     if ( p2m_is_foreign(t) )
         return -EINVAL;
 
-    if ( !mfn_valid(mfn) )
+    if ( !mfn_valid(mfn) || t == p2m_mmio_direct )
     {
         ASSERT_UNREACHABLE();
         return -EINVAL;
@@ -943,7 +944,7 @@ guest_physmap_add_entry(struct domain *d
         }
         if ( p2m_is_special(ot) )
         {
-            /* Don't permit unmapping grant/foreign this way. */
+            /* Don't permit unmapping grant/foreign/direct-MMIO this way. */
             domain_crash(d);
             p2m_unlock(p2m);
             
@@ -1399,8 +1400,8 @@ int set_identity_p2m_entry(struct domain
  *    order+1  for caller to retry with order (guaranteed smaller than
  *             the order value passed in)
  */
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn_l, mfn_t mfn,
-                         unsigned int order)
+static int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn_l,
+                                mfn_t mfn, unsigned int order)
 {
     int rc = -EINVAL;
     gfn_t gfn = _gfn(gfn_l);
@@ -2731,7 +2732,9 @@ int xenmem_add_to_physmap_one(
 
     /* Remove previously mapped page if it was present. */
     prev_mfn = get_gfn(d, gfn_x(gpfn), &p2mt);
-    if ( mfn_valid(prev_mfn) )
+    if ( p2mt == p2m_mmio_direct )
+        rc = -EPERM;
+    else if ( mfn_valid(prev_mfn) )
     {
         if ( is_special_page(mfn_to_page(prev_mfn)) )
             /* Special pages are simply unhooked from this phys slot. */
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -1299,17 +1299,17 @@ guest_physmap_mark_populate_on_demand(st
 
         p2m->get_entry(p2m, gfn_add(gfn, i), &ot, &a, 0, &cur_order, NULL);
         n = 1UL << min(order, cur_order);
-        if ( p2m_is_ram(ot) )
+        if ( ot == p2m_populate_on_demand )
+        {
+            /* Count how many PoD entries we'll be replacing if successful */
+            pod_count += n;
+        }
+        else if ( ot != p2m_invalid && ot != p2m_mmio_dm )
         {
             P2M_DEBUG("gfn_to_mfn returned type %d!\n", ot);
             rc = -EBUSY;
             goto out;
         }
-        else if ( ot == p2m_populate_on_demand )
-        {
-            /* Count how man PoD entries we'll be replacing if successful */
-            pod_count += n;
-        }
     }
 
     /* Now, actually do the two-way mapping */
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -330,7 +330,7 @@ int guest_remove_page(struct domain *d,
     }
     if ( p2mt == p2m_mmio_direct )
     {
-        rc = clear_mmio_p2m_entry(d, gmfn, mfn, PAGE_ORDER_4K);
+        rc = -EPERM;
         goto out_put_gfn;
     }
 #else
@@ -1875,6 +1875,15 @@ int check_get_page_from_gfn(struct domai
         return -EAGAIN;
     }
 #endif
+#ifdef CONFIG_X86
+    if ( p2mt == p2m_mmio_direct )
+    {
+        if ( page )
+            put_page(page);
+
+        return -EPERM;
+    }
+#endif
 
     if ( !page )
         return -EINVAL;
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -151,7 +151,8 @@ typedef unsigned int p2m_query_t;
 
 /* Types established/cleaned up via special accessors. */
 #define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \
-                           p2m_to_mask(p2m_map_foreign))
+                           p2m_to_mask(p2m_map_foreign) | \
+                           p2m_to_mask(p2m_mmio_direct))
 
 /* Valid types not necessarily associated with a (valid) MFN. */
 #define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES                  \
@@ -666,8 +667,6 @@ int p2m_is_logdirty_range(struct p2m_dom
 /* Set mmio addresses in the p2m table (for pass-through) */
 int set_mmio_p2m_entry(struct domain *d, gfn_t gfn, mfn_t mfn,
                        unsigned int order);
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
-                         unsigned int order);
 
 /* Set identity addresses in the p2m table (for pass-through) */
 int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
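
The rule being enforced, that entries created by a set_*() accessor may
only be removed by the matching clear_*() accessor, reduces to a type
check on every generic removal path.  A hedged sketch (enum values and
helper are illustrative, not Xen's):

    #include <errno.h>
    #include <stdbool.h>

    enum p2m_type { p2m_invalid, p2m_ram_rw, p2m_mmio_direct };

    static bool p2m_is_special(enum p2m_type t)
    {
        return t == p2m_mmio_direct;   /* grant/foreign omitted for brevity */
    }

    /* Generic removal: look up the current type first and refuse to
     * touch entries owned by a dedicated clear_*() accessor. */
    static int generic_remove(enum p2m_type ot)
    {
        if ( p2m_is_special(ot) )
            return -EPERM;   /* e.g. guest_remove_page() after this patch */
        /* ... ordinary teardown ... */
        return 0;
    }

    int main(void)
    {
        return generic_remove(p2m_mmio_direct) == -EPERM ? 0 : 1;
    }

-EPERM, rather than -EINVAL or -ENOENT, matches the error code the
previous patch already switched the domain_crash() paths to.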

[-- Attachment #55: xsa378/xsa378-4.patch --]
[-- Type: application/octet-stream, Size: 13196 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU: generalize VT-d's tracking of mapped RMRR regions

In order to re-use it elsewhere, move the logic to vendor independent
code and strip it of RMRR specifics.

Note that the prior "map" parameter gets folded into the new "p2ma" one
(which AMD IOMMU code will want to make use of), assigning alternative
meaning ("unmap") to p2m_access_x. Prepare set_identity_p2m_entry() and
p2m_get_iommu_flags() for getting passed access types other than
p2m_access_rw (in the latter case just for p2m_mmio_direct requests).

Note also that, to be on the safe side, an overlap check gets added to
the main loop of iommu_identity_mapping().

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
---
TBD: In this context (the latest) I question whether it is a good idea
     for clear_identity_p2m_entry() to return success when it didn't
     clear the entry because of mismatching type or MFN.
TBD: Question is whether p2m_get_iommu_flags() should apply the access
     restriction in all cases, not just for p2m_mmio_direct.
TBD: Neither possible placement of p2m_access_rx2rw and p2m_access_n2rwx
     in p2m_access_to_iommu_flags() looks entirely correct/sensible.
---
v9: Re-base.
v7: s/unity_map/identity_map/g. Re-base.
v6: Re-base.
v5: New.

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1427,7 +1427,7 @@ int set_identity_p2m_entry(struct domain
             return 0;
         return iommu_legacy_map(d, _dfn(gfn_l), _mfn(gfn_l),
                                 1ul << PAGE_ORDER_4K,
-                                IOMMUF_readable | IOMMUF_writable);
+                                p2m_access_to_iommu_flags(p2ma));
 #ifdef CONFIG_HVM
     }
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -45,12 +45,6 @@
 /* dom_io is used as a sentinel for quarantined devices */
 #define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.vtd.pgd_maddr)
 
-struct mapped_rmrr {
-    struct list_head list;
-    u64 base, end;
-    unsigned int count;
-};
-
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
 
@@ -1306,7 +1300,6 @@ static int intel_iommu_domain_init(struc
     struct domain_iommu *hd = dom_iommu(d);
 
     hd->arch.vtd.agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
-    INIT_LIST_HEAD(&hd->arch.vtd.mapped_rmrrs);
 
     return 0;
 }
@@ -1785,17 +1778,12 @@ static void iommu_clear_root_pgtable(str
 static void iommu_domain_teardown(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    struct mapped_rmrr *mrmrr, *tmp;
     const struct acpi_drhd_unit *drhd;
 
     if ( list_empty(&acpi_drhd_units) )
         return;
 
-    list_for_each_entry_safe ( mrmrr, tmp, &hd->arch.vtd.mapped_rmrrs, list )
-    {
-        list_del(&mrmrr->list);
-        xfree(mrmrr);
-    }
+    iommu_identity_map_teardown(d);
 
     ASSERT(!hd->arch.vtd.pgd_maddr);
 
@@ -1943,74 +1931,6 @@ static int __init vtd_ept_page_compatibl
            (ept_has_1gb(ept_cap) && opt_hap_1gb) <= cap_sps_1gb(vtd_cap);
 }
 
-static int rmrr_identity_mapping(struct domain *d, bool_t map,
-                                 const struct acpi_rmrr_unit *rmrr,
-                                 u32 flag)
-{
-    unsigned long base_pfn = rmrr->base_address >> PAGE_SHIFT_4K;
-    unsigned long end_pfn = PAGE_ALIGN_4K(rmrr->end_address) >> PAGE_SHIFT_4K;
-    struct mapped_rmrr *mrmrr;
-    struct domain_iommu *hd = dom_iommu(d);
-
-    ASSERT(pcidevs_locked());
-    ASSERT(rmrr->base_address < rmrr->end_address);
-
-    /*
-     * No need to acquire hd->arch.mapping_lock: Both insertion and removal
-     * get done while holding pcidevs_lock.
-     */
-    list_for_each_entry( mrmrr, &hd->arch.vtd.mapped_rmrrs, list )
-    {
-        if ( mrmrr->base == rmrr->base_address &&
-             mrmrr->end == rmrr->end_address )
-        {
-            int ret = 0;
-
-            if ( map )
-            {
-                ++mrmrr->count;
-                return 0;
-            }
-
-            if ( --mrmrr->count )
-                return 0;
-
-            while ( base_pfn < end_pfn )
-            {
-                if ( clear_identity_p2m_entry(d, base_pfn) )
-                    ret = -ENXIO;
-                base_pfn++;
-            }
-
-            list_del(&mrmrr->list);
-            xfree(mrmrr);
-            return ret;
-        }
-    }
-
-    if ( !map )
-        return -ENOENT;
-
-    while ( base_pfn < end_pfn )
-    {
-        int err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw, flag);
-
-        if ( err )
-            return err;
-        base_pfn++;
-    }
-
-    mrmrr = xmalloc(struct mapped_rmrr);
-    if ( !mrmrr )
-        return -ENOMEM;
-    mrmrr->base = rmrr->base_address;
-    mrmrr->end = rmrr->end_address;
-    mrmrr->count = 1;
-    list_add_tail(&mrmrr->list, &hd->arch.vtd.mapped_rmrrs);
-
-    return 0;
-}
-
 static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
 {
     struct acpi_rmrr_unit *rmrr;
@@ -2042,7 +1962,9 @@ static int intel_iommu_add_device(u8 dev
              * Since RMRRs are always reserved in the e820 map for the hardware
              * domain, there shouldn't be a conflict.
              */
-            ret = rmrr_identity_mapping(pdev->domain, 1, rmrr, 0);
+            ret = iommu_identity_mapping(pdev->domain, p2m_access_rw,
+                                         rmrr->base_address, rmrr->end_address,
+                                         0);
             if ( ret )
                 dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
                         pdev->domain->domain_id);
@@ -2087,7 +2009,8 @@ static int intel_iommu_remove_device(u8
          * Any flag is nothing to clear these mappings but here
          * its always safe and strict to set 0.
          */
-        rmrr_identity_mapping(pdev->domain, 0, rmrr, 0);
+        iommu_identity_mapping(pdev->domain, p2m_access_x, rmrr->base_address,
+                               rmrr->end_address, 0);
     }
 
     return domain_context_unmap(pdev->domain, devfn, pdev);
@@ -2286,7 +2209,8 @@ static void __hwdom_init setup_hwdom_rmr
          * domain, there shouldn't be a conflict. So its always safe and
          * strict to set 0.
          */
-        ret = rmrr_identity_mapping(d, 1, rmrr, 0);
+        ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address,
+                                     rmrr->end_address, 0);
         if ( ret )
             dprintk(XENLOG_ERR VTDPREFIX,
                      "IOMMU: mapping reserved region failed\n");
@@ -2468,7 +2392,9 @@ static int reassign_device_ownership(
                  * Any RMRR flag is always ignored when remove a device,
                  * but its always safe and strict to set 0.
                  */
-                ret = rmrr_identity_mapping(source, 0, rmrr, 0);
+                ret = iommu_identity_mapping(source, p2m_access_x,
+                                             rmrr->base_address,
+                                             rmrr->end_address, 0);
                 if ( ret != -ENOENT )
                     return ret;
             }
@@ -2564,7 +2490,8 @@ static int intel_iommu_assign_device(
              PCI_BUS(bdf) == bus &&
              PCI_DEVFN2(bdf) == devfn )
         {
-            ret = rmrr_identity_mapping(d, 1, rmrr, flag);
+            ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address,
+                                         rmrr->end_address, flag);
             if ( ret )
             {
                 int rc;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -144,6 +144,7 @@ int arch_iommu_domain_init(struct domain
 
     INIT_PAGE_LIST_HEAD(&hd->arch.pgtables.list);
     spin_lock_init(&hd->arch.pgtables.lock);
+    INIT_LIST_HEAD(&hd->arch.identity_maps);
 
     return 0;
 }
@@ -159,6 +160,99 @@ void arch_iommu_domain_destroy(struct do
            page_list_empty(&dom_iommu(d)->arch.pgtables.list));
 }
 
+struct identity_map {
+    struct list_head list;
+    paddr_t base, end;
+    p2m_access_t access;
+    unsigned int count;
+};
+
+int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
+                           paddr_t base, paddr_t end,
+                           unsigned int flag)
+{
+    unsigned long base_pfn = base >> PAGE_SHIFT_4K;
+    unsigned long end_pfn = PAGE_ALIGN_4K(end) >> PAGE_SHIFT_4K;
+    struct identity_map *map;
+    struct domain_iommu *hd = dom_iommu(d);
+
+    ASSERT(pcidevs_locked());
+    ASSERT(base < end);
+
+    /*
+     * No need to acquire hd->arch.mapping_lock: Both insertion and removal
+     * get done while holding pcidevs_lock.
+     */
+    list_for_each_entry( map, &hd->arch.identity_maps, list )
+    {
+        if ( map->base == base && map->end == end )
+        {
+            int ret = 0;
+
+            if ( p2ma != p2m_access_x )
+            {
+                if ( map->access != p2ma )
+                    return -EADDRINUSE;
+                ++map->count;
+                return 0;
+            }
+
+            if ( --map->count )
+                return 0;
+
+            while ( base_pfn < end_pfn )
+            {
+                if ( clear_identity_p2m_entry(d, base_pfn) )
+                    ret = -ENXIO;
+                base_pfn++;
+            }
+
+            list_del(&map->list);
+            xfree(map);
+
+            return ret;
+        }
+
+        if ( end >= map->base && map->end >= base )
+            return -EADDRINUSE;
+    }
+
+    if ( p2ma == p2m_access_x )
+        return -ENOENT;
+
+    while ( base_pfn < end_pfn )
+    {
+        int err = set_identity_p2m_entry(d, base_pfn, p2ma, flag);
+
+        if ( err )
+            return err;
+        base_pfn++;
+    }
+
+    map = xmalloc(struct identity_map);
+    if ( !map )
+        return -ENOMEM;
+    map->base = base;
+    map->end = end;
+    map->access = p2ma;
+    map->count = 1;
+    list_add_tail(&map->list, &hd->arch.identity_maps);
+
+    return 0;
+}
+
+void iommu_identity_map_teardown(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    struct identity_map *map, *tmp;
+
+    list_for_each_entry_safe ( map, tmp, &hd->arch.identity_maps, list )
+    {
+        list_del(&map->list);
+        xfree(map);
+    }
+}
+
 static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
                                          unsigned long pfn,
                                          unsigned long max_pfn)
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -16,6 +16,7 @@
 
 #include <xen/errno.h>
 #include <xen/list.h>
+#include <xen/mem_access.h>
 #include <xen/spinlock.h>
 #include <asm/apicdef.h>
 #include <asm/processor.h>
@@ -50,13 +51,14 @@ struct arch_iommu
         spinlock_t lock;
     } pgtables;
 
+    struct list_head identity_maps;
+
     union {
         /* Intel VT-d */
         struct {
             uint64_t pgd_maddr; /* io page directory machine address */
             unsigned int agaw; /* adjusted guest address width, 0 is level 2 30-bit */
             uint64_t iommu_bitmap; /* bitmap of iommu(s) that the domain uses */
-            struct list_head mapped_rmrrs;
         } vtd;
         /* AMD IOMMU */
         struct {
@@ -122,6 +124,11 @@ static inline void iommu_disable_x2apic(
         iommu_ops.disable_x2apic();
 }
 
+int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
+                           paddr_t base, paddr_t end,
+                           unsigned int flag);
+void iommu_identity_map_teardown(struct domain *d);
+
 extern bool untrusted_msi;
 
 int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -888,6 +888,34 @@ struct p2m_domain *p2m_get_altp2m(struct
 static inline void p2m_altp2m_check(struct vcpu *v, uint16_t idx) {}
 #endif
 
+/* p2m access to IOMMU flags */
+static inline unsigned int p2m_access_to_iommu_flags(p2m_access_t p2ma)
+{
+    switch ( p2ma )
+    {
+    case p2m_access_rw:
+    case p2m_access_rwx:
+        return IOMMUF_readable | IOMMUF_writable;
+
+    case p2m_access_r:
+    case p2m_access_rx:
+    case p2m_access_rx2rw:
+        return IOMMUF_readable;
+
+    case p2m_access_w:
+    case p2m_access_wx:
+        return IOMMUF_writable;
+
+    case p2m_access_n:
+    case p2m_access_x:
+    case p2m_access_n2rwx:
+        return 0;
+    }
+
+    ASSERT_UNREACHABLE();
+    return 0;
+}
+
 /*
  * p2m type to IOMMU flags
  */
@@ -909,9 +937,10 @@ static inline unsigned int p2m_get_iommu
         flags = IOMMUF_readable;
         break;
     case p2m_mmio_direct:
-        flags = IOMMUF_readable;
-        if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
-            flags |= IOMMUF_writable;
+        flags = p2m_access_to_iommu_flags(p2ma);
+        if ( (flags & IOMMUF_writable) &&
+             rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
+            flags &= ~IOMMUF_writable;
         break;
     default:
         flags = 0;
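
The overlap check added to the main loop of iommu_identity_mapping() is
the standard inclusive-interval intersection test: two ranges intersect
exactly when each one starts no later than the other ends.  A standalone
sketch:

    #include <assert.h>
    #include <stdbool.h>

    typedef unsigned long long paddr_t;   /* stand-in for Xen's paddr_t */

    static bool ranges_overlap(paddr_t a_base, paddr_t a_end,
                               paddr_t b_base, paddr_t b_end)
    {
        return a_end >= b_base && b_end >= a_base;
    }

    int main(void)
    {
        assert(ranges_overlap(0x1000, 0x1fff, 0x1800, 0x27ff));  /* partial */
        assert(!ranges_overlap(0x1000, 0x1fff, 0x2000, 0x2fff)); /* adjacent */
        return 0;
    }

In the patch this is the -EADDRINUSE return taken when an incoming range
intersects an already-tracked one without matching it exactly.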

[-- Attachment #56: xsa378/xsa378-5.patch --]
[-- Type: application/octet-stream, Size: 7717 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-arrange/complete re-assignment handling

Prior to the assignment step having completed successfully, devices
should not get associated with their new owner. Hand the device to DomIO
(perhaps temporarily), until after the de-assignment step has completed.

De-assignment of a device (from other than Dom0) as well as failure of
reassign_device() during assignment should result in unity mappings
getting torn down. This in turn requires switching to a refcounted
mapping approach, as was already used by VT-d for its RMRRs, to prevent
unmapping a region used by multiple devices.

This is CVE-2021-28696 / part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
---
v9: Re-base.
v7: Re-base.
v6: Re-base.
v5: Switch to using iommu_identity_mapping(). Also unmap() when any
    map() fails in amd_iommu_assign_device(), unless it's Dom0. Re-base.
v4: Use daddr_to_dfn(). Drop stray blank line. Don't ever unmap from
    Dom0.
v3: Use DomIO instead of DomXEN.
v2: Re-base over XSA-302.

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -232,8 +232,10 @@ int __must_check amd_iommu_unmap_page(st
                                       unsigned int *flush_flags);
 int __must_check amd_iommu_alloc_root(struct domain *d);
 int amd_iommu_reserve_domain_unity_map(struct domain *domain,
-                                       paddr_t phys_addr, unsigned long size,
-                                       int iw, int ir);
+                                       const struct ivrs_unity_map *map,
+                                       unsigned int flag);
+int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
+                                         const struct ivrs_unity_map *map);
 int __must_check amd_iommu_flush_iotlb_pages(struct domain *d, dfn_t dfn,
                                              unsigned long page_count,
                                              unsigned int flush_flags);
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -419,38 +419,49 @@ int amd_iommu_flush_iotlb_all(struct dom
     return 0;
 }
 
-int amd_iommu_reserve_domain_unity_map(struct domain *domain,
-                                       paddr_t phys_addr,
-                                       unsigned long size, int iw, int ir)
+int amd_iommu_reserve_domain_unity_map(struct domain *d,
+                                       const struct ivrs_unity_map *map,
+                                       unsigned int flag)
 {
-    unsigned long npages, i;
-    unsigned long gfn;
-    unsigned int flags = !!ir;
-    unsigned int flush_flags = 0;
-    int rt = 0;
-
-    if ( iw )
-        flags |= IOMMUF_writable;
-
-    npages = region_to_pages(phys_addr, size);
-    gfn = phys_addr >> PAGE_SHIFT;
-    for ( i = 0; i < npages; i++ )
+    int rc;
+
+    if ( d == dom_io )
+        return 0;
+
+    for ( rc = 0; !rc && map; map = map->next )
     {
-        unsigned long frame = gfn + i;
+        p2m_access_t p2ma = p2m_access_n;
+
+        if ( map->read )
+            p2ma |= p2m_access_r;
+        if ( map->write )
+            p2ma |= p2m_access_w;
 
-        rt = amd_iommu_map_page(domain, _dfn(frame), _mfn(frame), flags,
-                                &flush_flags);
-        if ( rt != 0 )
-            break;
+        rc = iommu_identity_mapping(d, p2ma, map->addr,
+                                    map->addr + map->length - 1, flag);
     }
 
-    /* Use while-break to avoid compiler warning */
-    while ( flush_flags &&
-            amd_iommu_flush_iotlb_pages(domain, _dfn(gfn),
-                                        npages, flush_flags) )
-        break;
+    return rc;
+}
+
+int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
+                                         const struct ivrs_unity_map *map)
+{
+    int rc;
+
+    if ( d == dom_io )
+        return 0;
+
+    for ( rc = 0; map; map = map->next )
+    {
+        int ret = iommu_identity_mapping(d, p2m_access_x, map->addr,
+                                         map->addr + map->length - 1, 0);
+
+        if ( ret && ret != -ENOENT && !rc )
+            rc = ret;
+    }
 
-    return rt;
+    return rc;
 }
 
 int __init amd_iommu_quarantine_init(struct domain *d)
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -347,6 +347,7 @@ static int reassign_device(struct domain
 {
     struct amd_iommu *iommu;
     int bdf, rc;
+    const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
 
     bdf = PCI_BDF2(pdev->bus, pdev->devfn);
     iommu = find_iommu_for_device(pdev->seg, bdf);
@@ -361,10 +362,24 @@ static int reassign_device(struct domain
 
     amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
 
-    if ( devfn == pdev->devfn )
+    /*
+     * If the device belongs to the hardware domain, and it has a unity mapping,
+     * don't remove it from the hardware domain, because BIOS may reference that
+     * mapping.
+     */
+    if ( !is_hardware_domain(source) )
     {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
+        rc = amd_iommu_reserve_domain_unity_unmap(
+                 source,
+                 ivrs_mappings[get_dma_requestor_id(pdev->seg, bdf)].unity_map);
+        if ( rc )
+            return rc;
+    }
+
+    if ( devfn == pdev->devfn && pdev->domain != dom_io )
+    {
+        list_move(&pdev->domain_list, &dom_io->pdev_list);
+        pdev->domain = dom_io;
     }
 
     rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
@@ -374,6 +389,12 @@ static int reassign_device(struct domain
     AMD_IOMMU_DEBUG("Re-assign %pp from dom%d to dom%d\n",
                     &pdev->sbdf, source->domain_id, target->domain_id);
 
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
+
     return 0;
 }
 
@@ -384,20 +405,28 @@ static int amd_iommu_assign_device(struc
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int bdf = PCI_BDF2(pdev->bus, devfn);
     int req_id = get_dma_requestor_id(pdev->seg, bdf);
-    const struct ivrs_unity_map *unity_map;
+    int rc = amd_iommu_reserve_domain_unity_map(
+                 d, ivrs_mappings[req_id].unity_map, flag);
 
-    for ( unity_map = ivrs_mappings[req_id].unity_map; unity_map;
-          unity_map = unity_map->next )
+    if ( !rc )
+        rc = reassign_device(pdev->domain, d, devfn, pdev);
+
+    if ( rc && !is_hardware_domain(d) )
     {
-        int rc = amd_iommu_reserve_domain_unity_map(
-                     d, unity_map->addr, unity_map->length,
-                     unity_map->write, unity_map->read);
+        int ret = amd_iommu_reserve_domain_unity_unmap(
+                      d, ivrs_mappings[req_id].unity_map);
 
-        if ( rc )
-            return rc;
+        if ( ret )
+        {
+            printk(XENLOG_ERR "AMD-Vi: "
+                   "unity-unmap for %pd/%04x:%02x:%02x.%u failed (%d)\n",
+                   d, pdev->seg, pdev->bus,
+                   PCI_SLOT(devfn), PCI_FUNC(devfn), ret);
+            domain_crash(d);
+        }
     }
 
-    return reassign_device(pdev->domain, d, devfn, pdev);
+    return rc;
 }
 
 static void amd_iommu_clear_root_pgtable(struct domain *d)
@@ -411,6 +440,7 @@ static void amd_iommu_clear_root_pgtable
 
 static void amd_iommu_domain_destroy(struct domain *d)
 {
+    iommu_identity_map_teardown(d);
     ASSERT(!dom_iommu(d)->arch.amd.root_table);
 }
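
amd_iommu_reserve_domain_unity_unmap() above uses an error-accumulation
idiom worth noting: keep walking the list after a failure, remember only
the first real error, and tolerate -ENOENT (the mapping is already
gone).  A self-contained sketch with stand-in types:

    #include <errno.h>
    #include <stdio.h>

    struct unity_map { struct unity_map *next; int fake_result; };

    /* Stand-in for iommu_identity_mapping(d, p2m_access_x, ...). */
    static int do_unmap(const struct unity_map *m)
    {
        return m->fake_result;
    }

    static int unmap_all(const struct unity_map *map)
    {
        int rc = 0;

        for ( ; map; map = map->next )
        {
            int ret = do_unmap(map);

            if ( ret && ret != -ENOENT && !rc )
                rc = ret;   /* record the first real error, keep going */
        }

        return rc;
    }

    int main(void)
    {
        struct unity_map c = { NULL, 0 };
        struct unity_map b = { &c, -ENOENT };  /* tolerated */
        struct unity_map a = { &b, -EIO };     /* reported */

        printf("%d\n", unmap_all(&a));         /* prints -5, i.e. -EIO */
        return 0;
    }

Unmapping as much as possible matters here: any unity mapping left in
place is exactly the stale-access window CVE-2021-28696 describes.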
 

[-- Attachment #57: xsa378/xsa378-6.patch --]
[-- Type: application/octet-stream, Size: 16239 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-arrange exclusion range and unity map recording

The spec makes no provisions for OS behavior here to depend on the
amount of RAM found on the system. While the spec may not sufficiently
clearly distinguish both kinds of regions, they are surely meant to be
separate things: Only regions with ACPI_IVMD_EXCLUSION_RANGE set should
be candidates for putting in the exclusion range registers. (As there's
only a single such pair of registers per IOMMU, secondary non-adjacent
regions with the flag set already get converted to unity mapped
regions.)

First of all, drop the dependency on max_page. With commit b4f042236ae0
("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables") the
use of it here was stale anyway; it was bogus already before, as it
didn't account for max_page getting increased later on. Simply try an
exclusion range registration first, and if it fails (for being
unsuitable or non-mergeable), register a unity mapping range.

With this, various local variables become unnecessary and hence get
dropped at the same time.

With the max_page boundary dropped for using unity maps, the minimum
page table tree height now needs both recording and enforcing in
amd_iommu_domain_init(). Since we can't predict which devices may get
assigned to a domain, our only option is to uniformly force at least
that height for all domains, now that the height isn't dynamic anymore.

Further don't make use of the exclusion range unless ACPI data says so.

Note that exclusion range registration in
register_range_for_all_devices() is on a best-effort basis. Hence unity
map entries also registered are redundant when the former succeeded, but
they also do no harm. Improvements in this area can be made later, in my
opinion.

Also adjust types where suitable without touching extra lines.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
---
TBD: The adjustment done to amd_iommu_domain_init() would suggest that
     it might be worth considering using the exclusion range register
     for the highest reported (and suitable) range, rather than the
     first one found (however unlikely it might be to find multiple
     ones in the first place). That way we might be able to limit the
     page table tree height.
---
v9: Re-base.
v8: Re-base over adjustments to earlier patch.
v7: Don't use exclusion range when ACPI table entry doesn't say so.
v6: Re-base.
v5: New.
---
Backporting note: This depends on b75b3c62fe4a ("AMD/IOMMU: fix
off-by-one in amd_iommu_get_paging_mode() callers").
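
Why unity maps can force a taller page-table tree: with 4k pages and
512-entry tables, each additional level adds 9 bits of addressable DFN
space, so a map ending high in the address space raises the minimum
height for every domain.  A rough, hypothetical re-derivation; the real
helper is amd_iommu_get_paging_mode(), whose exact bounds may differ.

    #include <stdio.h>

    static int paging_mode_for(unsigned long long max_frames)
    {
        int level = 1;
        unsigned long long frames = 512;   /* one level-1 table: 2^9 entries */

        while ( frames < max_frames )
        {
            if ( ++level > 6 )
                return -1;                 /* beyond what the IOMMU can walk */
            frames <<= 9;
        }

        return level;
    }

    int main(void)
    {
        /* A unity map ending at 1TiB spans frames up to 2^28: 4 levels. */
        printf("%d\n", paging_mode_for(1ULL << 28));
        return 0;
    }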

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -304,6 +304,8 @@ extern struct hpet_sbdf {
     } init;
 } hpet_sbdf;
 
+extern int amd_iommu_min_paging_mode;
+
 extern void *shared_intremap_table;
 extern unsigned long *shared_intremap_inuse;
 
--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -117,12 +117,8 @@ static struct amd_iommu * __init find_io
 }
 
 static int __init reserve_iommu_exclusion_range(
-    struct amd_iommu *iommu, uint64_t base, uint64_t limit,
-    bool all, bool iw, bool ir)
+    struct amd_iommu *iommu, paddr_t base, paddr_t limit, bool all)
 {
-    if ( !ir || !iw )
-        return -EPERM;
-
     /* need to extend exclusion range? */
     if ( iommu->exclusion_enable )
     {
@@ -151,14 +147,18 @@ static int __init reserve_unity_map_for_
 {
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
     struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map;
+    int paging_mode = amd_iommu_get_paging_mode(PFN_UP(base + length));
+
+    if ( paging_mode < 0 )
+        return paging_mode;
 
     /* Check for overlaps. */
     for ( ; unity_map; unity_map = unity_map->next )
     {
         /*
          * Exact matches are okay. This can in particular happen when
-         * register_exclusion_range_for_device() calls here twice for the
-         * same (s,b,d,f).
+         * register_range_for_device() calls here twice for the same
+         * (s,b,d,f).
          */
         if ( base == unity_map->addr && length == unity_map->length &&
              ir == unity_map->read && iw == unity_map->write )
@@ -186,55 +186,52 @@ static int __init reserve_unity_map_for_
     unity_map->next = ivrs_mappings[bdf].unity_map;
     ivrs_mappings[bdf].unity_map = unity_map;
 
+    if ( paging_mode > amd_iommu_min_paging_mode )
+        amd_iommu_min_paging_mode = paging_mode;
+
     return 0;
 }
 
-static int __init register_exclusion_range_for_all_devices(
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_all_devices(
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
-    unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
-    unsigned int bdf;
     int rc = 0;
 
     /* is part of exclusion range inside of IOMMU virtual address space? */
     /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
-        /* reserve r/w unity-mapped page entries for devices */
-        /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
-            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
-    }
-    /* register IOMMU exclusion range settings */
-    if ( !rc && limit >= iommu_top )
+    if ( exclusion )
     {
         for_each_amd_iommu( iommu )
         {
-            rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                               true /* all */, iw, ir);
-            if ( rc )
-                break;
+            int ret = reserve_iommu_exclusion_range(iommu, base, limit,
+                                                    true /* all */);
+
+            if ( ret && !rc )
+                rc = ret;
         }
     }
 
+    if ( !exclusion || rc )
+    {
+        paddr_t length = limit + PAGE_SIZE - base;
+        unsigned int bdf;
+
+        /* reserve r/w unity-mapped page entries for devices */
+        for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+            rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir);
+    }
+
     return rc;
 }
 
-static int __init register_exclusion_range_for_device(
-    u16 bdf, unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_device(
+    unsigned int bdf, paddr_t base, paddr_t limit,
+    bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg);
-    unsigned long range_top, iommu_top, length;
     struct amd_iommu *iommu;
     u16 req;
     int rc = 0;
@@ -248,27 +245,19 @@ static int __init register_exclusion_ran
     req = ivrs_mappings[bdf].dte_requestor_id;
 
     /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
+    if ( exclusion )
+        rc = reserve_iommu_exclusion_range(iommu, base, limit,
+                                           false /* all */);
+    if ( !exclusion || rc )
+    {
+        paddr_t length = limit + PAGE_SIZE - base;
+
         /* reserve unity-mapped page entries for device */
-        /* note: these entries are part of the exclusion range */
         rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?:
              reserve_unity_map_for_device(seg, req, base, length, iw, ir);
-
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
     }
-
-    /* register IOMMU exclusion range settings for device */
-    if ( !rc && limit >= iommu_top  )
+    else
     {
-        rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                           false /* all */, iw, ir);
         ivrs_mappings[bdf].dte_allow_exclusion = true;
         ivrs_mappings[req].dte_allow_exclusion = true;
     }
@@ -276,53 +265,42 @@ static int __init register_exclusion_ran
     return rc;
 }
 
-static int __init register_exclusion_range_for_iommu_devices(
-    struct amd_iommu *iommu,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+static int __init register_range_for_iommu_devices(
+    struct amd_iommu *iommu, paddr_t base, paddr_t limit,
+    bool iw, bool ir, bool exclusion)
 {
-    unsigned long range_top, iommu_top, length;
+    /* note: 'limit' parameter is assumed to be page-aligned */
+    paddr_t length = limit + PAGE_SIZE - base;
     unsigned int bdf;
     u16 req;
-    int rc = 0;
+    int rc;
 
-    /* is part of exclusion range inside of IOMMU virtual address space? */
-    /* note: 'limit' parameter is assumed to be page-aligned */
-    range_top = limit + PAGE_SIZE;
-    iommu_top = max_page * PAGE_SIZE;
-    if ( base < iommu_top )
-    {
-        if ( range_top > iommu_top )
-            range_top = iommu_top;
-        length = range_top - base;
-        /* reserve r/w unity-mapped page entries for devices */
-        /* note: these entries are part of the exclusion range */
-        for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
-        {
-            if ( iommu == find_iommu_for_device(iommu->seg, bdf) )
-            {
-                req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
-                rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
-                                                  iw, ir) ?:
-                     reserve_unity_map_for_device(iommu->seg, req, base, length,
-                                                  iw, ir);
-            }
-        }
-
-        /* push 'base' just outside of virtual address space */
-        base = iommu_top;
+    if ( exclusion )
+    {
+        rc = reserve_iommu_exclusion_range(iommu, base, limit, true /* all */);
+        if ( !rc )
+            return 0;
     }
 
-    /* register IOMMU exclusion range settings */
-    if ( !rc && limit >= iommu_top )
-        rc = reserve_iommu_exclusion_range(iommu, base, limit,
-                                           true /* all */, iw, ir);
+    /* reserve unity-mapped page entries for devices */
+    for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ )
+    {
+        if ( iommu != find_iommu_for_device(iommu->seg, bdf) )
+            continue;
+
+        req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id;
+        rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length,
+                                          iw, ir) ?:
+             reserve_unity_map_for_device(iommu->seg, req, base, length,
+                                          iw, ir);
+    }
 
     return rc;
 }
 
 static int __init parse_ivmd_device_select(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     u16 bdf;
 
@@ -333,12 +311,12 @@ static int __init parse_ivmd_device_sele
         return -ENODEV;
     }
 
-    return register_exclusion_range_for_device(bdf, base, limit, iw, ir);
+    return register_range_for_device(bdf, base, limit, iw, ir, exclusion);
 }
 
 static int __init parse_ivmd_device_range(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     unsigned int first_bdf, last_bdf, bdf;
     int error;
@@ -360,15 +338,15 @@ static int __init parse_ivmd_device_rang
     }
 
     for ( bdf = first_bdf, error = 0; (bdf <= last_bdf) && !error; bdf++ )
-        error = register_exclusion_range_for_device(
-            bdf, base, limit, iw, ir);
+        error = register_range_for_device(
+            bdf, base, limit, iw, ir, exclusion);
 
     return error;
 }
 
 static int __init parse_ivmd_device_iommu(
     const struct acpi_ivrs_memory *ivmd_block,
-    unsigned long base, unsigned long limit, u8 iw, u8 ir)
+    paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion)
 {
     int seg = 0; /* XXX */
     struct amd_iommu *iommu;
@@ -383,14 +361,14 @@ static int __init parse_ivmd_device_iomm
         return -ENODEV;
     }
 
-    return register_exclusion_range_for_iommu_devices(
-        iommu, base, limit, iw, ir);
+    return register_range_for_iommu_devices(
+        iommu, base, limit, iw, ir, exclusion);
 }
 
 static int __init parse_ivmd_block(const struct acpi_ivrs_memory *ivmd_block)
 {
     unsigned long start_addr, mem_length, base, limit;
-    u8 iw, ir;
+    bool iw = true, ir = true, exclusion = false;
 
     if ( ivmd_block->header.length < sizeof(*ivmd_block) )
     {
@@ -407,13 +385,11 @@ static int __init parse_ivmd_block(const
                     ivmd_block->header.type, start_addr, mem_length);
 
     if ( ivmd_block->header.flags & ACPI_IVMD_EXCLUSION_RANGE )
-        iw = ir = IOMMU_CONTROL_ENABLED;
+        exclusion = true;
     else if ( ivmd_block->header.flags & ACPI_IVMD_UNITY )
     {
-        iw = ivmd_block->header.flags & ACPI_IVMD_READ ?
-            IOMMU_CONTROL_ENABLED : IOMMU_CONTROL_DISABLED;
-        ir = ivmd_block->header.flags & ACPI_IVMD_WRITE ?
-            IOMMU_CONTROL_ENABLED : IOMMU_CONTROL_DISABLED;
+        iw = ivmd_block->header.flags & ACPI_IVMD_READ;
+        ir = ivmd_block->header.flags & ACPI_IVMD_WRITE;
     }
     else
     {
@@ -424,20 +400,20 @@ static int __init parse_ivmd_block(const
     switch( ivmd_block->header.type )
     {
     case ACPI_IVRS_TYPE_MEMORY_ALL:
-        return register_exclusion_range_for_all_devices(
-            base, limit, iw, ir);
+        return register_range_for_all_devices(
+            base, limit, iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_ONE:
-        return parse_ivmd_device_select(ivmd_block,
-                                        base, limit, iw, ir);
+        return parse_ivmd_device_select(ivmd_block, base, limit,
+                                        iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_RANGE:
-        return parse_ivmd_device_range(ivmd_block,
-                                       base, limit, iw, ir);
+        return parse_ivmd_device_range(ivmd_block, base, limit,
+                                       iw, ir, exclusion);
 
     case ACPI_IVRS_TYPE_MEMORY_IOMMU:
-        return parse_ivmd_device_iommu(ivmd_block,
-                                       base, limit, iw, ir);
+        return parse_ivmd_device_iommu(ivmd_block, base, limit,
+                                       iw, ir, exclusion);
 
     default:
         AMD_IOMMU_DEBUG("IVMD Error: Invalid Block Type!\n");
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -246,6 +246,8 @@ int amd_iommu_alloc_root(struct domain *
     return 0;
 }
 
+int __read_mostly amd_iommu_min_paging_mode = 1;
+
 static int amd_iommu_domain_init(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
@@ -257,11 +259,13 @@ static int amd_iommu_domain_init(struct
      * - HVM could in principle use 3 or 4 depending on how much guest
      *   physical address space we give it, but this isn't known yet so use 4
      *   unilaterally.
+     * - Unity maps may require an even higher number.
      */
-    hd->arch.amd.paging_mode = amd_iommu_get_paging_mode(
-        is_hvm_domain(d)
-        ? 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
-        : get_upper_mfn_bound() + 1);
+    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
+            is_hvm_domain(d)
+            ? 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
+            : get_upper_mfn_bound() + 1),
+        amd_iommu_min_paging_mode);
 
     return 0;
 }
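
The interplay in this final hunk is worth unpacking:
amd_iommu_get_paging_mode() picks the smallest page-table depth able
to map a given number of frames, and the new amd_iommu_min_paging_mode
floor keeps every domain's tables deep enough to reach the highest
unity-mapped address.  Below is a minimal standalone sketch of the
depth computation, assuming the conventional 512-entry tables and
six-level ceiling; it is illustrative, not the verbatim Xen
implementation.

#include <errno.h>

#define PTE_PER_TABLE_SHIFT 9                  /* 512 entries per level */
#define PTE_PER_TABLE_SIZE  (1UL << PTE_PER_TABLE_SHIFT)

/* Smallest paging mode (page-table depth) able to map max_frames frames. */
static int get_paging_mode(unsigned long max_frames)
{
    int level = 1;

    while ( max_frames > PTE_PER_TABLE_SIZE )
    {
        /* Each extra level resolves 9 more bits; round up, don't truncate. */
        max_frames = (max_frames + PTE_PER_TABLE_SIZE - 1) >> PTE_PER_TABLE_SHIFT;
        if ( ++level > 6 )
            return -ENODATA;                   /* beyond the IOMMU's reach */
    }

    return level;
}

int main(void)
{
    /* 512 frames fit in one level; 513 need two. */
    return !(get_paging_mode(512) == 1 && get_paging_mode(513) == 2);
}

With that shape in mind, the patch reads naturally:
reserve_unity_map_for_device() computes the mode needed to cover the
end of each unity range, records the maximum in
amd_iommu_min_paging_mode, and amd_iommu_domain_init() then refuses to
set up any domain's tables shallower than that.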

[-- Attachment #58: xsa378/xsa378-7.patch --]
[-- Type: application/octet-stream, Size: 3406 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: introduce p2m_is_special()

Seeing the similarity of grant, foreign, and (subsequently) direct-MMIO
handling, introduce a new P2M type group named "special" (as in "needing
special accessors to create/destroy").

Also use -EPERM instead of other error codes on the two domain_crash()
paths touched.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
---
v8: New, split from subsequent patch.

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -817,7 +817,7 @@ p2m_remove_page(struct p2m_domain *p2m,
         for ( i = 0; i < (1UL << page_order); i++ )
         {
             p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, NULL, NULL);
-            if ( !p2m_is_grant(t) && !p2m_is_shared(t) && !p2m_is_foreign(t) )
+            if ( !p2m_is_special(t) && !p2m_is_shared(t) )
                 set_gpfn_from_mfn(mfn_x(mfn) + i, INVALID_M2P_ENTRY);
         }
     }
@@ -954,13 +954,13 @@ guest_physmap_add_entry(struct domain *d
                                   &ot, &a, 0, NULL, NULL);
             ASSERT(!p2m_is_shared(ot));
         }
-        if ( p2m_is_grant(ot) || p2m_is_foreign(ot) )
+        if ( p2m_is_special(ot) )
         {
-            /* Really shouldn't be unmapping grant/foreign maps this way */
+            /* Don't permit unmapping grant/foreign this way. */
             domain_crash(d);
             p2m_unlock(p2m);
             
-            return -EINVAL;
+            return -EPERM;
         }
         else if ( p2m_is_ram(ot) && !p2m_is_paged(ot) )
         {
@@ -1053,8 +1053,7 @@ int p2m_change_type_one(struct domain *d
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
     int rc;
 
-    BUG_ON(p2m_is_grant(ot) || p2m_is_grant(nt));
-    BUG_ON(p2m_is_foreign(ot) || p2m_is_foreign(nt));
+    BUG_ON(p2m_is_special(ot) || p2m_is_special(nt));
 
     gfn_lock(p2m, gfn, 0);
 
@@ -1300,11 +1299,11 @@ static int set_typed_p2m_entry(struct do
         gfn_unlock(p2m, gfn, order);
         return cur_order + 1;
     }
-    if ( p2m_is_grant(ot) || p2m_is_foreign(ot) )
+    if ( p2m_is_special(ot) )
     {
         gfn_unlock(p2m, gfn, order);
         domain_crash(d);
-        return -ENOENT;
+        return -EPERM;
     }
     else if ( p2m_is_ram(ot) )
     {
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -154,6 +154,10 @@ typedef unsigned int p2m_query_t;
                             | p2m_to_mask(p2m_ram_logdirty) )
 #define P2M_SHARED_TYPES   (p2m_to_mask(p2m_ram_shared))
 
+/* Types established/cleaned up via special accessors. */
+#define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \
+                           p2m_to_mask(p2m_map_foreign))
+
 /* Valid types not necessarily associated with a (valid) MFN. */
 #define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES                  \
                                | p2m_to_mask(p2m_mmio_direct) \
@@ -182,6 +186,7 @@ typedef unsigned int p2m_query_t;
 #define p2m_is_paged(_t)    (p2m_to_mask(_t) & P2M_PAGED_TYPES)
 #define p2m_is_sharable(_t) (p2m_to_mask(_t) & P2M_SHARABLE_TYPES)
 #define p2m_is_shared(_t)   (p2m_to_mask(_t) & P2M_SHARED_TYPES)
+#define p2m_is_special(_t)  (p2m_to_mask(_t) & P2M_SPECIAL_TYPES)
 #define p2m_is_broken(_t)   (p2m_to_mask(_t) & P2M_BROKEN_TYPES)
 #define p2m_is_foreign(_t)  (p2m_to_mask(_t) & p2m_to_mask(p2m_map_foreign))
 

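For readers not steeped in the p2m machinery these hunks touch: each
p2m type is a small enumerator, and the P2M_*_TYPES groups are
bitmasks, so a class test such as p2m_is_special() is a single AND.  A
self-contained sketch follows, using an illustrative subset of the
enumeration rather than the real one.

/* p2m_to_mask() turns a type into a single bit, so class-membership
 * tests compile down to one AND against a precomputed mask. */
typedef enum {
    p2m_ram_rw,
    p2m_grant_map_rw,
    p2m_grant_map_ro,
    p2m_map_foreign,
    p2m_mmio_direct,
} p2m_type_t;

#define p2m_to_mask(t)     (1UL << (t))

#define P2M_GRANT_TYPES    (p2m_to_mask(p2m_grant_map_rw) | \
                            p2m_to_mask(p2m_grant_map_ro))

/* Types established/cleaned up only via special accessors: grant and
 * foreign here, with the follow-on patch ORing in p2m_mmio_direct. */
#define P2M_SPECIAL_TYPES  (P2M_GRANT_TYPES | \
                            p2m_to_mask(p2m_map_foreign))

#define p2m_is_special(t)  (p2m_to_mask(t) & P2M_SPECIAL_TYPES)

This is also why the next patch can widen the guarded set with a
two-line change to P2M_SPECIAL_TYPES: every p2m_is_special() call site
picks up the new member for free.
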
[-- Attachment #59: xsa378/xsa378-8.patch --]
[-- Type: application/octet-stream, Size: 6656 bytes --]

From: Jan Beulich <jbeulich@suse.com>
Subject: x86/p2m: guard (in particular) identity mapping entries

Such entries, created by set_identity_p2m_entry(), should only be
destroyed by clear_identity_p2m_entry(). Similarly, entries created by
set_mmio_p2m_entry() should only be torn down by
clear_mmio_p2m_entry(), so the logic is based upon p2m_mmio_direct as
the entry type (separating "ordinary" from 1:1 mappings would require
a further indicator to tell the two apart).

As to the guest_remove_page() change, commit 48dfb297a20a ("x86/PVH:
allow guest_remove_page to remove p2m_mmio_direct pages"), which
introduced the call to clear_mmio_p2m_entry(), claimed this was done
for hwdom only, which was not actually the case. However, this code
shouldn't be there in the first place, as MMIO entries shouldn't be
dropped this way. Avoid re-triggering the warning that 48dfb297a20a
silenced by adjusting xenmem_add_to_physmap_one() instead.

Note that guest_physmap_mark_populate_on_demand() gets tightened beyond
the immediate purpose of this change.

Note also that I didn't inspect code which isn't security supported,
e.g. sharing, paging, or altp2m.

This is CVE-2021-28694 / part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
---
v8: Split off introduction of p2m_is_special().
v7: Re-base.
v6: Also adjust xenmem_add_to_physmap_one(). Fully drop p2m_mmio_direct
    special handling from guest_remove_page(); just return an error.
    Make clear_mmio_p2m_entry() static.
v5: New.
---
Backporting note: While likely noticable by patch conflicts anyway, this
depends on one or both of a6b051a87a58 ("x86/p2m: don't ignore
p2m_remove_page()'s return value") and c65ea16dbcaf ("x86/p2m: don't
assert that the passed in MFN matches for a remove").

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -805,7 +805,8 @@ p2m_remove_page(struct p2m_domain *p2m,
                                           &cur_order, NULL);
 
         if ( p2m_is_valid(t) &&
-             (!mfn_valid(mfn) || !mfn_eq(mfn_add(mfn, i), mfn_return)) )
+             (!mfn_valid(mfn) || t == p2m_mmio_direct ||
+              !mfn_eq(mfn_add(mfn, i), mfn_return)) )
             return -EILSEQ;
 
         i += (1UL << cur_order) -
@@ -912,7 +913,7 @@ guest_physmap_add_entry(struct domain *d
     if ( p2m_is_foreign(t) )
         return -EINVAL;
 
-    if ( !mfn_valid(mfn) )
+    if ( !mfn_valid(mfn) || t == p2m_mmio_direct )
     {
         ASSERT_UNREACHABLE();
         return -EINVAL;
@@ -956,7 +957,7 @@ guest_physmap_add_entry(struct domain *d
         }
         if ( p2m_is_special(ot) )
         {
-            /* Don't permit unmapping grant/foreign this way. */
+            /* Don't permit unmapping grant/foreign/direct-MMIO this way. */
             domain_crash(d);
             p2m_unlock(p2m);
             
@@ -1364,8 +1365,8 @@ int set_mmio_p2m_entry(struct domain *d,
  *    order+1  for caller to retry with order (guaranteed smaller than
  *             the order value passed in)
  */
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn_l, mfn_t mfn,
-                         unsigned int order)
+static int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn_l,
+                                mfn_t mfn, unsigned int order)
 {
     int rc = -EINVAL;
     gfn_t gfn = _gfn(gfn_l);
@@ -2766,7 +2767,9 @@ int xenmem_add_to_physmap_one(
 
     /* Remove previously mapped page if it was present. */
     prev_mfn = get_gfn(d, gfn_x(gpfn), &p2mt);
-    if ( mfn_valid(prev_mfn) )
+    if ( p2mt == p2m_mmio_direct )
+        rc = -EPERM;
+    else if ( mfn_valid(prev_mfn) )
     {
         if ( is_special_page(mfn_to_page(prev_mfn)) )
             /* Special pages are simply unhooked from this phys slot. */
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -1299,17 +1299,17 @@ guest_physmap_mark_populate_on_demand(st
 
         p2m->get_entry(p2m, gfn_add(gfn, i), &ot, &a, 0, &cur_order, NULL);
         n = 1UL << min(order, cur_order);
-        if ( p2m_is_ram(ot) )
+        if ( ot == p2m_populate_on_demand )
+        {
+            /* Count how many PoD entries we'll be replacing if successful */
+            pod_count += n;
+        }
+        else if ( ot != p2m_invalid && ot != p2m_mmio_dm )
         {
             P2M_DEBUG("gfn_to_mfn returned type %d!\n", ot);
             rc = -EBUSY;
             goto out;
         }
-        else if ( ot == p2m_populate_on_demand )
-        {
-            /* Count how man PoD entries we'll be replacing if successful */
-            pod_count += n;
-        }
     }
 
     /* Now, actually do the two-way mapping */
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -332,7 +332,7 @@ int guest_remove_page(struct domain *d,
     }
     if ( p2mt == p2m_mmio_direct )
     {
-        rc = clear_mmio_p2m_entry(d, gmfn, mfn, PAGE_ORDER_4K);
+        rc = -EPERM;
         goto out_put_gfn;
     }
 #else
@@ -1888,6 +1888,15 @@ int check_get_page_from_gfn(struct domai
         return -EAGAIN;
     }
 #endif
+#ifdef CONFIG_X86
+    if ( p2mt == p2m_mmio_direct )
+    {
+        if ( page )
+            put_page(page);
+
+        return -EPERM;
+    }
+#endif
 
     if ( !page )
         return -EINVAL;
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -156,7 +156,8 @@ typedef unsigned int p2m_query_t;
 
 /* Types established/cleaned up via special accessors. */
 #define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \
-                           p2m_to_mask(p2m_map_foreign))
+                           p2m_to_mask(p2m_map_foreign) | \
+                           p2m_to_mask(p2m_mmio_direct))
 
 /* Valid types not necessarily associated with a (valid) MFN. */
 #define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES                  \
@@ -627,19 +628,9 @@ int p2m_finish_type_change(struct domain
 int p2m_is_logdirty_range(struct p2m_domain *, unsigned long start,
                           unsigned long end);
 
-#ifdef CONFIG_HVM
 /* Set mmio addresses in the p2m table (for pass-through) */
 int set_mmio_p2m_entry(struct domain *d, gfn_t gfn, mfn_t mfn,
                        unsigned int order);
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
-                         unsigned int order);
-#else
-static inline int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn,
-                                       mfn_t mfn, unsigned int order)
-{
-    return -EIO;
-}
-#endif
 
 /* Set identity addresses in the p2m table (for pass-through) */
 int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
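
One subtlety in the common-code hunk above: by the time
check_get_page_from_gfn() learns that a gfn's type is p2m_mmio_direct,
it may already hold a page reference, so the guard has to drop that
reference before failing.  A hedged sketch of the shape, with
hypothetical stand-ins for the real p2m lookup and refcounting
helpers:

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct page_info { int refs; };

static struct page_info dummy;

/* Stand-in lookup: pretend one particular gfn is a direct-MMIO
 * mapping, and hand out a reference unconditionally. */
static struct page_info *get_page_from_gfn(unsigned long gfn, bool *mmio)
{
    *mmio = (gfn == 0xfee00);
    dummy.refs++;
    return &dummy;
}

static void put_page(struct page_info *pg) { pg->refs--; }

/* The guard's shape: a reference may already be held once the type is
 * known, so release it before returning -EPERM. */
static int checked_get(unsigned long gfn, struct page_info **page)
{
    bool mmio;

    *page = get_page_from_gfn(gfn, &mmio);
    if ( mmio )
    {
        if ( *page )
            put_page(*page);
        *page = NULL;
        return -EPERM;
    }

    return *page ? 0 : -EINVAL;
}

int main(void)
{
    struct page_info *pg;
    int rc = checked_get(0xfee00, &pg);

    /* Typically prints "-1 0": the lookup is refused without leaking
     * the reference that was transiently taken. */
    printf("%d %d\n", rc, dummy.refs);
    return 0;
}

The same refusal (-EPERM rather than a silent teardown) now shows up
consistently in guest_remove_page(), xenmem_add_to_physmap_one(), and
the p2m paths, which is the CVE-2021-28694 fix this commit message
describes.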


* Re: Xen Security Advisory 378 v3 (CVE-2021-28694,CVE-2021-28695,CVE-2021-28696) - IOMMU page mapping issues on x86
  2021-09-01  9:30 Xen Security Advisory 378 v3 (CVE-2021-28694,CVE-2021-28695,CVE-2021-28696) - IOMMU page mapping issues on x86 Xen.org security team
@ 2021-09-01 13:22 ` Jason Andryuk
  2021-09-01 13:45   ` Jan Beulich
  2021-09-01 14:15   ` Andrew Cooper
  0 siblings, 2 replies; 4+ messages in thread
From: Jason Andryuk @ 2021-09-01 13:22 UTC (permalink / raw)
  To: Xen.org security team
  Cc: xen-announce, xen-devel, xen-users, oss-security, Xen.org security team

On Wed, Sep 1, 2021 at 5:34 AM Xen.org security team <security@xen.org> wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
>  Xen Security Advisory CVE-2021-28694,CVE-2021-28695,CVE-2021-28696 / XSA-378
>                                    version 3
>
>                    IOMMU page mapping issues on x86
>
> UPDATES IN VERSION 3
> ====================
>
> Warn about dom0=pvh breakage in Resolution section.
>
> ISSUE DESCRIPTION
> =================
>
> Both AMD and Intel allow ACPI tables to specify regions of memory
> which should be left untranslated, which typically means these
> addresses should pass the translation phase unaltered.  While these
> are typically device specific ACPI properties, they can also be
> specified to apply to a range of devices, or even all devices.
>
> On all systems with such regions Xen failed to prevent guests from
> undoing/replacing such mappings (CVE-2021-28694).

Hi,

Is there a way to identify whether a system's ACPI tables have
untranslated regions?  Does it show up in the Xen or Linux dmesg, or
can it be identified in sysfs?

Thanks,
Jason



* Re: Xen Security Advisory 378 v3 (CVE-2021-28694,CVE-2021-28695,CVE-2021-28696) - IOMMU page mapping issues on x86
  2021-09-01 13:22 ` Jason Andryuk
@ 2021-09-01 13:45   ` Jan Beulich
  2021-09-01 14:15   ` Andrew Cooper
  1 sibling, 0 replies; 4+ messages in thread
From: Jan Beulich @ 2021-09-01 13:45 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: xen-devel

(removing all lists inappropriate for a question like this one)

On 01.09.2021 15:22, Jason Andryuk wrote:
> On Wed, Sep 1, 2021 at 5:34 AM Xen.org security team <security@xen.org> wrote:
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>>  Xen Security Advisory CVE-2021-28694,CVE-2021-28695,CVE-2021-28696 / XSA-378
>>                                    version 3
>>
>>                    IOMMU page mapping issues on x86
>>
>> UPDATES IN VERSION 3
>> ====================
>>
>> Warn about dom0=pvh breakage in Resolution section.
>>
>> ISSUE DESCRIPTION
>> =================
>>
>> Both AMD and Intel allow ACPI tables to specify regions of memory
>> which should be left untranslated, which typically means these
>> addresses should pass the translation phase unaltered.  While these
>> are typically device specific ACPI properties, they can also be
>> specified to apply to a range of devices, or even all devices.
>>
>> On all systems with such regions Xen failed to prevent guests from
>> undoing/replacing such mappings (CVE-2021-28694).
> 
> Is there a way to identify if a system's ACPI tables have untranslated
> regions?  Does it show up in xen or linux dmesg or can it be
> identified in sysfs?

For VT-d, "iommu=verbose" will cause the ACPI table contents to be
logged.  For AMD you need to go one step further and set "iommu=debug".
Obviously you'll want to be careful about enabling anything like this
on production systems.

Jan




* Re: Xen Security Advisory 378 v3 (CVE-2021-28694,CVE-2021-28695,CVE-2021-28696) - IOMMU page mapping issues on x86
  2021-09-01 13:22 ` Jason Andryuk
  2021-09-01 13:45   ` Jan Beulich
@ 2021-09-01 14:15   ` Andrew Cooper
  1 sibling, 0 replies; 4+ messages in thread
From: Andrew Cooper @ 2021-09-01 14:15 UTC (permalink / raw)
  To: Jason Andryuk, Xen.org security team
  Cc: xen-announce, xen-devel, xen-users, oss-security, Xen.org security team

On 01/09/2021 14:22, Jason Andryuk wrote:
> On Wed, Sep 1, 2021 at 5:34 AM Xen.org security team <security@xen.org> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>>  Xen Security Advisory CVE-2021-28694,CVE-2021-28695,CVE-2021-28696 / XSA-378
>>                                    version 3
>>
>>                    IOMMU page mapping issues on x86
>>
>> UPDATES IN VERSION 3
>> ====================
>>
>> Warn about dom0=pvh breakage in Resolution section.
>>
>> ISSUE DESCRIPTION
>> =================
>>
>> Both AMD and Intel allow ACPI tables to specify regions of memory
>> which should be left untranslated, which typically means these
>> addresses should pass the translation phase unaltered.  While these
>> are typically device specific ACPI properties, they can also be
>> specified to apply to a range of devices, or even all devices.
>>
>> On all systems with such regions Xen failed to prevent guests from
>> undoing/replacing such mappings (CVE-2021-28694).
> Hi,
>
> Is there a way to identify if a system's ACPI tables have untranslated
> regions?  Does it show up in xen or linux dmesg or can it be
> identified in sysfs?

It's possible, but a little convoluted to do.  In dom0 (and in an empty
directory) you want:

acpidump > acpi.dmp
acpixtract -a acpi.dmp

On Intel, open up rmad.dat and hexedit the first 4 bytes from RMAD to
DMAR (yes - really - this is how we stop the dom0 kernel from trying to
poke the IOMMU directly).

Then disassemble (iasl -d) either rmad.dat or ivrs.dat depending on
whether you're on Intel or AMD.

On Intel, you're looking for Reserved Memory Regions, while on AMD
you're looking for IVMD ranges (specifically, types 0x20 through 0x22).

These, if present, describe a range of memory needing identity mapping,
and the scope of the PCI device(s) the range applies to.

~Andrew


