All of lore.kernel.org
 help / color / mirror / Atom feed
* [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
@ 2020-03-09 11:09 Jan Beulich
  2020-03-10  3:43 ` Tian, Kevin
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Jan Beulich @ 2020-03-09 11:09 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Kevin Tian, Paul Durrant

Containing still in flight DMA was introduced to work around certain
devices / systems hanging hard upon hitting a "not-present" IOMMU fault.
Passing through (such) devices (on such systems) is inherently insecure
(as guests could easily arrange for IOMMU faults of any kind to occur).
Defaulting to a mode where admins may not even become aware of issues
with devices can be considered undesirable. Therefore convert this mode
of operation to an optional one, not one enabled by default.

This involves resurrecting code commit ea38867831da ("x86 / iommu: set
up a scratch page in the quarantine domain") did remove, in a slightly
extended and abstracted fashion. Here, instead of reintroducing a pretty
pointless use of "goto" in domain_context_unmap(), and instead of making
the function (at least temporarily) inconsistent, take the opportunity
and replace the other similarly pointless "goto" as well.

In order to key the re-instated bypasses off of there (not) being a root
page table this further requires moving the allocate_domain_resources()
invocation from reassign_device() to amd_iommu_setup_domain_device() (or
else reassign_device() would allocate a root page table anyway); this is
benign to the second caller of the latter function.

Take the opportunity and also limit the control to builds supporting
PCI.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I'm happy to take better suggestions to replace the "full" command line
option and Kconfig prompt tokens. I don't think though that "fault" and
"write-fault" are really suitable there.
---
This patch contextually depends on "[PATCH v2 0/5] IOMMU: restrict
visibility/scope if certain variables".
---
v3: IOMMU_quarantine_basic -> IOMMU_quarantine_fault,
    IOMMU_quarantine_full -> IOMMU_quarantine_write_fault. Kconfig
    option (choice) to select default. Limit to HAS_PCI.
v2: Don't use true/false. Introduce QUARANTINE_SKIP() (albeit I'm not
    really convinced this is an improvement). Add comment.

--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1238,7 +1238,7 @@ detection of systems known to misbehave
 > Default: `new` unless directed-EOI is supported
 
 ### iommu
-    = List of [ <bool>, verbose, debug, force, required, quarantine,
+    = List of [ <bool>, verbose, debug, force, required, quarantine[=full],
                 sharept, intremap, intpost, crash-disable,
                 snoop, qinval, igfx, amd-iommu-perdev-intremap,
                 dom0-{passthrough,strict} ]
@@ -1276,11 +1276,15 @@ boolean (e.g. `iommu=no`) can override t
     will prevent Xen from booting if IOMMUs aren't discovered and enabled
     successfully.
 
-*   The `quarantine` boolean can be used to control Xen's behavior when
-    de-assigning devices from guests.  If enabled (the default), Xen always
-    quarantines such devices; they must be explicitly assigned back to Dom0
-    before they can be used there again.  If disabled, Xen will only
-    quarantine devices the toolstack hass arranged for getting quarantined.
+*   The `quarantine` option can be used to control Xen's behavior when
+    de-assigning devices from guests.  If set to true (the default), Xen
+    always quarantines such devices; they must be explicitly assigned back
+    to Dom0 before they can be used there again.  If set to "full", still
+    active DMA will additionally be directed to a "sink" page.  If set to
+    false, Xen will only quarantine devices the toolstack has arranged for
+    getting quarantined.
+
+    This option is only valid on builds supporting PCI.
 
 *   The `sharept` boolean controls whether the IOMMU pagetables are shared
     with the CPU-side HAP pagetables, or allocated separately.  Sharing
--- a/xen/drivers/passthrough/Kconfig
+++ b/xen/drivers/passthrough/Kconfig
@@ -28,3 +28,31 @@ endif
 
 config IOMMU_FORCE_PT_SHARE
 	bool
+
+choice
+	prompt "IOMMU device quarantining default behavior"
+	depends on HAS_PCI
+	default IOMMU_QUARANTINE_BASIC
+	---help---
+	  When a PCI device is assigned to an untrusted domain, it is possible
+	  for that domain to program the device to DMA to an arbitrary address.
+	  The IOMMU is used to protect the host from malicious DMA by making
+	  sure that the device addresses can only target memory assigned to the
+	  guest.  However, when the guest domain is torn down, assigning the
+	  device back to the hardware domain would allow any in-flight DMA to
+	  potentially target critical host data.  To avoid this, quarantining
+	  shold be enabled.  Quarantining can be done in two ways: In its basic
+	  form, all in-flight DMA will simply be forced to encounter IOMMU
+	  faults.  Since there are systems where doing so can cause host
+	  lockup, an alternative form is available where writes to memory will
+	  be made fault, but reads will be directed to a dummy page.  The
+	  implication here is that such reads will go unnoticed, i.e. an admin
+	  may not become aware of the underlying problem.
+
+	config IOMMU_QUARANTINE_NONE
+		bool "none"
+	config IOMMU_QUARANTINE_BASIC
+		bool "basic"
+	config IOMMU_QUARANTINE_FULL
+		bool "full"
+endchoice
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -25,6 +25,9 @@
 #include "iommu.h"
 #include "../ats.h"
 
+/* dom_io is used as a sentinel for quarantined devices */
+#define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.root_table)
+
 static bool_t __read_mostly init_done;
 
 static const struct iommu_init_ops _iommu_init_ops;
@@ -82,18 +85,35 @@ int get_dma_requestor_id(uint16_t seg, u
     return req_id;
 }
 
-static void amd_iommu_setup_domain_device(
+static int __must_check allocate_domain_resources(struct domain_iommu *hd)
+{
+    int rc;
+
+    spin_lock(&hd->arch.mapping_lock);
+    rc = amd_iommu_alloc_root(hd);
+    spin_unlock(&hd->arch.mapping_lock);
+
+    return rc;
+}
+
+static int __must_check amd_iommu_setup_domain_device(
     struct domain *domain, struct amd_iommu *iommu,
     uint8_t devfn, struct pci_dev *pdev)
 {
     struct amd_iommu_dte *table, *dte;
     unsigned long flags;
-    int req_id, valid = 1;
+    int req_id, valid = 1, rc;
     u8 bus = pdev->bus;
-    const struct domain_iommu *hd = dom_iommu(domain);
+    struct domain_iommu *hd = dom_iommu(domain);
+
+    if ( QUARANTINE_SKIP(domain) )
+        return 0;
+
+    BUG_ON(!hd->arch.paging_mode || !iommu->dev_table.buffer);
 
-    BUG_ON( !hd->arch.root_table || !hd->arch.paging_mode ||
-            !iommu->dev_table.buffer );
+    rc = allocate_domain_resources(hd);
+    if ( rc )
+        return rc;
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
         valid = 0;
@@ -148,6 +168,8 @@ static void amd_iommu_setup_domain_devic
 
         amd_iommu_flush_iotlb(devfn, pdev, INV_IOMMU_ALL_PAGES_ADDRESS, 0);
     }
+
+    return 0;
 }
 
 int __init acpi_ivrs_init(void)
@@ -217,17 +239,6 @@ int amd_iommu_alloc_root(struct domain_i
     return 0;
 }
 
-static int __must_check allocate_domain_resources(struct domain_iommu *hd)
-{
-    int rc;
-
-    spin_lock(&hd->arch.mapping_lock);
-    rc = amd_iommu_alloc_root(hd);
-    spin_unlock(&hd->arch.mapping_lock);
-
-    return rc;
-}
-
 int amd_iommu_get_paging_mode(unsigned long entries)
 {
     int level = 1;
@@ -291,6 +302,9 @@ static void amd_iommu_disable_domain_dev
     int req_id;
     u8 bus = pdev->bus;
 
+    if ( QUARANTINE_SKIP(domain) )
+        return;
+
     BUG_ON ( iommu->dev_table.buffer == NULL );
     req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
     table = iommu->dev_table.buffer;
@@ -337,7 +351,6 @@ static int reassign_device(struct domain
 {
     struct amd_iommu *iommu;
     int bdf, rc;
-    struct domain_iommu *t = dom_iommu(target);
 
     bdf = PCI_BDF2(pdev->bus, pdev->devfn);
     iommu = find_iommu_for_device(pdev->seg, bdf);
@@ -358,11 +371,10 @@ static int reassign_device(struct domain
         pdev->domain = target;
     }
 
-    rc = allocate_domain_resources(t);
+    rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
     if ( rc )
         return rc;
 
-    amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
     AMD_IOMMU_DEBUG("Re-assign %04x:%02x:%02x.%u from dom%d to dom%d\n",
                     pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                     source->domain_id, target->domain_id);
@@ -519,8 +531,7 @@ static int amd_iommu_add_device(u8 devfn
         spin_unlock_irqrestore(&iommu->lock, flags);
     }
 
-    amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
-    return 0;
+    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
 }
 
 static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -31,9 +31,24 @@ bool_t __initdata iommu_enable = 1;
 bool_t __read_mostly iommu_enabled;
 bool_t __read_mostly force_iommu;
 bool_t __read_mostly iommu_verbose;
-bool __read_mostly iommu_quarantine = true;
 bool_t __read_mostly iommu_crash_disable;
 
+#define IOMMU_quarantine_none        0 /* aka false */
+#define IOMMU_quarantine_fault       1 /* aka true */
+#define IOMMU_quarantine_write_fault 2
+#ifdef CONFIG_HAS_PCI
+uint8_t __read_mostly iommu_quarantine =
+# if defined(CONFIG_IOMMU_QUARANTINE_NONE)
+    IOMMU_quarantine_none;
+# elif defined(CONFIG_IOMMU_QUARANTINE_BASIC)
+    IOMMU_quarantine_fault;
+# elif defined(CONFIG_IOMMU_QUARANTINE_FULL)
+    IOMMU_quarantine_write_fault;
+# endif
+#else
+# define iommu_quarantine IOMMU_quarantine_none
+#endif /* CONFIG_HAS_PCI */
+
 static bool __hwdom_initdata iommu_hwdom_none;
 bool __hwdom_initdata iommu_hwdom_strict;
 bool __read_mostly iommu_hwdom_passthrough;
@@ -68,8 +83,12 @@ static int __init parse_iommu_param(cons
         else if ( (val = parse_boolean("force", s, ss)) >= 0 ||
                   (val = parse_boolean("required", s, ss)) >= 0 )
             force_iommu = val;
+#ifdef CONFIG_HAS_PCI
         else if ( (val = parse_boolean("quarantine", s, ss)) >= 0 )
             iommu_quarantine = val;
+        else if ( ss == s + 15 && !strncmp(s, "quarantine=full", 15) )
+            iommu_quarantine = IOMMU_quarantine_write_fault;
+#endif
 #ifdef CONFIG_X86
         else if ( (val = parse_boolean("igfx", s, ss)) >= 0 )
             iommu_igfx = val;
@@ -448,7 +467,7 @@ static int __init iommu_quarantine_init(
     dom_io->options |= XEN_DOMCTL_CDF_iommu;
 
     rc = iommu_domain_init(dom_io, 0);
-    if ( rc )
+    if ( rc || iommu_quarantine < IOMMU_quarantine_write_fault )
         return rc;
 
     if ( !hd->platform_ops->quarantine_init )
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -41,6 +41,9 @@
 #include "vtd.h"
 #include "../ats.h"
 
+/* dom_io is used as a sentinel for quarantined devices */
+#define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.pgd_maddr)
+
 struct mapped_rmrr {
     struct list_head list;
     u64 base, end;
@@ -1294,6 +1297,9 @@ int domain_context_mapping_one(
     int agaw, rc, ret;
     bool_t flush_dev_iotlb;
 
+    if ( QUARANTINE_SKIP(domain) )
+        return 0;
+
     ASSERT(pcidevs_locked());
     spin_lock(&iommu->lock);
     maddr = bus_to_context_maddr(iommu, bus);
@@ -1548,6 +1554,9 @@ int domain_context_unmap_one(
     int iommu_domid, rc, ret;
     bool_t flush_dev_iotlb;
 
+    if ( QUARANTINE_SKIP(domain) )
+        return 0;
+
     ASSERT(pcidevs_locked());
     spin_lock(&iommu->lock);
 
@@ -1609,7 +1618,7 @@ static int domain_context_unmap(struct d
 {
     struct acpi_drhd_unit *drhd;
     struct vtd_iommu *iommu;
-    int ret = 0;
+    int ret;
     u8 seg = pdev->seg, bus = pdev->bus, tmp_bus, tmp_devfn, secbus;
     int found = 0;
 
@@ -1625,14 +1634,12 @@ static int domain_context_unmap(struct d
             printk(VTDPREFIX "d%d:Hostbridge: skip %04x:%02x:%02x.%u unmap\n",
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
-        if ( !is_hardware_domain(domain) )
-            return -EPERM;
-        goto out;
+        return is_hardware_domain(domain) ? 0 : -EPERM;
 
     case DEV_TYPE_PCIe_BRIDGE:
     case DEV_TYPE_PCIe2PCI_BRIDGE:
     case DEV_TYPE_LEGACY_PCI_BRIDGE:
-        goto out;
+        return 0;
 
     case DEV_TYPE_PCIe_ENDPOINT:
         if ( iommu_debug )
@@ -1676,10 +1683,12 @@ static int domain_context_unmap(struct d
         dprintk(XENLOG_ERR VTDPREFIX, "d%d:unknown(%u): %04x:%02x:%02x.%u\n",
                 domain->domain_id, pdev->type,
                 seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
-        ret = -EINVAL;
-        goto out;
+        return -EINVAL;
     }
 
+    if ( QUARANTINE_SKIP(domain) )
+        return ret;
+
     /*
      * if no other devices under the same iommu owned by this domain,
      * clear iommu in iommu_bitmap and clear domain_id in domid_bitmp
@@ -1705,16 +1714,12 @@ static int domain_context_unmap(struct d
 
         iommu_domid = domain_iommu_domid(domain, iommu);
         if ( iommu_domid == -1 )
-        {
-            ret = -EINVAL;
-            goto out;
-        }
+            return -EINVAL;
 
         clear_bit(iommu_domid, iommu->domid_bitmap);
         iommu->domid_map[iommu_domid] = 0;
     }
 
-out:
     return ret;
 }
 
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -53,7 +53,9 @@ static inline bool_t dfn_eq(dfn_t x, dfn
 }
 
 extern bool_t iommu_enable, iommu_enabled;
-extern bool force_iommu, iommu_quarantine, iommu_verbose;
+extern bool force_iommu, iommu_verbose;
+/* Boolean except for the specific purposes of drivers/passthrough/iommu.c. */
+extern uint8_t iommu_quarantine;
 
 #ifdef CONFIG_X86
 extern enum __packed iommu_intremap {

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-09 11:09 [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional Jan Beulich
@ 2020-03-10  3:43 ` Tian, Kevin
  2020-03-10 10:26   ` Jan Beulich
  2020-03-10  5:30 ` Tian, Kevin
  2020-03-10  8:58 ` Paul Durrant
  2 siblings, 1 reply; 21+ messages in thread
From: Tian, Kevin @ 2020-03-10  3:43 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Andrew Cooper, Paul Durrant

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, March 9, 2020 7:09 PM
> 
> Containing still in flight DMA was introduced to work around certain
> devices / systems hanging hard upon hitting a "not-present" IOMMU fault.
> Passing through (such) devices (on such systems) is inherently insecure
> (as guests could easily arrange for IOMMU faults of any kind to occur).
> Defaulting to a mode where admins may not even become aware of issues
> with devices can be considered undesirable. Therefore convert this mode
> of operation to an optional one, not one enabled by default.
> 
> This involves resurrecting code commit ea38867831da ("x86 / iommu: set
> up a scratch page in the quarantine domain") did remove, in a slightly
> extended and abstracted fashion. Here, instead of reintroducing a pretty
> pointless use of "goto" in domain_context_unmap(), and instead of making
> the function (at least temporarily) inconsistent, take the opportunity
> and replace the other similarly pointless "goto" as well.
> 
> In order to key the re-instated bypasses off of there (not) being a root
> page table this further requires moving the allocate_domain_resources()
> invocation from reassign_device() to amd_iommu_setup_domain_device()
> (or
> else reassign_device() would allocate a root page table anyway); this is
> benign to the second caller of the latter function.
> 
> Take the opportunity and also limit the control to builds supporting
> PCI.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I'm happy to take better suggestions to replace the "full" command line
> option and Kconfig prompt tokens. I don't think though that "fault" and
> "write-fault" are really suitable there.

I think we may just allow both r/w access to scratch page for such bogus
device, which may make 'full' more reasonable since we now fully
contain in-fly DMAs. I'm not sure about the value of keeping write-fault
alone for such devices (just because one observed his specific device only 
has problem with read-fault).

alternatively I also thought about whether whitelisting the problematic 
devices through another option (e.g. nofault=b:d:f) could provide more
value. In concept any IOMMU page table (dom0, dom_io or domU) 
for such bogus device should not include invalid entry, even when 
quarantine is not specified. However I'm not sure whether it's worthy of 
going so far...

> ---
> This patch contextually depends on "[PATCH v2 0/5] IOMMU: restrict
> visibility/scope if certain variables".
> ---
> v3: IOMMU_quarantine_basic -> IOMMU_quarantine_fault,
>     IOMMU_quarantine_full -> IOMMU_quarantine_write_fault. Kconfig
>     option (choice) to select default. Limit to HAS_PCI.
> v2: Don't use true/false. Introduce QUARANTINE_SKIP() (albeit I'm not
>     really convinced this is an improvement). Add comment.
> 
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -1238,7 +1238,7 @@ detection of systems known to misbehave
>  > Default: `new` unless directed-EOI is supported
> 
>  ### iommu
> -    = List of [ <bool>, verbose, debug, force, required, quarantine,
> +    = List of [ <bool>, verbose, debug, force, required, quarantine[=full],
>                  sharept, intremap, intpost, crash-disable,
>                  snoop, qinval, igfx, amd-iommu-perdev-intremap,
>                  dom0-{passthrough,strict} ]
> @@ -1276,11 +1276,15 @@ boolean (e.g. `iommu=no`) can override t
>      will prevent Xen from booting if IOMMUs aren't discovered and enabled
>      successfully.
> 
> -*   The `quarantine` boolean can be used to control Xen's behavior when
> -    de-assigning devices from guests.  If enabled (the default), Xen always
> -    quarantines such devices; they must be explicitly assigned back to Dom0
> -    before they can be used there again.  If disabled, Xen will only
> -    quarantine devices the toolstack hass arranged for getting quarantined.
> +*   The `quarantine` option can be used to control Xen's behavior when
> +    de-assigning devices from guests.  If set to true (the default), Xen
> +    always quarantines such devices; they must be explicitly assigned back
> +    to Dom0 before they can be used there again.  If set to "full", still
> +    active DMA will additionally be directed to a "sink" page.  If set to
> +    false, Xen will only quarantine devices the toolstack has arranged for
> +    getting quarantined.
> +
> +    This option is only valid on builds supporting PCI.
> 
>  *   The `sharept` boolean controls whether the IOMMU pagetables are
> shared
>      with the CPU-side HAP pagetables, or allocated separately.  Sharing
> --- a/xen/drivers/passthrough/Kconfig
> +++ b/xen/drivers/passthrough/Kconfig
> @@ -28,3 +28,31 @@ endif
> 
>  config IOMMU_FORCE_PT_SHARE
>  	bool
> +
> +choice
> +	prompt "IOMMU device quarantining default behavior"
> +	depends on HAS_PCI
> +	default IOMMU_QUARANTINE_BASIC
> +	---help---
> +	  When a PCI device is assigned to an untrusted domain, it is possible
> +	  for that domain to program the device to DMA to an arbitrary
> address.
> +	  The IOMMU is used to protect the host from malicious DMA by
> making
> +	  sure that the device addresses can only target memory assigned to
> the
> +	  guest.  However, when the guest domain is torn down, assigning the
> +	  device back to the hardware domain would allow any in-flight DMA
> to
> +	  potentially target critical host data.  To avoid this, quarantining
> +	  shold be enabled.  Quarantining can be done in two ways: In its
> basic
> +	  form, all in-flight DMA will simply be forced to encounter IOMMU
> +	  faults.  Since there are systems where doing so can cause host
> +	  lockup, an alternative form is available where writes to memory will
> +	  be made fault, but reads will be directed to a dummy page.  The
> +	  implication here is that such reads will go unnoticed, i.e. an admin
> +	  may not become aware of the underlying problem.
> +
> +	config IOMMU_QUARANTINE_NONE
> +		bool "none"
> +	config IOMMU_QUARANTINE_BASIC
> +		bool "basic"
> +	config IOMMU_QUARANTINE_FULL
> +		bool "full"
> +endchoice
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -25,6 +25,9 @@
>  #include "iommu.h"
>  #include "../ats.h"
> 
> +/* dom_io is used as a sentinel for quarantined devices */
> +#define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)-
> >arch.root_table)
> +
>  static bool_t __read_mostly init_done;
> 
>  static const struct iommu_init_ops _iommu_init_ops;
> @@ -82,18 +85,35 @@ int get_dma_requestor_id(uint16_t seg, u
>      return req_id;
>  }
> 
> -static void amd_iommu_setup_domain_device(
> +static int __must_check allocate_domain_resources(struct domain_iommu
> *hd)
> +{
> +    int rc;
> +
> +    spin_lock(&hd->arch.mapping_lock);
> +    rc = amd_iommu_alloc_root(hd);
> +    spin_unlock(&hd->arch.mapping_lock);
> +
> +    return rc;
> +}
> +
> +static int __must_check amd_iommu_setup_domain_device(
>      struct domain *domain, struct amd_iommu *iommu,
>      uint8_t devfn, struct pci_dev *pdev)
>  {
>      struct amd_iommu_dte *table, *dte;
>      unsigned long flags;
> -    int req_id, valid = 1;
> +    int req_id, valid = 1, rc;
>      u8 bus = pdev->bus;
> -    const struct domain_iommu *hd = dom_iommu(domain);
> +    struct domain_iommu *hd = dom_iommu(domain);
> +
> +    if ( QUARANTINE_SKIP(domain) )
> +        return 0;
> +
> +    BUG_ON(!hd->arch.paging_mode || !iommu->dev_table.buffer);
> 
> -    BUG_ON( !hd->arch.root_table || !hd->arch.paging_mode ||
> -            !iommu->dev_table.buffer );
> +    rc = allocate_domain_resources(hd);
> +    if ( rc )
> +        return rc;
> 
>      if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
>          valid = 0;
> @@ -148,6 +168,8 @@ static void amd_iommu_setup_domain_devic
> 
>          amd_iommu_flush_iotlb(devfn, pdev,
> INV_IOMMU_ALL_PAGES_ADDRESS, 0);
>      }
> +
> +    return 0;
>  }
> 
>  int __init acpi_ivrs_init(void)
> @@ -217,17 +239,6 @@ int amd_iommu_alloc_root(struct domain_i
>      return 0;
>  }
> 
> -static int __must_check allocate_domain_resources(struct domain_iommu
> *hd)
> -{
> -    int rc;
> -
> -    spin_lock(&hd->arch.mapping_lock);
> -    rc = amd_iommu_alloc_root(hd);
> -    spin_unlock(&hd->arch.mapping_lock);
> -
> -    return rc;
> -}
> -
>  int amd_iommu_get_paging_mode(unsigned long entries)
>  {
>      int level = 1;
> @@ -291,6 +302,9 @@ static void amd_iommu_disable_domain_dev
>      int req_id;
>      u8 bus = pdev->bus;
> 
> +    if ( QUARANTINE_SKIP(domain) )
> +        return;
> +
>      BUG_ON ( iommu->dev_table.buffer == NULL );
>      req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
>      table = iommu->dev_table.buffer;
> @@ -337,7 +351,6 @@ static int reassign_device(struct domain
>  {
>      struct amd_iommu *iommu;
>      int bdf, rc;
> -    struct domain_iommu *t = dom_iommu(target);
> 
>      bdf = PCI_BDF2(pdev->bus, pdev->devfn);
>      iommu = find_iommu_for_device(pdev->seg, bdf);
> @@ -358,11 +371,10 @@ static int reassign_device(struct domain
>          pdev->domain = target;
>      }
> 
> -    rc = allocate_domain_resources(t);
> +    rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
>      if ( rc )
>          return rc;
> 
> -    amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
>      AMD_IOMMU_DEBUG("Re-assign %04x:%02x:%02x.%u from dom%d to
> dom%d\n",
>                      pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
>                      source->domain_id, target->domain_id);
> @@ -519,8 +531,7 @@ static int amd_iommu_add_device(u8 devfn
>          spin_unlock_irqrestore(&iommu->lock, flags);
>      }
> 
> -    amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
> -    return 0;
> +    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn,
> pdev);
>  }
> 
>  static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -31,9 +31,24 @@ bool_t __initdata iommu_enable = 1;
>  bool_t __read_mostly iommu_enabled;
>  bool_t __read_mostly force_iommu;
>  bool_t __read_mostly iommu_verbose;
> -bool __read_mostly iommu_quarantine = true;
>  bool_t __read_mostly iommu_crash_disable;
> 
> +#define IOMMU_quarantine_none        0 /* aka false */
> +#define IOMMU_quarantine_fault       1 /* aka true */
> +#define IOMMU_quarantine_write_fault 2
> +#ifdef CONFIG_HAS_PCI
> +uint8_t __read_mostly iommu_quarantine =
> +# if defined(CONFIG_IOMMU_QUARANTINE_NONE)
> +    IOMMU_quarantine_none;
> +# elif defined(CONFIG_IOMMU_QUARANTINE_BASIC)
> +    IOMMU_quarantine_fault;
> +# elif defined(CONFIG_IOMMU_QUARANTINE_FULL)
> +    IOMMU_quarantine_write_fault;
> +# endif
> +#else
> +# define iommu_quarantine IOMMU_quarantine_none
> +#endif /* CONFIG_HAS_PCI */
> +
>  static bool __hwdom_initdata iommu_hwdom_none;
>  bool __hwdom_initdata iommu_hwdom_strict;
>  bool __read_mostly iommu_hwdom_passthrough;
> @@ -68,8 +83,12 @@ static int __init parse_iommu_param(cons
>          else if ( (val = parse_boolean("force", s, ss)) >= 0 ||
>                    (val = parse_boolean("required", s, ss)) >= 0 )
>              force_iommu = val;
> +#ifdef CONFIG_HAS_PCI
>          else if ( (val = parse_boolean("quarantine", s, ss)) >= 0 )
>              iommu_quarantine = val;
> +        else if ( ss == s + 15 && !strncmp(s, "quarantine=full", 15) )
> +            iommu_quarantine = IOMMU_quarantine_write_fault;
> +#endif
>  #ifdef CONFIG_X86
>          else if ( (val = parse_boolean("igfx", s, ss)) >= 0 )
>              iommu_igfx = val;
> @@ -448,7 +467,7 @@ static int __init iommu_quarantine_init(
>      dom_io->options |= XEN_DOMCTL_CDF_iommu;
> 
>      rc = iommu_domain_init(dom_io, 0);
> -    if ( rc )
> +    if ( rc || iommu_quarantine < IOMMU_quarantine_write_fault )
>          return rc;
> 
>      if ( !hd->platform_ops->quarantine_init )
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -41,6 +41,9 @@
>  #include "vtd.h"
>  #include "../ats.h"
> 
> +/* dom_io is used as a sentinel for quarantined devices */
> +#define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)-
> >arch.pgd_maddr)
> +
>  struct mapped_rmrr {
>      struct list_head list;
>      u64 base, end;
> @@ -1294,6 +1297,9 @@ int domain_context_mapping_one(
>      int agaw, rc, ret;
>      bool_t flush_dev_iotlb;
> 
> +    if ( QUARANTINE_SKIP(domain) )
> +        return 0;
> +
>      ASSERT(pcidevs_locked());
>      spin_lock(&iommu->lock);
>      maddr = bus_to_context_maddr(iommu, bus);
> @@ -1548,6 +1554,9 @@ int domain_context_unmap_one(
>      int iommu_domid, rc, ret;
>      bool_t flush_dev_iotlb;
> 
> +    if ( QUARANTINE_SKIP(domain) )
> +        return 0;
> +
>      ASSERT(pcidevs_locked());
>      spin_lock(&iommu->lock);
> 
> @@ -1609,7 +1618,7 @@ static int domain_context_unmap(struct d
>  {
>      struct acpi_drhd_unit *drhd;
>      struct vtd_iommu *iommu;
> -    int ret = 0;
> +    int ret;
>      u8 seg = pdev->seg, bus = pdev->bus, tmp_bus, tmp_devfn, secbus;
>      int found = 0;
> 
> @@ -1625,14 +1634,12 @@ static int domain_context_unmap(struct d
>              printk(VTDPREFIX "d%d:Hostbridge: skip %04x:%02x:%02x.%u
> unmap\n",
>                     domain->domain_id, seg, bus,
>                     PCI_SLOT(devfn), PCI_FUNC(devfn));
> -        if ( !is_hardware_domain(domain) )
> -            return -EPERM;
> -        goto out;
> +        return is_hardware_domain(domain) ? 0 : -EPERM;
> 
>      case DEV_TYPE_PCIe_BRIDGE:
>      case DEV_TYPE_PCIe2PCI_BRIDGE:
>      case DEV_TYPE_LEGACY_PCI_BRIDGE:
> -        goto out;
> +        return 0;
> 
>      case DEV_TYPE_PCIe_ENDPOINT:
>          if ( iommu_debug )
> @@ -1676,10 +1683,12 @@ static int domain_context_unmap(struct d
>          dprintk(XENLOG_ERR VTDPREFIX,
> "d%d:unknown(%u): %04x:%02x:%02x.%u\n",
>                  domain->domain_id, pdev->type,
>                  seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
> -        ret = -EINVAL;
> -        goto out;
> +        return -EINVAL;
>      }
> 
> +    if ( QUARANTINE_SKIP(domain) )
> +        return ret;
> +
>      /*
>       * if no other devices under the same iommu owned by this domain,
>       * clear iommu in iommu_bitmap and clear domain_id in domid_bitmp
> @@ -1705,16 +1714,12 @@ static int domain_context_unmap(struct d
> 
>          iommu_domid = domain_iommu_domid(domain, iommu);
>          if ( iommu_domid == -1 )
> -        {
> -            ret = -EINVAL;
> -            goto out;
> -        }
> +            return -EINVAL;
> 
>          clear_bit(iommu_domid, iommu->domid_bitmap);
>          iommu->domid_map[iommu_domid] = 0;
>      }
> 
> -out:
>      return ret;
>  }
> 
> --- a/xen/include/xen/iommu.h
> +++ b/xen/include/xen/iommu.h
> @@ -53,7 +53,9 @@ static inline bool_t dfn_eq(dfn_t x, dfn
>  }
> 
>  extern bool_t iommu_enable, iommu_enabled;
> -extern bool force_iommu, iommu_quarantine, iommu_verbose;
> +extern bool force_iommu, iommu_verbose;
> +/* Boolean except for the specific purposes of
> drivers/passthrough/iommu.c. */
> +extern uint8_t iommu_quarantine;
> 
>  #ifdef CONFIG_X86
>  extern enum __packed iommu_intremap {
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-09 11:09 [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional Jan Beulich
  2020-03-10  3:43 ` Tian, Kevin
@ 2020-03-10  5:30 ` Tian, Kevin
  2020-03-10 10:31   ` Jan Beulich
  2020-03-10  8:58 ` Paul Durrant
  2 siblings, 1 reply; 21+ messages in thread
From: Tian, Kevin @ 2020-03-10  5:30 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Andrew Cooper, Paul Durrant

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, March 9, 2020 7:09 PM
> 
> Containing still in flight DMA was introduced to work around certain
> devices / systems hanging hard upon hitting a "not-present" IOMMU fault.
> Passing through (such) devices (on such systems) is inherently insecure
> (as guests could easily arrange for IOMMU faults of any kind to occur).
> Defaulting to a mode where admins may not even become aware of issues
> with devices can be considered undesirable. Therefore convert this mode
> of operation to an optional one, not one enabled by default.

Here is another thought. The whole point of quarantine is to contain
the device after it is deassigned from untrusted guest. However, the
passthrough of such device is already insecure, as you mentioned. 
Then why do we care about making deassignment of such device
secure without doing anything to secure it when it is assigned and being
used by untrusted guest? I feel that one should simply put such device
out of the quarantine list in the first place, i.e. set quarantine=false and
then use tool to quarantine a whitelist of devices by skipping the bad one.

> 
> This involves resurrecting code commit ea38867831da ("x86 / iommu: set
> up a scratch page in the quarantine domain") did remove, in a slightly
> extended and abstracted fashion. Here, instead of reintroducing a pretty
> pointless use of "goto" in domain_context_unmap(), and instead of making
> the function (at least temporarily) inconsistent, take the opportunity
> and replace the other similarly pointless "goto" as well.
> 
> In order to key the re-instated bypasses off of there (not) being a root
> page table this further requires moving the allocate_domain_resources()
> invocation from reassign_device() to amd_iommu_setup_domain_device()
> (or
> else reassign_device() would allocate a root page table anyway); this is
> benign to the second caller of the latter function.
> 
> Take the opportunity and also limit the control to builds supporting
> PCI.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I'm happy to take better suggestions to replace the "full" command line
> option and Kconfig prompt tokens. I don't think though that "fault" and
> "write-fault" are really suitable there.
> ---
> This patch contextually depends on "[PATCH v2 0/5] IOMMU: restrict
> visibility/scope if certain variables".
> ---
> v3: IOMMU_quarantine_basic -> IOMMU_quarantine_fault,
>     IOMMU_quarantine_full -> IOMMU_quarantine_write_fault. Kconfig
>     option (choice) to select default. Limit to HAS_PCI.
> v2: Don't use true/false. Introduce QUARANTINE_SKIP() (albeit I'm not
>     really convinced this is an improvement). Add comment.
> 
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -1238,7 +1238,7 @@ detection of systems known to misbehave
>  > Default: `new` unless directed-EOI is supported
> 
>  ### iommu
> -    = List of [ <bool>, verbose, debug, force, required, quarantine,
> +    = List of [ <bool>, verbose, debug, force, required, quarantine[=full],
>                  sharept, intremap, intpost, crash-disable,
>                  snoop, qinval, igfx, amd-iommu-perdev-intremap,
>                  dom0-{passthrough,strict} ]
> @@ -1276,11 +1276,15 @@ boolean (e.g. `iommu=no`) can override t
>      will prevent Xen from booting if IOMMUs aren't discovered and enabled
>      successfully.
> 
> -*   The `quarantine` boolean can be used to control Xen's behavior when
> -    de-assigning devices from guests.  If enabled (the default), Xen always
> -    quarantines such devices; they must be explicitly assigned back to Dom0
> -    before they can be used there again.  If disabled, Xen will only
> -    quarantine devices the toolstack hass arranged for getting quarantined.
> +*   The `quarantine` option can be used to control Xen's behavior when
> +    de-assigning devices from guests.  If set to true (the default), Xen
> +    always quarantines such devices; they must be explicitly assigned back
> +    to Dom0 before they can be used there again.  If set to "full", still
> +    active DMA will additionally be directed to a "sink" page.  If set to
> +    false, Xen will only quarantine devices the toolstack has arranged for
> +    getting quarantined.
> +
> +    This option is only valid on builds supporting PCI.
> 
>  *   The `sharept` boolean controls whether the IOMMU pagetables are
> shared
>      with the CPU-side HAP pagetables, or allocated separately.  Sharing
> --- a/xen/drivers/passthrough/Kconfig
> +++ b/xen/drivers/passthrough/Kconfig
> @@ -28,3 +28,31 @@ endif
> 
>  config IOMMU_FORCE_PT_SHARE
>  	bool
> +
> +choice
> +	prompt "IOMMU device quarantining default behavior"
> +	depends on HAS_PCI
> +	default IOMMU_QUARANTINE_BASIC
> +	---help---
> +	  When a PCI device is assigned to an untrusted domain, it is possible
> +	  for that domain to program the device to DMA to an arbitrary
> address.
> +	  The IOMMU is used to protect the host from malicious DMA by
> making
> +	  sure that the device addresses can only target memory assigned to
> the
> +	  guest.  However, when the guest domain is torn down, assigning the
> +	  device back to the hardware domain would allow any in-flight DMA
> to
> +	  potentially target critical host data.  To avoid this, quarantining
> +	  shold be enabled.  Quarantining can be done in two ways: In its
> basic
> +	  form, all in-flight DMA will simply be forced to encounter IOMMU
> +	  faults.  Since there are systems where doing so can cause host
> +	  lockup, an alternative form is available where writes to memory will
> +	  be made fault, but reads will be directed to a dummy page.  The
> +	  implication here is that such reads will go unnoticed, i.e. an admin
> +	  may not become aware of the underlying problem.
> +
> +	config IOMMU_QUARANTINE_NONE
> +		bool "none"
> +	config IOMMU_QUARANTINE_BASIC
> +		bool "basic"
> +	config IOMMU_QUARANTINE_FULL
> +		bool "full"
> +endchoice
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -25,6 +25,9 @@
>  #include "iommu.h"
>  #include "../ats.h"
> 
> +/* dom_io is used as a sentinel for quarantined devices */
> +#define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)-
> >arch.root_table)
> +
>  static bool_t __read_mostly init_done;
> 
>  static const struct iommu_init_ops _iommu_init_ops;
> @@ -82,18 +85,35 @@ int get_dma_requestor_id(uint16_t seg, u
>      return req_id;
>  }
> 
> -static void amd_iommu_setup_domain_device(
> +static int __must_check allocate_domain_resources(struct domain_iommu
> *hd)
> +{
> +    int rc;
> +
> +    spin_lock(&hd->arch.mapping_lock);
> +    rc = amd_iommu_alloc_root(hd);
> +    spin_unlock(&hd->arch.mapping_lock);
> +
> +    return rc;
> +}
> +
> +static int __must_check amd_iommu_setup_domain_device(
>      struct domain *domain, struct amd_iommu *iommu,
>      uint8_t devfn, struct pci_dev *pdev)
>  {
>      struct amd_iommu_dte *table, *dte;
>      unsigned long flags;
> -    int req_id, valid = 1;
> +    int req_id, valid = 1, rc;
>      u8 bus = pdev->bus;
> -    const struct domain_iommu *hd = dom_iommu(domain);
> +    struct domain_iommu *hd = dom_iommu(domain);
> +
> +    if ( QUARANTINE_SKIP(domain) )
> +        return 0;
> +
> +    BUG_ON(!hd->arch.paging_mode || !iommu->dev_table.buffer);
> 
> -    BUG_ON( !hd->arch.root_table || !hd->arch.paging_mode ||
> -            !iommu->dev_table.buffer );
> +    rc = allocate_domain_resources(hd);
> +    if ( rc )
> +        return rc;
> 
>      if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
>          valid = 0;
> @@ -148,6 +168,8 @@ static void amd_iommu_setup_domain_devic
> 
>          amd_iommu_flush_iotlb(devfn, pdev,
> INV_IOMMU_ALL_PAGES_ADDRESS, 0);
>      }
> +
> +    return 0;
>  }
> 
>  int __init acpi_ivrs_init(void)
> @@ -217,17 +239,6 @@ int amd_iommu_alloc_root(struct domain_i
>      return 0;
>  }
> 
> -static int __must_check allocate_domain_resources(struct domain_iommu
> *hd)
> -{
> -    int rc;
> -
> -    spin_lock(&hd->arch.mapping_lock);
> -    rc = amd_iommu_alloc_root(hd);
> -    spin_unlock(&hd->arch.mapping_lock);
> -
> -    return rc;
> -}
> -
>  int amd_iommu_get_paging_mode(unsigned long entries)
>  {
>      int level = 1;
> @@ -291,6 +302,9 @@ static void amd_iommu_disable_domain_dev
>      int req_id;
>      u8 bus = pdev->bus;
> 
> +    if ( QUARANTINE_SKIP(domain) )
> +        return;
> +
>      BUG_ON ( iommu->dev_table.buffer == NULL );
>      req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
>      table = iommu->dev_table.buffer;
> @@ -337,7 +351,6 @@ static int reassign_device(struct domain
>  {
>      struct amd_iommu *iommu;
>      int bdf, rc;
> -    struct domain_iommu *t = dom_iommu(target);
> 
>      bdf = PCI_BDF2(pdev->bus, pdev->devfn);
>      iommu = find_iommu_for_device(pdev->seg, bdf);
> @@ -358,11 +371,10 @@ static int reassign_device(struct domain
>          pdev->domain = target;
>      }
> 
> -    rc = allocate_domain_resources(t);
> +    rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
>      if ( rc )
>          return rc;
> 
> -    amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
>      AMD_IOMMU_DEBUG("Re-assign %04x:%02x:%02x.%u from dom%d to
> dom%d\n",
>                      pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
>                      source->domain_id, target->domain_id);
> @@ -519,8 +531,7 @@ static int amd_iommu_add_device(u8 devfn
>          spin_unlock_irqrestore(&iommu->lock, flags);
>      }
> 
> -    amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
> -    return 0;
> +    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn,
> pdev);
>  }
> 
>  static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -31,9 +31,24 @@ bool_t __initdata iommu_enable = 1;
>  bool_t __read_mostly iommu_enabled;
>  bool_t __read_mostly force_iommu;
>  bool_t __read_mostly iommu_verbose;
> -bool __read_mostly iommu_quarantine = true;
>  bool_t __read_mostly iommu_crash_disable;
> 
> +#define IOMMU_quarantine_none        0 /* aka false */
> +#define IOMMU_quarantine_fault       1 /* aka true */
> +#define IOMMU_quarantine_write_fault 2
> +#ifdef CONFIG_HAS_PCI
> +uint8_t __read_mostly iommu_quarantine =
> +# if defined(CONFIG_IOMMU_QUARANTINE_NONE)
> +    IOMMU_quarantine_none;
> +# elif defined(CONFIG_IOMMU_QUARANTINE_BASIC)
> +    IOMMU_quarantine_fault;
> +# elif defined(CONFIG_IOMMU_QUARANTINE_FULL)
> +    IOMMU_quarantine_write_fault;
> +# endif
> +#else
> +# define iommu_quarantine IOMMU_quarantine_none
> +#endif /* CONFIG_HAS_PCI */
> +
>  static bool __hwdom_initdata iommu_hwdom_none;
>  bool __hwdom_initdata iommu_hwdom_strict;
>  bool __read_mostly iommu_hwdom_passthrough;
> @@ -68,8 +83,12 @@ static int __init parse_iommu_param(cons
>          else if ( (val = parse_boolean("force", s, ss)) >= 0 ||
>                    (val = parse_boolean("required", s, ss)) >= 0 )
>              force_iommu = val;
> +#ifdef CONFIG_HAS_PCI
>          else if ( (val = parse_boolean("quarantine", s, ss)) >= 0 )
>              iommu_quarantine = val;
> +        else if ( ss == s + 15 && !strncmp(s, "quarantine=full", 15) )
> +            iommu_quarantine = IOMMU_quarantine_write_fault;
> +#endif
>  #ifdef CONFIG_X86
>          else if ( (val = parse_boolean("igfx", s, ss)) >= 0 )
>              iommu_igfx = val;
> @@ -448,7 +467,7 @@ static int __init iommu_quarantine_init(
>      dom_io->options |= XEN_DOMCTL_CDF_iommu;
> 
>      rc = iommu_domain_init(dom_io, 0);
> -    if ( rc )
> +    if ( rc || iommu_quarantine < IOMMU_quarantine_write_fault )
>          return rc;
> 
>      if ( !hd->platform_ops->quarantine_init )
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -41,6 +41,9 @@
>  #include "vtd.h"
>  #include "../ats.h"
> 
> +/* dom_io is used as a sentinel for quarantined devices */
> +#define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)-
> >arch.pgd_maddr)
> +
>  struct mapped_rmrr {
>      struct list_head list;
>      u64 base, end;
> @@ -1294,6 +1297,9 @@ int domain_context_mapping_one(
>      int agaw, rc, ret;
>      bool_t flush_dev_iotlb;
> 
> +    if ( QUARANTINE_SKIP(domain) )
> +        return 0;
> +
>      ASSERT(pcidevs_locked());
>      spin_lock(&iommu->lock);
>      maddr = bus_to_context_maddr(iommu, bus);
> @@ -1548,6 +1554,9 @@ int domain_context_unmap_one(
>      int iommu_domid, rc, ret;
>      bool_t flush_dev_iotlb;
> 
> +    if ( QUARANTINE_SKIP(domain) )
> +        return 0;
> +
>      ASSERT(pcidevs_locked());
>      spin_lock(&iommu->lock);
> 
> @@ -1609,7 +1618,7 @@ static int domain_context_unmap(struct d
>  {
>      struct acpi_drhd_unit *drhd;
>      struct vtd_iommu *iommu;
> -    int ret = 0;
> +    int ret;
>      u8 seg = pdev->seg, bus = pdev->bus, tmp_bus, tmp_devfn, secbus;
>      int found = 0;
> 
> @@ -1625,14 +1634,12 @@ static int domain_context_unmap(struct d
>              printk(VTDPREFIX "d%d:Hostbridge: skip %04x:%02x:%02x.%u
> unmap\n",
>                     domain->domain_id, seg, bus,
>                     PCI_SLOT(devfn), PCI_FUNC(devfn));
> -        if ( !is_hardware_domain(domain) )
> -            return -EPERM;
> -        goto out;
> +        return is_hardware_domain(domain) ? 0 : -EPERM;
> 
>      case DEV_TYPE_PCIe_BRIDGE:
>      case DEV_TYPE_PCIe2PCI_BRIDGE:
>      case DEV_TYPE_LEGACY_PCI_BRIDGE:
> -        goto out;
> +        return 0;
> 
>      case DEV_TYPE_PCIe_ENDPOINT:
>          if ( iommu_debug )
> @@ -1676,10 +1683,12 @@ static int domain_context_unmap(struct d
>          dprintk(XENLOG_ERR VTDPREFIX,
> "d%d:unknown(%u): %04x:%02x:%02x.%u\n",
>                  domain->domain_id, pdev->type,
>                  seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
> -        ret = -EINVAL;
> -        goto out;
> +        return -EINVAL;
>      }
> 
> +    if ( QUARANTINE_SKIP(domain) )
> +        return ret;
> +
>      /*
>       * if no other devices under the same iommu owned by this domain,
>       * clear iommu in iommu_bitmap and clear domain_id in domid_bitmp
> @@ -1705,16 +1714,12 @@ static int domain_context_unmap(struct d
> 
>          iommu_domid = domain_iommu_domid(domain, iommu);
>          if ( iommu_domid == -1 )
> -        {
> -            ret = -EINVAL;
> -            goto out;
> -        }
> +            return -EINVAL;
> 
>          clear_bit(iommu_domid, iommu->domid_bitmap);
>          iommu->domid_map[iommu_domid] = 0;
>      }
> 
> -out:
>      return ret;
>  }
> 
> --- a/xen/include/xen/iommu.h
> +++ b/xen/include/xen/iommu.h
> @@ -53,7 +53,9 @@ static inline bool_t dfn_eq(dfn_t x, dfn
>  }
> 
>  extern bool_t iommu_enable, iommu_enabled;
> -extern bool force_iommu, iommu_quarantine, iommu_verbose;
> +extern bool force_iommu, iommu_verbose;
> +/* Boolean except for the specific purposes of
> drivers/passthrough/iommu.c. */
> +extern uint8_t iommu_quarantine;
> 
>  #ifdef CONFIG_X86
>  extern enum __packed iommu_intremap {
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-09 11:09 [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional Jan Beulich
  2020-03-10  3:43 ` Tian, Kevin
  2020-03-10  5:30 ` Tian, Kevin
@ 2020-03-10  8:58 ` Paul Durrant
  2020-03-10 10:10   ` Jan Beulich
  2 siblings, 1 reply; 21+ messages in thread
From: Paul Durrant @ 2020-03-10  8:58 UTC (permalink / raw)
  To: 'Jan Beulich', xen-devel
  Cc: 'Andrew Cooper', 'Kevin Tian'

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 09 March 2020 11:09
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; Paul Durrant <paul@xen.org>; Kevin Tian
> <kevin.tian@intel.com>
> Subject: [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
> 
> Containing still in flight DMA was introduced to work around certain
> devices / systems hanging hard upon hitting a "not-present" IOMMU fault.
> Passing through (such) devices (on such systems) is inherently insecure
> (as guests could easily arrange for IOMMU faults of any kind to occur).
> Defaulting to a mode where admins may not even become aware of issues
> with devices can be considered undesirable. Therefore convert this mode
> of operation to an optional one, not one enabled by default.
> 
> This involves resurrecting code commit ea38867831da ("x86 / iommu: set
> up a scratch page in the quarantine domain") did remove, in a slightly
> extended and abstracted fashion. Here, instead of reintroducing a pretty
> pointless use of "goto" in domain_context_unmap(), and instead of making
> the function (at least temporarily) inconsistent, take the opportunity
> and replace the other similarly pointless "goto" as well.
> 
> In order to key the re-instated bypasses off of there (not) being a root
> page table this further requires moving the allocate_domain_resources()
> invocation from reassign_device() to amd_iommu_setup_domain_device() (or
> else reassign_device() would allocate a root page table anyway); this is
> benign to the second caller of the latter function.
> 
> Take the opportunity and also limit the control to builds supporting
> PCI.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I'm happy to take better suggestions to replace the "full" command line
> option and Kconfig prompt tokens. I don't think though that "fault" and
> "write-fault" are really suitable there.
> ---
> This patch contextually depends on "[PATCH v2 0/5] IOMMU: restrict
> visibility/scope if certain variables".
> ---
> v3: IOMMU_quarantine_basic -> IOMMU_quarantine_fault,
>     IOMMU_quarantine_full -> IOMMU_quarantine_write_fault. Kconfig
>     option (choice) to select default. Limit to HAS_PCI.
> v2: Don't use true/false. Introduce QUARANTINE_SKIP() (albeit I'm not
>     really convinced this is an improvement). Add comment.
> 
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -1238,7 +1238,7 @@ detection of systems known to misbehave
>  > Default: `new` unless directed-EOI is supported
> 
>  ### iommu
> -    = List of [ <bool>, verbose, debug, force, required, quarantine,
> +    = List of [ <bool>, verbose, debug, force, required, quarantine[=full],
>                  sharept, intremap, intpost, crash-disable,
>                  snoop, qinval, igfx, amd-iommu-perdev-intremap,
>                  dom0-{passthrough,strict} ]
> @@ -1276,11 +1276,15 @@ boolean (e.g. `iommu=no`) can override t
>      will prevent Xen from booting if IOMMUs aren't discovered and enabled
>      successfully.
> 
> -*   The `quarantine` boolean can be used to control Xen's behavior when
> -    de-assigning devices from guests.  If enabled (the default), Xen always
> -    quarantines such devices; they must be explicitly assigned back to Dom0
> -    before they can be used there again.  If disabled, Xen will only
> -    quarantine devices the toolstack hass arranged for getting quarantined.
> +*   The `quarantine` option can be used to control Xen's behavior when
> +    de-assigning devices from guests.  If set to true (the default), Xen
> +    always quarantines such devices; they must be explicitly assigned back
> +    to Dom0 before they can be used there again.  If set to "full", still
> +    active DMA will additionally be directed to a "sink" page.

I realise this is only in the diff context, but I'm not sure what the following sentence actually means:

>  If set to
> +    false, Xen will only quarantine devices the toolstack has arranged for
> +    getting quarantined.
> +

Sounds tautological to me.

> +    This option is only valid on builds supporting PCI.
> 
>  *   The `sharept` boolean controls whether the IOMMU pagetables are shared
>      with the CPU-side HAP pagetables, or allocated separately.  Sharing
> --- a/xen/drivers/passthrough/Kconfig
> +++ b/xen/drivers/passthrough/Kconfig
> @@ -28,3 +28,31 @@ endif
> 
>  config IOMMU_FORCE_PT_SHARE
>  	bool
> +
> +choice
> +	prompt "IOMMU device quarantining default behavior"
> +	depends on HAS_PCI
> +	default IOMMU_QUARANTINE_BASIC
> +	---help---
> +	  When a PCI device is assigned to an untrusted domain, it is possible
> +	  for that domain to program the device to DMA to an arbitrary address.
> +	  The IOMMU is used to protect the host from malicious DMA by making
> +	  sure that the device addresses can only target memory assigned to the
> +	  guest.  However, when the guest domain is torn down, assigning the
> +	  device back to the hardware domain would allow any in-flight DMA to
> +	  potentially target critical host data.  To avoid this, quarantining
> +	  shold be enabled.

IMO the above text is a good summary and it would be useful it were duplicated in the command line documentation.

>  Quarantining can be done in two ways: In its basic
> +	  form, all in-flight DMA will simply be forced to encounter IOMMU
> +	  faults.  Since there are systems where doing so can cause host
> +	  lockup, an alternative form is available where writes to memory will
> +	  be made fault, but reads will be directed to a dummy page.  The
> +	  implication here is that such reads will go unnoticed, i.e. an admin
> +	  may not become aware of the underlying problem.
> +
> +	config IOMMU_QUARANTINE_NONE
> +		bool "none"
> +	config IOMMU_QUARANTINE_BASIC
> +		bool "basic"
> +	config IOMMU_QUARANTINE_FULL
> +		bool "full"

'scratch_page' perhaps? Seems a bit more self-explanatory.

> +endchoice
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -25,6 +25,9 @@
>  #include "iommu.h"
>  #include "../ats.h"
> 
> +/* dom_io is used as a sentinel for quarantined devices */
> +#define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.root_table)
> +
>  static bool_t __read_mostly init_done;
> 
>  static const struct iommu_init_ops _iommu_init_ops;
> @@ -82,18 +85,35 @@ int get_dma_requestor_id(uint16_t seg, u
>      return req_id;
>  }
> 
> -static void amd_iommu_setup_domain_device(
> +static int __must_check allocate_domain_resources(struct domain_iommu *hd)
> +{
> +    int rc;
> +
> +    spin_lock(&hd->arch.mapping_lock);
> +    rc = amd_iommu_alloc_root(hd);
> +    spin_unlock(&hd->arch.mapping_lock);
> +
> +    return rc;
> +}
> +
> +static int __must_check amd_iommu_setup_domain_device(
>      struct domain *domain, struct amd_iommu *iommu,
>      uint8_t devfn, struct pci_dev *pdev)
>  {
>      struct amd_iommu_dte *table, *dte;
>      unsigned long flags;
> -    int req_id, valid = 1;
> +    int req_id, valid = 1, rc;
>      u8 bus = pdev->bus;
> -    const struct domain_iommu *hd = dom_iommu(domain);
> +    struct domain_iommu *hd = dom_iommu(domain);
> +
> +    if ( QUARANTINE_SKIP(domain) )
> +        return 0;
> +
> +    BUG_ON(!hd->arch.paging_mode || !iommu->dev_table.buffer);
> 
> -    BUG_ON( !hd->arch.root_table || !hd->arch.paging_mode ||
> -            !iommu->dev_table.buffer );
> +    rc = allocate_domain_resources(hd);
> +    if ( rc )
> +        return rc;
> 
>      if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
>          valid = 0;
> @@ -148,6 +168,8 @@ static void amd_iommu_setup_domain_devic
> 
>          amd_iommu_flush_iotlb(devfn, pdev, INV_IOMMU_ALL_PAGES_ADDRESS, 0);
>      }
> +
> +    return 0;
>  }
> 
>  int __init acpi_ivrs_init(void)
> @@ -217,17 +239,6 @@ int amd_iommu_alloc_root(struct domain_i
>      return 0;
>  }
> 
> -static int __must_check allocate_domain_resources(struct domain_iommu *hd)
> -{
> -    int rc;
> -
> -    spin_lock(&hd->arch.mapping_lock);
> -    rc = amd_iommu_alloc_root(hd);
> -    spin_unlock(&hd->arch.mapping_lock);
> -
> -    return rc;
> -}
> -
>  int amd_iommu_get_paging_mode(unsigned long entries)
>  {
>      int level = 1;
> @@ -291,6 +302,9 @@ static void amd_iommu_disable_domain_dev
>      int req_id;
>      u8 bus = pdev->bus;
> 
> +    if ( QUARANTINE_SKIP(domain) )
> +        return;
> +
>      BUG_ON ( iommu->dev_table.buffer == NULL );
>      req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
>      table = iommu->dev_table.buffer;
> @@ -337,7 +351,6 @@ static int reassign_device(struct domain
>  {
>      struct amd_iommu *iommu;
>      int bdf, rc;
> -    struct domain_iommu *t = dom_iommu(target);
> 
>      bdf = PCI_BDF2(pdev->bus, pdev->devfn);
>      iommu = find_iommu_for_device(pdev->seg, bdf);
> @@ -358,11 +371,10 @@ static int reassign_device(struct domain
>          pdev->domain = target;
>      }
> 
> -    rc = allocate_domain_resources(t);
> +    rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
>      if ( rc )
>          return rc;
> 
> -    amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
>      AMD_IOMMU_DEBUG("Re-assign %04x:%02x:%02x.%u from dom%d to dom%d\n",
>                      pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
>                      source->domain_id, target->domain_id);
> @@ -519,8 +531,7 @@ static int amd_iommu_add_device(u8 devfn
>          spin_unlock_irqrestore(&iommu->lock, flags);
>      }
> 
> -    amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
> -    return 0;
> +    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
>  }
> 
>  static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -31,9 +31,24 @@ bool_t __initdata iommu_enable = 1;
>  bool_t __read_mostly iommu_enabled;
>  bool_t __read_mostly force_iommu;
>  bool_t __read_mostly iommu_verbose;
> -bool __read_mostly iommu_quarantine = true;
>  bool_t __read_mostly iommu_crash_disable;
> 
> +#define IOMMU_quarantine_none        0 /* aka false */
> +#define IOMMU_quarantine_fault       1 /* aka true */
> +#define IOMMU_quarantine_write_fault 2

enum for the above perhaps?

  Paul



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-10  8:58 ` Paul Durrant
@ 2020-03-10 10:10   ` Jan Beulich
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Beulich @ 2020-03-10 10:10 UTC (permalink / raw)
  To: paul; +Cc: xen-devel, 'Kevin Tian', 'Andrew Cooper'

On 10.03.2020 09:58, Paul Durrant wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 09 March 2020 11:09
>>
>> @@ -1276,11 +1276,15 @@ boolean (e.g. `iommu=no`) can override t
>>      will prevent Xen from booting if IOMMUs aren't discovered and enabled
>>      successfully.
>>
>> -*   The `quarantine` boolean can be used to control Xen's behavior when
>> -    de-assigning devices from guests.  If enabled (the default), Xen always
>> -    quarantines such devices; they must be explicitly assigned back to Dom0
>> -    before they can be used there again.  If disabled, Xen will only
>> -    quarantine devices the toolstack hass arranged for getting quarantined.
>> +*   The `quarantine` option can be used to control Xen's behavior when
>> +    de-assigning devices from guests.  If set to true (the default), Xen
>> +    always quarantines such devices; they must be explicitly assigned back
>> +    to Dom0 before they can be used there again.  If set to "full", still
>> +    active DMA will additionally be directed to a "sink" page.
> 
> I realise this is only in the diff context, but I'm not sure what the following sentence actually means:
> 
>>  If set to
>> +    false, Xen will only quarantine devices the toolstack has arranged for
>> +    getting quarantined.
>> +
> 
> Sounds tautological to me.

Not to me - "false" could also mean no quarantining at all (i.e. no
tool stack control).

>> --- a/xen/drivers/passthrough/Kconfig
>> +++ b/xen/drivers/passthrough/Kconfig
>> @@ -28,3 +28,31 @@ endif
>>
>>  config IOMMU_FORCE_PT_SHARE
>>  	bool
>> +
>> +choice
>> +	prompt "IOMMU device quarantining default behavior"
>> +	depends on HAS_PCI
>> +	default IOMMU_QUARANTINE_BASIC
>> +	---help---
>> +	  When a PCI device is assigned to an untrusted domain, it is possible
>> +	  for that domain to program the device to DMA to an arbitrary address.
>> +	  The IOMMU is used to protect the host from malicious DMA by making
>> +	  sure that the device addresses can only target memory assigned to the
>> +	  guest.  However, when the guest domain is torn down, assigning the
>> +	  device back to the hardware domain would allow any in-flight DMA to
>> +	  potentially target critical host data.  To avoid this, quarantining
>> +	  shold be enabled.
> 
> IMO the above text is a good summary and it would be useful it were
> duplicated in the command line documentation.

Can do.

>>  Quarantining can be done in two ways: In its basic
>> +	  form, all in-flight DMA will simply be forced to encounter IOMMU
>> +	  faults.  Since there are systems where doing so can cause host
>> +	  lockup, an alternative form is available where writes to memory will
>> +	  be made fault, but reads will be directed to a dummy page.  The
>> +	  implication here is that such reads will go unnoticed, i.e. an admin
>> +	  may not become aware of the underlying problem.
>> +
>> +	config IOMMU_QUARANTINE_NONE
>> +		bool "none"
>> +	config IOMMU_QUARANTINE_BASIC
>> +		bool "basic"
>> +	config IOMMU_QUARANTINE_FULL
>> +		bool "full"
> 
> 'scratch_page' perhaps? Seems a bit more self-explanatory.

But it still wouldn't carry the "reads only" aspect. I'll reply to
Kevin later, where this aspect will be of more interest. But yes,
I'll think some more about perhaps using that.

>> --- a/xen/drivers/passthrough/iommu.c
>> +++ b/xen/drivers/passthrough/iommu.c
>> @@ -31,9 +31,24 @@ bool_t __initdata iommu_enable = 1;
>>  bool_t __read_mostly iommu_enabled;
>>  bool_t __read_mostly force_iommu;
>>  bool_t __read_mostly iommu_verbose;
>> -bool __read_mostly iommu_quarantine = true;
>>  bool_t __read_mostly iommu_crash_disable;
>>
>> +#define IOMMU_quarantine_none        0 /* aka false */
>> +#define IOMMU_quarantine_fault       1 /* aka true */
>> +#define IOMMU_quarantine_write_fault 2
> 
> enum for the above perhaps?

Well, I don't want the variable to become wider than a byte. And I
couldn't really settle between the ugliness of not using an enum
and the ugliness of attaching __packed to it. If you have a clear
preference for a packed enum, I'll switch - just let me know.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-10  3:43 ` Tian, Kevin
@ 2020-03-10 10:26   ` Jan Beulich
  2020-03-10 12:30     ` Paul Durrant
  2020-03-13  2:28     ` Tian, Kevin
  0 siblings, 2 replies; 21+ messages in thread
From: Jan Beulich @ 2020-03-10 10:26 UTC (permalink / raw)
  To: Tian, Kevin, Paul Durrant; +Cc: xen-devel, Andrew Cooper

On 10.03.2020 04:43, Tian, Kevin wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Monday, March 9, 2020 7:09 PM
>>
>> I'm happy to take better suggestions to replace the "full" command line
>> option and Kconfig prompt tokens. I don't think though that "fault" and
>> "write-fault" are really suitable there.
> 
> I think we may just allow both r/w access to scratch page for such bogus
> device, which may make 'full' more reasonable since we now fully
> contain in-fly DMAs. I'm not sure about the value of keeping write-fault
> alone for such devices (just because one observed his specific device only 
> has problem with read-fault).

Well, a fundamental problem I have here is that I still don't know
the _exact_ conditions for the observed hangs. I consider it unlikely
for IOMMU read faults to cause hangs, but for write faults to be
"fine". It would seem more likely to me that e.g. a non-present
context entry might cause issues. If that was the case, we wouldn't
need to handle reads and writes differently; we could instead install
an all zero top level page table. And we'd still get all faults that
are supposed to surface. But perhaps Paul did try this back then, and
it turned out to not be an option.

The choice of letting writes continue to fault was based on (a) this
having been tested to work on the affected system(s) and (b) also
letting writes go to a scratch page requiring a per-device scratch
page (and associated page tables) rather than a system-wide one, as
devices coming from different domains would otherwise be able to
observe data written to memory by respectively "foreign" devices
(and hence domains).

But this is all guesswork without the firmware writers of affected
systems giving us at least some hints.

> alternatively I also thought about whether whitelisting the problematic 
> devices through another option (e.g. nofault=b:d:f) could provide more
> value. In concept any IOMMU page table (dom0, dom_io or domU) 
> for such bogus device should not include invalid entry, even when 
> quarantine is not specified. However I'm not sure whether it's worthy of 
> going so far...

Indeed. Question though is whether this bad behavior is device specific
(rather than e.g. system dependent). Plus - as per above - question
also is whether it's really leaf (or intermediate) page table entry
presence which actually matters here. If it was, I agree we shouldn't
have any non-present entries anywhere in the page table trees.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-10  5:30 ` Tian, Kevin
@ 2020-03-10 10:31   ` Jan Beulich
  2020-03-13  3:05     ` Tian, Kevin
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2020-03-10 10:31 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: xen-devel, Paul Durrant, Andrew Cooper

On 10.03.2020 06:30, Tian, Kevin wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Monday, March 9, 2020 7:09 PM
>>
>> Containing still in flight DMA was introduced to work around certain
>> devices / systems hanging hard upon hitting a "not-present" IOMMU fault.
>> Passing through (such) devices (on such systems) is inherently insecure
>> (as guests could easily arrange for IOMMU faults of any kind to occur).
>> Defaulting to a mode where admins may not even become aware of issues
>> with devices can be considered undesirable. Therefore convert this mode
>> of operation to an optional one, not one enabled by default.
> 
> Here is another thought. The whole point of quarantine is to contain
> the device after it is deassigned from untrusted guest.

I'd question the "untrusted" here. Assigning devices to untrusted
guests is problematic anyway, unless you're the device manufacturer
and device firmware writer, and hence you can guarantee the device
to not offer any backdoors or alike. Therefore I view quarantining
more as a protection of the host against bad device behavior, and
less against malicious guest behavior (while the driver in the
guest surely has some influence, consider the guest getting crashed
and even a well-behaved driver hence not getting any chance to
silence the device).

Jan

> However, the
> passthrough of such device is already insecure, as you mentioned. 
> Then why do we care about making deassignment of such device
> secure without doing anything to secure it when it is assigned and being
> used by untrusted guest? I feel that one should simply put such device
> out of the quarantine list in the first place, i.e. set quarantine=false and
> then use tool to quarantine a whitelist of devices by skipping the bad one.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-10 10:26   ` Jan Beulich
@ 2020-03-10 12:30     ` Paul Durrant
  2020-03-10 15:04       ` Jan Beulich
  2020-03-13  2:28     ` Tian, Kevin
  1 sibling, 1 reply; 21+ messages in thread
From: Paul Durrant @ 2020-03-10 12:30 UTC (permalink / raw)
  To: 'Jan Beulich', 'Tian, Kevin'
  Cc: xen-devel, 'Andrew Cooper'

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 10 March 2020 10:27
> To: Tian, Kevin <kevin.tian@intel.com>; Paul Durrant <paul@xen.org>
> Cc: xen-devel@lists.xenproject.org; Andrew Cooper <andrew.cooper3@citrix.com>
> Subject: Re: [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
> 
> On 10.03.2020 04:43, Tian, Kevin wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Monday, March 9, 2020 7:09 PM
> >>
> >> I'm happy to take better suggestions to replace the "full" command line
> >> option and Kconfig prompt tokens. I don't think though that "fault" and
> >> "write-fault" are really suitable there.
> >
> > I think we may just allow both r/w access to scratch page for such bogus
> > device, which may make 'full' more reasonable since we now fully
> > contain in-fly DMAs. I'm not sure about the value of keeping write-fault
> > alone for such devices (just because one observed his specific device only
> > has problem with read-fault).
> 
> Well, a fundamental problem I have here is that I still don't know
> the _exact_ conditions for the observed hangs. I consider it unlikely
> for IOMMU read faults to cause hangs, but for write faults to be
> "fine".

AFAIK it's because the writes are posted and so any faults are just ignored, whereas a read fault being synchronous causes the device's state machine to lock up. It really is observed behaviour.

> It would seem more likely to me that e.g. a non-present
> context entry might cause issues. If that was the case, we wouldn't
> need to handle reads and writes differently; we could instead install
> an all zero top level page table. And we'd still get all faults that
> are supposed to surface. But perhaps Paul did try this back then, and
> it turned out to not be an option.
> 

The only info I had was that faults on DMA reads had to avoided completely. I did not have access to the h/w in question at the time. I may be able to get it now.

  Paul

> The choice of letting writes continue to fault was based on (a) this
> having been tested to work on the affected system(s) and (b) also
> letting writes go to a scratch page requiring a per-device scratch
> page (and associated page tables) rather than a system-wide one, as
> devices coming from different domains would otherwise be able to
> observe data written to memory by respectively "foreign" devices
> (and hence domains).
> 
> But this is all guesswork without the firmware writers of affected
> systems giving us at least some hints.
> 
> > alternatively I also thought about whether whitelisting the problematic
> > devices through another option (e.g. nofault=b:d:f) could provide more
> > value. In concept any IOMMU page table (dom0, dom_io or domU)
> > for such bogus device should not include invalid entry, even when
> > quarantine is not specified. However I'm not sure whether it's worthy of
> > going so far...
> 
> Indeed. Question though is whether this bad behavior is device specific
> (rather than e.g. system dependent). Plus - as per above - question
> also is whether it's really leaf (or intermediate) page table entry
> presence which actually matters here. If it was, I agree we shouldn't
> have any non-present entries anywhere in the page table trees.
> 
> Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-10 12:30     ` Paul Durrant
@ 2020-03-10 15:04       ` Jan Beulich
  2020-03-10 15:13         ` Paul Durrant
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2020-03-10 15:04 UTC (permalink / raw)
  To: paul; +Cc: xen-devel, 'Tian, Kevin', 'Andrew Cooper'

On 10.03.2020 13:30, Paul Durrant wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 10 March 2020 10:27
>> To: Tian, Kevin <kevin.tian@intel.com>; Paul Durrant <paul@xen.org>
>> Cc: xen-devel@lists.xenproject.org; Andrew Cooper <andrew.cooper3@citrix.com>
>> Subject: Re: [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
>>
>> On 10.03.2020 04:43, Tian, Kevin wrote:
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Monday, March 9, 2020 7:09 PM
>>>>
>>>> I'm happy to take better suggestions to replace the "full" command line
>>>> option and Kconfig prompt tokens. I don't think though that "fault" and
>>>> "write-fault" are really suitable there.
>>>
>>> I think we may just allow both r/w access to scratch page for such bogus
>>> device, which may make 'full' more reasonable since we now fully
>>> contain in-fly DMAs. I'm not sure about the value of keeping write-fault
>>> alone for such devices (just because one observed his specific device only
>>> has problem with read-fault).
>>
>> Well, a fundamental problem I have here is that I still don't know
>> the _exact_ conditions for the observed hangs. I consider it unlikely
>> for IOMMU read faults to cause hangs, but for write faults to be
>> "fine".
> 
> AFAIK it's because the writes are posted and so any faults are just ignored, whereas a read fault being synchronous causes the device's state machine to lock up. It really is observed behaviour.
> 
>> It would seem more likely to me that e.g. a non-present
>> context entry might cause issues. If that was the case, we wouldn't
>> need to handle reads and writes differently; we could instead install
>> an all zero top level page table. And we'd still get all faults that
>> are supposed to surface. But perhaps Paul did try this back then, and
>> it turned out to not be an option.
>>
> 
> The only info I had was that faults on DMA reads had to avoided
> completely. I did not have access to the h/w in question at the
> time. I may be able to get it now.

I see. The implication then is, as Kevin said, that we mustn't run
guests with _any_ IOMMU PTEs having their "read" bits clear.
Anything that's "not present" now would need directing to a scratch
page. I then further wonder what effect reads to addresses beyond
AGAW would have. It may be impossible to arrange for sufficiently
secure pass-through with such a device, at which point - again as
said by Kevin - there may be little point in the scratch page
based quarantining.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-10 15:04       ` Jan Beulich
@ 2020-03-10 15:13         ` Paul Durrant
  2020-03-10 15:44           ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Durrant @ 2020-03-10 15:13 UTC (permalink / raw)
  To: 'Jan Beulich'
  Cc: xen-devel, 'Tian, Kevin', 'Andrew Cooper'

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 10 March 2020 15:05
> To: paul@xen.org
> Cc: 'Tian, Kevin' <kevin.tian@intel.com>; xen-devel@lists.xenproject.org; 'Andrew Cooper'
> <andrew.cooper3@citrix.com>
> Subject: Re: [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
> 
> On 10.03.2020 13:30, Paul Durrant wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: 10 March 2020 10:27
> >> To: Tian, Kevin <kevin.tian@intel.com>; Paul Durrant <paul@xen.org>
> >> Cc: xen-devel@lists.xenproject.org; Andrew Cooper <andrew.cooper3@citrix.com>
> >> Subject: Re: [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
> >>
> >> On 10.03.2020 04:43, Tian, Kevin wrote:
> >>>> From: Jan Beulich <jbeulich@suse.com>
> >>>> Sent: Monday, March 9, 2020 7:09 PM
> >>>>
> >>>> I'm happy to take better suggestions to replace the "full" command line
> >>>> option and Kconfig prompt tokens. I don't think though that "fault" and
> >>>> "write-fault" are really suitable there.
> >>>
> >>> I think we may just allow both r/w access to scratch page for such bogus
> >>> device, which may make 'full' more reasonable since we now fully
> >>> contain in-fly DMAs. I'm not sure about the value of keeping write-fault
> >>> alone for such devices (just because one observed his specific device only
> >>> has problem with read-fault).
> >>
> >> Well, a fundamental problem I have here is that I still don't know
> >> the _exact_ conditions for the observed hangs. I consider it unlikely
> >> for IOMMU read faults to cause hangs, but for write faults to be
> >> "fine".
> >
> > AFAIK it's because the writes are posted and so any faults are just ignored, whereas a read fault
> being synchronous causes the device's state machine to lock up. It really is observed behaviour.
> >
> >> It would seem more likely to me that e.g. a non-present
> >> context entry might cause issues. If that was the case, we wouldn't
> >> need to handle reads and writes differently; we could instead install
> >> an all zero top level page table. And we'd still get all faults that
> >> are supposed to surface. But perhaps Paul did try this back then, and
> >> it turned out to not be an option.
> >>
> >
> > The only info I had was that faults on DMA reads had to avoided
> > completely. I did not have access to the h/w in question at the
> > time. I may be able to get it now.
> 
> I see. The implication then is, as Kevin said, that we mustn't run
> guests with _any_ IOMMU PTEs having their "read" bits clear.
> Anything that's "not present" now would need directing to a scratch
> page. I then further wonder what effect reads to addresses beyond
> AGAW would have. It may be impossible to arrange for sufficiently
> secure pass-through with such a device, at which point - again as
> said by Kevin - there may be little point in the scratch page
> based quarantining.
> 

Well, I can't say there's little point in it as it does fix a host lock-up.

You say "as Kevin said, that we mustn't run guests with _any_ IOMMU PTEs having their "read" bits clear"... I can't find that. I did find where he said "In concept any IOMMU page table (dom0, dom_io or domU) for such bogus device should not include invalid entry", but that's a different thing. However, is a really saying that things will break if any of the PTEs has their present bit clear?

  Paul



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-10 15:13         ` Paul Durrant
@ 2020-03-10 15:44           ` Jan Beulich
  2020-03-10 16:05             ` Paul Durrant
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2020-03-10 15:44 UTC (permalink / raw)
  To: paul; +Cc: xen-devel, 'Tian, Kevin', 'Andrew Cooper'

On 10.03.2020 16:13, Paul Durrant wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 10 March 2020 15:05
>> To: paul@xen.org
>> Cc: 'Tian, Kevin' <kevin.tian@intel.com>; xen-devel@lists.xenproject.org; 'Andrew Cooper'
>> <andrew.cooper3@citrix.com>
>> Subject: Re: [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
>>
>> On 10.03.2020 13:30, Paul Durrant wrote:
>>>> -----Original Message-----
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: 10 March 2020 10:27
>>>> To: Tian, Kevin <kevin.tian@intel.com>; Paul Durrant <paul@xen.org>
>>>> Cc: xen-devel@lists.xenproject.org; Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Subject: Re: [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
>>>>
>>>> On 10.03.2020 04:43, Tian, Kevin wrote:
>>>>>> From: Jan Beulich <jbeulich@suse.com>
>>>>>> Sent: Monday, March 9, 2020 7:09 PM
>>>>>>
>>>>>> I'm happy to take better suggestions to replace the "full" command line
>>>>>> option and Kconfig prompt tokens. I don't think though that "fault" and
>>>>>> "write-fault" are really suitable there.
>>>>>
>>>>> I think we may just allow both r/w access to scratch page for such bogus
>>>>> device, which may make 'full' more reasonable since we now fully
>>>>> contain in-fly DMAs. I'm not sure about the value of keeping write-fault
>>>>> alone for such devices (just because one observed his specific device only
>>>>> has problem with read-fault).
>>>>
>>>> Well, a fundamental problem I have here is that I still don't know
>>>> the _exact_ conditions for the observed hangs. I consider it unlikely
>>>> for IOMMU read faults to cause hangs, but for write faults to be
>>>> "fine".
>>>
>>> AFAIK it's because the writes are posted and so any faults are just ignored, whereas a read fault
>> being synchronous causes the device's state machine to lock up. It really is observed behaviour.
>>>
>>>> It would seem more likely to me that e.g. a non-present
>>>> context entry might cause issues. If that was the case, we wouldn't
>>>> need to handle reads and writes differently; we could instead install
>>>> an all zero top level page table. And we'd still get all faults that
>>>> are supposed to surface. But perhaps Paul did try this back then, and
>>>> it turned out to not be an option.
>>>>
>>>
>>> The only info I had was that faults on DMA reads had to avoided
>>> completely. I did not have access to the h/w in question at the
>>> time. I may be able to get it now.
>>
>> I see. The implication then is, as Kevin said, that we mustn't run
>> guests with _any_ IOMMU PTEs having their "read" bits clear.
>> Anything that's "not present" now would need directing to a scratch
>> page. I then further wonder what effect reads to addresses beyond
>> AGAW would have. It may be impossible to arrange for sufficiently
>> secure pass-through with such a device, at which point - again as
>> said by Kevin - there may be little point in the scratch page
>> based quarantining.
>>
> 
> Well, I can't say there's little point in it as it does fix a host lock-up.
> 
> You say "as Kevin said, that we mustn't run guests with _any_ IOMMU
> PTEs having their "read" bits clear"... I can't find that. I did
> find where he said "In concept any IOMMU page table (dom0, dom_io
> or domU) for such bogus device should not include invalid entry",
> but that's a different thing.

In which way?

> However, is a really saying that things will break if any of the
> PTEs has their present bit clear?

Well, you said that read faults are fatal (to the host). Reads will,
for any address with an unpopulated PTE, result in a fault and hence
by implication be fatal.

(As an aside, other than in x86's CPU page tables, IOMMU page tables
in both their AMD and Intel incarnations don't have "present" bits -
they have "read" and "write" ones only.)

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-10 15:44           ` Jan Beulich
@ 2020-03-10 16:05             ` Paul Durrant
  2020-03-10 16:46               ` Jan Beulich
  2020-03-13  3:22               ` Tian, Kevin
  0 siblings, 2 replies; 21+ messages in thread
From: Paul Durrant @ 2020-03-10 16:05 UTC (permalink / raw)
  To: 'Jan Beulich'
  Cc: xen-devel, 'Tian, Kevin', 'Andrew Cooper'

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 10 March 2020 15:44
> To: paul@xen.org
> Cc: xen-devel@lists.xenproject.org; 'Tian, Kevin' <kevin.tian@intel.com>; 'Andrew Cooper'
> <andrew.cooper3@citrix.com>
> Subject: Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
> 
> On 10.03.2020 16:13, Paul Durrant wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: 10 March 2020 15:05
> >> To: paul@xen.org
> >> Cc: 'Tian, Kevin' <kevin.tian@intel.com>; xen-devel@lists.xenproject.org; 'Andrew Cooper'
> >> <andrew.cooper3@citrix.com>
> >> Subject: Re: [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
> >>
> >> On 10.03.2020 13:30, Paul Durrant wrote:
> >>>> -----Original Message-----
> >>>> From: Jan Beulich <jbeulich@suse.com>
> >>>> Sent: 10 March 2020 10:27
> >>>> To: Tian, Kevin <kevin.tian@intel.com>; Paul Durrant <paul@xen.org>
> >>>> Cc: xen-devel@lists.xenproject.org; Andrew Cooper <andrew.cooper3@citrix.com>
> >>>> Subject: Re: [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
> >>>>
> >>>> On 10.03.2020 04:43, Tian, Kevin wrote:
> >>>>>> From: Jan Beulich <jbeulich@suse.com>
> >>>>>> Sent: Monday, March 9, 2020 7:09 PM
> >>>>>>
> >>>>>> I'm happy to take better suggestions to replace the "full" command line
> >>>>>> option and Kconfig prompt tokens. I don't think though that "fault" and
> >>>>>> "write-fault" are really suitable there.
> >>>>>
> >>>>> I think we may just allow both r/w access to scratch page for such bogus
> >>>>> device, which may make 'full' more reasonable since we now fully
> >>>>> contain in-fly DMAs. I'm not sure about the value of keeping write-fault
> >>>>> alone for such devices (just because one observed his specific device only
> >>>>> has problem with read-fault).
> >>>>
> >>>> Well, a fundamental problem I have here is that I still don't know
> >>>> the _exact_ conditions for the observed hangs. I consider it unlikely
> >>>> for IOMMU read faults to cause hangs, but for write faults to be
> >>>> "fine".
> >>>
> >>> AFAIK it's because the writes are posted and so any faults are just ignored, whereas a read fault
> >> being synchronous causes the device's state machine to lock up. It really is observed behaviour.
> >>>
> >>>> It would seem more likely to me that e.g. a non-present
> >>>> context entry might cause issues. If that was the case, we wouldn't
> >>>> need to handle reads and writes differently; we could instead install
> >>>> an all zero top level page table. And we'd still get all faults that
> >>>> are supposed to surface. But perhaps Paul did try this back then, and
> >>>> it turned out to not be an option.
> >>>>
> >>>
> >>> The only info I had was that faults on DMA reads had to avoided
> >>> completely. I did not have access to the h/w in question at the
> >>> time. I may be able to get it now.
> >>
> >> I see. The implication then is, as Kevin said, that we mustn't run
> >> guests with _any_ IOMMU PTEs having their "read" bits clear.
> >> Anything that's "not present" now would need directing to a scratch
> >> page. I then further wonder what effect reads to addresses beyond
> >> AGAW would have. It may be impossible to arrange for sufficiently
> >> secure pass-through with such a device, at which point - again as
> >> said by Kevin - there may be little point in the scratch page
> >> based quarantining.
> >>
> >
> > Well, I can't say there's little point in it as it does fix a host lock-up.
> >
> > You say "as Kevin said, that we mustn't run guests with _any_ IOMMU
> > PTEs having their "read" bits clear"... I can't find that. I did
> > find where he said "In concept any IOMMU page table (dom0, dom_io
> > or domU) for such bogus device should not include invalid entry",
> > but that's a different thing.
> 
> In which way?

In that the PTE would still be a valid entry? It would have read perm clear, yes, but that doesn't make the PTE invalid.

> 
> > However, is a really saying that things will break if any of the
> > PTEs has their present bit clear?
> 
> Well, you said that read faults are fatal (to the host). Reads will,
> for any address with an unpopulated PTE, result in a fault and hence
> by implication be fatal.

Oh I see. I thought there was an implication that the IOMMU could not cope with non-present PTEs in some way. Agreed that, when the device is assigned to the guest, then it can arrange (via ballooning) for a non-present entry to be hit by a read transaction, resulting in a lock-up. But dealing with a malicious guest was not the issue at hand... dealing with a buggy device that still tried to DMA after reset and whilst in quarantine was the problem.

> 
> (As an aside, other than in x86's CPU page tables, IOMMU page tables
> in both their AMD and Intel incarnations don't have "present" bits -
> they have "read" and "write" ones only.)
> 

Oh, ok. I must have misunderstood the purpose of the 'pr' bit in the AMD IOMMU PTE then. Agreed that VT-d does not seem to have an explicit present bit (now that I look at the implementation of dma_pte_present()).

  Paul

> Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-10 16:05             ` Paul Durrant
@ 2020-03-10 16:46               ` Jan Beulich
  2020-03-13  3:22               ` Tian, Kevin
  1 sibling, 0 replies; 21+ messages in thread
From: Jan Beulich @ 2020-03-10 16:46 UTC (permalink / raw)
  To: paul; +Cc: xen-devel, 'Tian, Kevin', 'Andrew Cooper'

On 10.03.2020 17:05, Paul Durrant wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 10 March 2020 15:44
>> To: paul@xen.org
>> Cc: xen-devel@lists.xenproject.org; 'Tian, Kevin' <kevin.tian@intel.com>; 'Andrew Cooper'
>> <andrew.cooper3@citrix.com>
>> Subject: Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
>>
>> On 10.03.2020 16:13, Paul Durrant wrote:
>>>> -----Original Message-----
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: 10 March 2020 15:05
>>>> To: paul@xen.org
>>>> Cc: 'Tian, Kevin' <kevin.tian@intel.com>; xen-devel@lists.xenproject.org; 'Andrew Cooper'
>>>> <andrew.cooper3@citrix.com>
>>>> Subject: Re: [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
>>>>
>>>> On 10.03.2020 13:30, Paul Durrant wrote:
>>>>>> -----Original Message-----
>>>>>> From: Jan Beulich <jbeulich@suse.com>
>>>>>> Sent: 10 March 2020 10:27
>>>>>> To: Tian, Kevin <kevin.tian@intel.com>; Paul Durrant <paul@xen.org>
>>>>>> Cc: xen-devel@lists.xenproject.org; Andrew Cooper <andrew.cooper3@citrix.com>
>>>>>> Subject: Re: [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
>>>>>>
>>>>>> On 10.03.2020 04:43, Tian, Kevin wrote:
>>>>>>>> From: Jan Beulich <jbeulich@suse.com>
>>>>>>>> Sent: Monday, March 9, 2020 7:09 PM
>>>>>>>>
>>>>>>>> I'm happy to take better suggestions to replace the "full" command line
>>>>>>>> option and Kconfig prompt tokens. I don't think though that "fault" and
>>>>>>>> "write-fault" are really suitable there.
>>>>>>>
>>>>>>> I think we may just allow both r/w access to scratch page for such bogus
>>>>>>> device, which may make 'full' more reasonable since we now fully
>>>>>>> contain in-fly DMAs. I'm not sure about the value of keeping write-fault
>>>>>>> alone for such devices (just because one observed his specific device only
>>>>>>> has problem with read-fault).
>>>>>>
>>>>>> Well, a fundamental problem I have here is that I still don't know
>>>>>> the _exact_ conditions for the observed hangs. I consider it unlikely
>>>>>> for IOMMU read faults to cause hangs, but for write faults to be
>>>>>> "fine".
>>>>>
>>>>> AFAIK it's because the writes are posted and so any faults are just ignored, whereas a read fault
>>>> being synchronous causes the device's state machine to lock up. It really is observed behaviour.
>>>>>
>>>>>> It would seem more likely to me that e.g. a non-present
>>>>>> context entry might cause issues. If that was the case, we wouldn't
>>>>>> need to handle reads and writes differently; we could instead install
>>>>>> an all zero top level page table. And we'd still get all faults that
>>>>>> are supposed to surface. But perhaps Paul did try this back then, and
>>>>>> it turned out to not be an option.
>>>>>>
>>>>>
>>>>> The only info I had was that faults on DMA reads had to avoided
>>>>> completely. I did not have access to the h/w in question at the
>>>>> time. I may be able to get it now.
>>>>
>>>> I see. The implication then is, as Kevin said, that we mustn't run
>>>> guests with _any_ IOMMU PTEs having their "read" bits clear.
>>>> Anything that's "not present" now would need directing to a scratch
>>>> page. I then further wonder what effect reads to addresses beyond
>>>> AGAW would have. It may be impossible to arrange for sufficiently
>>>> secure pass-through with such a device, at which point - again as
>>>> said by Kevin - there may be little point in the scratch page
>>>> based quarantining.
>>>>
>>>
>>> Well, I can't say there's little point in it as it does fix a host lock-up.
>>>
>>> You say "as Kevin said, that we mustn't run guests with _any_ IOMMU
>>> PTEs having their "read" bits clear"... I can't find that. I did
>>> find where he said "In concept any IOMMU page table (dom0, dom_io
>>> or domU) for such bogus device should not include invalid entry",
>>> but that's a different thing.
>>
>> In which way?
> 
> In that the PTE would still be a valid entry? It would have read
> perm clear, yes, but that doesn't make the PTE invalid.

It was my understanding that Kevin meant "invalid" to represent
both "read" and "write" clear.

>>> However, is a really saying that things will break if any of the
>>> PTEs has their present bit clear?
>>
>> Well, you said that read faults are fatal (to the host). Reads will,
>> for any address with an unpopulated PTE, result in a fault and hence
>> by implication be fatal.
> 
> Oh I see. I thought there was an implication that the IOMMU could
> not cope with non-present PTEs in some way.

Well, that's what you were telling me. Or what I understood of what
you were saying.

> Agreed that, when the device is assigned to the guest, then it can
> arrange (via ballooning) for a non-present entry to be hit by a
> read transaction, resulting in a lock-up.

Not just ballooning, but simply by programming a bogus address into
whatever controls the device's DMA operation.

> But dealing with a
> malicious guest was not the issue at hand... dealing with a buggy
> device that still tried to DMA after reset and whilst in quarantine
> was the problem.

Sure, but assigning such a buggy device to a guest is suspicious in
the first place. Sure, if you trust your guest (including it being
bug free) then all you want is for the system to remain stable after
the guest died.

>> (As an aside, other than in x86's CPU page tables, IOMMU page tables
>> in both their AMD and Intel incarnations don't have "present" bits -
>> they have "read" and "write" ones only.)
>>
> 
> Oh, ok. I must have misunderstood the purpose of the 'pr' bit in
> the AMD IOMMU PTE then.

Oh, no - I overlooked it, applying too much VT-d to that code. I'm
sorry.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-10 10:26   ` Jan Beulich
  2020-03-10 12:30     ` Paul Durrant
@ 2020-03-13  2:28     ` Tian, Kevin
  1 sibling, 0 replies; 21+ messages in thread
From: Tian, Kevin @ 2020-03-13  2:28 UTC (permalink / raw)
  To: Jan Beulich, Paul Durrant; +Cc: xen-devel, Andrew Cooper

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, March 10, 2020 6:27 PM
> 
> On 10.03.2020 04:43, Tian, Kevin wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Monday, March 9, 2020 7:09 PM
> >>
> >> I'm happy to take better suggestions to replace the "full" command line
> >> option and Kconfig prompt tokens. I don't think though that "fault" and
> >> "write-fault" are really suitable there.
> >
> > I think we may just allow both r/w access to scratch page for such bogus
> > device, which may make 'full' more reasonable since we now fully
> > contain in-fly DMAs. I'm not sure about the value of keeping write-fault
> > alone for such devices (just because one observed his specific device only
> > has problem with read-fault).
> 
> Well, a fundamental problem I have here is that I still don't know
> the _exact_ conditions for the observed hangs. I consider it unlikely
> for IOMMU read faults to cause hangs, but for write faults to be
> "fine". It would seem more likely to me that e.g. a non-present
> context entry might cause issues. If that was the case, we wouldn't
> need to handle reads and writes differently; we could instead install
> an all zero top level page table. And we'd still get all faults that
> are supposed to surface. But perhaps Paul did try this back then, and
> it turned out to not be an option.
> 
> The choice of letting writes continue to fault was based on (a) this
> having been tested to work on the affected system(s) and (b) also
> letting writes go to a scratch page requiring a per-device scratch
> page (and associated page tables) rather than a system-wide one, as
> devices coming from different domains would otherwise be able to
> observe data written to memory by respectively "foreign" devices
> (and hence domains).

ok, it is a valid point.

> 
> But this is all guesswork without the firmware writers of affected
> systems giving us at least some hints.
> 
> > alternatively I also thought about whether whitelisting the problematic
> > devices through another option (e.g. nofault=b:d:f) could provide more
> > value. In concept any IOMMU page table (dom0, dom_io or domU)
> > for such bogus device should not include invalid entry, even when
> > quarantine is not specified. However I'm not sure whether it's worthy of
> > going so far...
> 
> Indeed. Question though is whether this bad behavior is device specific
> (rather than e.g. system dependent). Plus - as per above - question
> also is whether it's really leaf (or intermediate) page table entry
> presence which actually matters here. If it was, I agree we shouldn't
> have any non-present entries anywhere in the page table trees.
> 
> Jan
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-10 10:31   ` Jan Beulich
@ 2020-03-13  3:05     ` Tian, Kevin
  2020-03-13  8:10       ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Tian, Kevin @ 2020-03-13  3:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Paul Durrant, Andrew Cooper

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, March 10, 2020 6:31 PM
> 
> On 10.03.2020 06:30, Tian, Kevin wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Monday, March 9, 2020 7:09 PM
> >>
> >> Containing still in flight DMA was introduced to work around certain
> >> devices / systems hanging hard upon hitting a "not-present" IOMMU fault.
> >> Passing through (such) devices (on such systems) is inherently insecure
> >> (as guests could easily arrange for IOMMU faults of any kind to occur).
> >> Defaulting to a mode where admins may not even become aware of
> issues
> >> with devices can be considered undesirable. Therefore convert this mode
> >> of operation to an optional one, not one enabled by default.
> >
> > Here is another thought. The whole point of quarantine is to contain
> > the device after it is deassigned from untrusted guest.
> 
> I'd question the "untrusted" here. Assigning devices to untrusted
> guests is problematic anyway, unless you're the device manufacturer
> and device firmware writer, and hence you can guarantee the device
> to not offer any backdoors or alike. Therefore I view quarantining

Aren't all guests untrusted from hypervisor p.o.v, which is the reason
why IOMMU was introduced in the first place? 'untrust' imo applies
to either malicious guest code or inadvertent guest behavior that you 
mentioned below, which may both put the device in a state that the
hypervisor wants to quarantine before moving the device to another
owner. On the other hand, one must have certain degree of trust on 
the device itself, that the hardware won't do bad thing that is outside
of the guest driver control, e.g. for SR-IOV capable device the trust 
is about the device enforcing completed isolation between VFs...

> more as a protection of the host against bad device behavior, and
> less against malicious guest behavior (while the driver in the
> guest surely has some influence, consider the guest getting crashed
> and even a well-behaved driver hence not getting any chance to
> silence the device).

I may overlook the history of quarantine feature. Based on my study
of quarantine related changes, looks initially Paul pointed out such 
problem that a device may have in-fly DMAs to touch dom0 memory
after it is deassigned. Then he introduced the quarantine concept by
putting a quarantined device into dom_io w/o any valid mapping, 
with a whitelist approach. You later extended as a default behavior
for all PCI devices. Now Paul found some device which cannot tolerate
read-fault and then came up this scratch-page idea.

Now I wonder why we are doing such explicit quarantine in the first
place. Shouldn't we always seek resetting the device when it is deassigned
from a guest? 'reset' should cancel/quiescent all in-fly DMAs for most
devices if they implement 'reset' correctly. If doing that way, we don't
need a quarantine option at all, and then the bogus device in Paul's
latest finding could be handled by a standalone option w/o struggling
whether 'full' is a right name vs. 'basic'. or we may introduce a reset
flag when assigning such device to indicate such special requirement,
so a scratch page/dom_io can be linked specifically for such device 
post reset, assuming it is not a platform-level bug from Paul's response?  

> 
> Jan
> 
> > However, the
> > passthrough of such device is already insecure, as you mentioned.
> > Then why do we care about making deassignment of such device
> > secure without doing anything to secure it when it is assigned and being
> > used by untrusted guest? I feel that one should simply put such device
> > out of the quarantine list in the first place, i.e. set quarantine=false and
> > then use tool to quarantine a whitelist of devices by skipping the bad one.
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-10 16:05             ` Paul Durrant
  2020-03-10 16:46               ` Jan Beulich
@ 2020-03-13  3:22               ` Tian, Kevin
  2020-03-13  9:26                 ` Paul Durrant
  1 sibling, 1 reply; 21+ messages in thread
From: Tian, Kevin @ 2020-03-13  3:22 UTC (permalink / raw)
  To: paul, 'Jan Beulich'; +Cc: xen-devel, 'Andrew Cooper'

> From: Paul Durrant <xadimgnik@gmail.com>
> Sent: Wednesday, March 11, 2020 12:05 AM
> 
[...]
> 
> >
> > > However, is a really saying that things will break if any of the
> > > PTEs has their present bit clear?
> >
> > Well, you said that read faults are fatal (to the host). Reads will,
> > for any address with an unpopulated PTE, result in a fault and hence
> > by implication be fatal.
> 
> Oh I see. I thought there was an implication that the IOMMU could not cope
> with non-present PTEs in some way. Agreed that, when the device is assigned
> to the guest, then it can arrange (via ballooning) for a non-present entry to
> be hit by a read transaction, resulting in a lock-up. But dealing with a
> malicious guest was not the issue at hand... dealing with a buggy device that
> still tried to DMA after reset and whilst in quarantine was the problem.
> 

More thinking on this, I wonder whether the scratch page is sufficient, or
whether we should support such device in the first place. Looking at
0c35d446:
--
    The reason for doing this is that some hardware may continue to re-try
    DMA (despite FLR) in the event of an error, or even BME being cleared, and
    will fail to deal with DMA read faults gracefully. Having a scratch page
    mapped will allow pending DMA reads to complete and thus such buggy
    hardware will eventually be quiesced.
--

'eventually'... what does it exactly mean? How would an user know a 
device has been quiesced before he attempts to re-assign the device
to other domU or dom0? by guess? Note the exact behavior of such
device, after different guest behaviors (hang, kill, bug, etc.), is not
documented. Who knows whether a in-fly DMA may be triggered when
the new owner starts to initialize the device again? How many stale
states are remaining on such device which, even not triggerring in-fly
DMAs, may change the desired behavior of the new owner? e.g. it's
possible one control register configured by the old owner, but not
touched by the new owner. If it cannot be reset, what's the point of
supporting assignment of such bogus device?

Thereby I feel any support of such bogus device should be maintained
offtree, instead of in upstream Xen. Thoughts?

Thanks
Kevin
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-13  3:05     ` Tian, Kevin
@ 2020-03-13  8:10       ` Jan Beulich
  2020-03-13  9:26         ` Paul Durrant
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2020-03-13  8:10 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: xen-devel, Paul Durrant, Andrew Cooper

On 13.03.2020 04:05, Tian, Kevin wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Tuesday, March 10, 2020 6:31 PM
>>
>> On 10.03.2020 06:30, Tian, Kevin wrote:
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Monday, March 9, 2020 7:09 PM
>>>>
>>>> Containing still in flight DMA was introduced to work around certain
>>>> devices / systems hanging hard upon hitting a "not-present" IOMMU fault.
>>>> Passing through (such) devices (on such systems) is inherently insecure
>>>> (as guests could easily arrange for IOMMU faults of any kind to occur).
>>>> Defaulting to a mode where admins may not even become aware of
>> issues
>>>> with devices can be considered undesirable. Therefore convert this mode
>>>> of operation to an optional one, not one enabled by default.
>>>
>>> Here is another thought. The whole point of quarantine is to contain
>>> the device after it is deassigned from untrusted guest.
>>
>> I'd question the "untrusted" here. Assigning devices to untrusted
>> guests is problematic anyway, unless you're the device manufacturer
>> and device firmware writer, and hence you can guarantee the device
>> to not offer any backdoors or alike. Therefore I view quarantining
> 
> Aren't all guests untrusted from hypervisor p.o.v, which is the reason
> why IOMMU was introduced in the first place?

"Untrusted" is always meant from the perspective of the host admin.

> I may overlook the history of quarantine feature. Based on my study
> of quarantine related changes, looks initially Paul pointed out such 
> problem that a device may have in-fly DMAs to touch dom0 memory
> after it is deassigned. Then he introduced the quarantine concept by
> putting a quarantined device into dom_io w/o any valid mapping, 
> with a whitelist approach. You later extended as a default behavior
> for all PCI devices. Now Paul found some device which cannot tolerate
> read-fault and then came up this scratch-page idea.
> 
> Now I wonder why we are doing such explicit quarantine in the first
> place. Shouldn't we always seek resetting the device when it is deassigned
> from a guest? 'reset' should cancel/quiescent all in-fly DMAs for most
> devices if they implement 'reset' correctly.

And the important word here is "should". Paul and colleagues found
it may not do so in reality.

> If doing that way, we don't
> need a quarantine option at all, and then the bogus device in Paul's
> latest finding could be handled by a standalone option w/o struggling
> whether 'full' is a right name vs. 'basic'. or we may introduce a reset
> flag when assigning such device to indicate such special requirement,
> so a scratch page/dom_io can be linked specifically for such device 
> post reset, assuming it is not a platform-level bug from Paul's response?  

Which would imply host admins to know such properties of their
devices, and better _without_ first having run into problems.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-13  3:22               ` Tian, Kevin
@ 2020-03-13  9:26                 ` Paul Durrant
  2020-03-17  6:10                   ` Tian, Kevin
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Durrant @ 2020-03-13  9:26 UTC (permalink / raw)
  To: 'Tian, Kevin', 'Jan Beulich'
  Cc: xen-devel, 'Andrew Cooper'

> -----Original Message-----
> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: 13 March 2020 03:23
> To: paul@xen.org; 'Jan Beulich' <jbeulich@suse.com>
> Cc: xen-devel@lists.xenproject.org; 'Andrew Cooper' <andrew.cooper3@citrix.com>
> Subject: RE: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
> 
> > From: Paul Durrant <xadimgnik@gmail.com>
> > Sent: Wednesday, March 11, 2020 12:05 AM
> >
> [...]
> >
> > >
> > > > However, is a really saying that things will break if any of the
> > > > PTEs has their present bit clear?
> > >
> > > Well, you said that read faults are fatal (to the host). Reads will,
> > > for any address with an unpopulated PTE, result in a fault and hence
> > > by implication be fatal.
> >
> > Oh I see. I thought there was an implication that the IOMMU could not cope
> > with non-present PTEs in some way. Agreed that, when the device is assigned
> > to the guest, then it can arrange (via ballooning) for a non-present entry to
> > be hit by a read transaction, resulting in a lock-up. But dealing with a
> > malicious guest was not the issue at hand... dealing with a buggy device that
> > still tried to DMA after reset and whilst in quarantine was the problem.
> >
> 
> More thinking on this, I wonder whether the scratch page is sufficient, or
> whether we should support such device in the first place. Looking at
> 0c35d446:
> --
>     The reason for doing this is that some hardware may continue to re-try
>     DMA (despite FLR) in the event of an error, or even BME being cleared, and
>     will fail to deal with DMA read faults gracefully. Having a scratch page
>     mapped will allow pending DMA reads to complete and thus such buggy
>     hardware will eventually be quiesced.
> --
> 
> 'eventually'... what does it exactly mean?

It means after a period of time we can only determine empirically.

> How would an user know a
> device has been quiesced before he attempts to re-assign the device
> to other domU or dom0? by guess?

Yes, a guess, but an educated one.

> Note the exact behavior of such
> device, after different guest behaviors (hang, kill, bug, etc.), is not
> documented. Who knows whether a in-fly DMA may be triggered when
> the new owner starts to initialize the device again? How many stale
> states are remaining on such device which, even not triggerring in-fly
> DMAs, may change the desired behavior of the new owner? e.g. it's
> possible one control register configured by the old owner, but not
> touched by the new owner. If it cannot be reset, what's the point of
> supporting assignment of such bogus device?
> 

Because I'm afraid it is quite ubiquitous and we need to deal with it.

> Thereby I feel any support of such bogus device should be maintained
> offtree, instead of in upstream Xen. Thoughts?
> 

I don't see the harm in the code being upstream. There may well be other devices with similar issues and it provides an option for an admin to try.

  Paul


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-13  8:10       ` Jan Beulich
@ 2020-03-13  9:26         ` Paul Durrant
  2020-03-17  6:20           ` Tian, Kevin
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Durrant @ 2020-03-13  9:26 UTC (permalink / raw)
  To: 'Jan Beulich', 'Tian, Kevin'
  Cc: xen-devel, 'Andrew Cooper'

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 13 March 2020 08:10
> To: Tian, Kevin <kevin.tian@intel.com>
> Cc: xen-devel@lists.xenproject.org; Andrew Cooper <andrew.cooper3@citrix.com>; Paul Durrant
> <paul@xen.org>
> Subject: Re: [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
> 
> On 13.03.2020 04:05, Tian, Kevin wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Tuesday, March 10, 2020 6:31 PM
> >>
> >> On 10.03.2020 06:30, Tian, Kevin wrote:
> >>>> From: Jan Beulich <jbeulich@suse.com>
> >>>> Sent: Monday, March 9, 2020 7:09 PM
> >>>>
> >>>> Containing still in flight DMA was introduced to work around certain
> >>>> devices / systems hanging hard upon hitting a "not-present" IOMMU fault.
> >>>> Passing through (such) devices (on such systems) is inherently insecure
> >>>> (as guests could easily arrange for IOMMU faults of any kind to occur).
> >>>> Defaulting to a mode where admins may not even become aware of
> >> issues
> >>>> with devices can be considered undesirable. Therefore convert this mode
> >>>> of operation to an optional one, not one enabled by default.
> >>>
> >>> Here is another thought. The whole point of quarantine is to contain
> >>> the device after it is deassigned from untrusted guest.
> >>
> >> I'd question the "untrusted" here. Assigning devices to untrusted
> >> guests is problematic anyway, unless you're the device manufacturer
> >> and device firmware writer, and hence you can guarantee the device
> >> to not offer any backdoors or alike. Therefore I view quarantining
> >
> > Aren't all guests untrusted from hypervisor p.o.v, which is the reason
> > why IOMMU was introduced in the first place?
> 
> "Untrusted" is always meant from the perspective of the host admin.
> 
> > I may overlook the history of quarantine feature. Based on my study
> > of quarantine related changes, looks initially Paul pointed out such
> > problem that a device may have in-fly DMAs to touch dom0 memory
> > after it is deassigned. Then he introduced the quarantine concept by
> > putting a quarantined device into dom_io w/o any valid mapping,
> > with a whitelist approach. You later extended as a default behavior
> > for all PCI devices. Now Paul found some device which cannot tolerate
> > read-fault and then came up this scratch-page idea.
> >
> > Now I wonder why we are doing such explicit quarantine in the first
> > place. Shouldn't we always seek resetting the device when it is deassigned
> > from a guest? 'reset' should cancel/quiescent all in-fly DMAs for most
> > devices if they implement 'reset' correctly.
> 
> And the important word here is "should". Paul and colleagues found
> it may not do so in reality.

Yeah... we have to live with what we've got. Yes, it's buggy as hell but we have to do our best to stop it wedging hosts whilst trying to prevent scribbling over critical parts of memory.

> 
> > If doing that way, we don't
> > need a quarantine option at all, and then the bogus device in Paul's
> > latest finding could be handled by a standalone option w/o struggling
> > whether 'full' is a right name vs. 'basic'. or we may introduce a reset
> > flag when assigning such device to indicate such special requirement,
> > so a scratch page/dom_io can be linked specifically for such device
> > post reset, assuming it is not a platform-level bug from Paul's response?
> 
> Which would imply host admins to know such properties of their
> devices, and better _without_ first having run into problems.
> 

It is a device-level bug. We could, I guess, have a per-device quirk to say whether it should get a context entry pointing at a scratch page or not.

  Paul


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-13  9:26                 ` Paul Durrant
@ 2020-03-17  6:10                   ` Tian, Kevin
  0 siblings, 0 replies; 21+ messages in thread
From: Tian, Kevin @ 2020-03-17  6:10 UTC (permalink / raw)
  To: paul, 'Jan Beulich'; +Cc: xen-devel, 'Andrew Cooper'

> From: Paul Durrant <xadimgnik@gmail.com>
> Sent: Friday, March 13, 2020 5:26 PM
> 
> > -----Original Message-----
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: 13 March 2020 03:23
> > To: paul@xen.org; 'Jan Beulich' <jbeulich@suse.com>
> > Cc: xen-devel@lists.xenproject.org; 'Andrew Cooper'
> <andrew.cooper3@citrix.com>
> > Subject: RE: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of
> quarantined devices optional
> >
> > > From: Paul Durrant <xadimgnik@gmail.com>
> > > Sent: Wednesday, March 11, 2020 12:05 AM
> > >
> > [...]
> > >
> > > >
> > > > > However, is a really saying that things will break if any of the
> > > > > PTEs has their present bit clear?
> > > >
> > > > Well, you said that read faults are fatal (to the host). Reads will,
> > > > for any address with an unpopulated PTE, result in a fault and hence
> > > > by implication be fatal.
> > >
> > > Oh I see. I thought there was an implication that the IOMMU could not
> cope
> > > with non-present PTEs in some way. Agreed that, when the device is
> assigned
> > > to the guest, then it can arrange (via ballooning) for a non-present entry
> to
> > > be hit by a read transaction, resulting in a lock-up. But dealing with a
> > > malicious guest was not the issue at hand... dealing with a buggy device
> that
> > > still tried to DMA after reset and whilst in quarantine was the problem.
> > >
> >
> > More thinking on this, I wonder whether the scratch page is sufficient, or
> > whether we should support such device in the first place. Looking at
> > 0c35d446:
> > --
> >     The reason for doing this is that some hardware may continue to re-try
> >     DMA (despite FLR) in the event of an error, or even BME being cleared,
> and
> >     will fail to deal with DMA read faults gracefully. Having a scratch page
> >     mapped will allow pending DMA reads to complete and thus such buggy
> >     hardware will eventually be quiesced.
> > --
> >
> > 'eventually'... what does it exactly mean?
> 
> It means after a period of time we can only determine empirically.
> 
> > How would an user know a
> > device has been quiesced before he attempts to re-assign the device
> > to other domU or dom0? by guess?
> 
> Yes, a guess, but an educated one.
> 
> > Note the exact behavior of such
> > device, after different guest behaviors (hang, kill, bug, etc.), is not
> > documented. Who knows whether a in-fly DMA may be triggered when
> > the new owner starts to initialize the device again? How many stale
> > states are remaining on such device which, even not triggerring in-fly
> > DMAs, may change the desired behavior of the new owner? e.g. it's
> > possible one control register configured by the old owner, but not
> > touched by the new owner. If it cannot be reset, what's the point of
> > supporting assignment of such bogus device?
> >
> 
> Because I'm afraid it is quite ubiquitous and we need to deal with it.

it sounds the whole passthrough is in dangerous if your statement is true...

> 
> > Thereby I feel any support of such bogus device should be maintained
> > offtree, instead of in upstream Xen. Thoughts?
> >
> 
> I don't see the harm in the code being upstream. There may well be other
> devices with similar issues and it provides an option for an admin to try.
> 
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional
  2020-03-13  9:26         ` Paul Durrant
@ 2020-03-17  6:20           ` Tian, Kevin
  0 siblings, 0 replies; 21+ messages in thread
From: Tian, Kevin @ 2020-03-17  6:20 UTC (permalink / raw)
  To: paul, 'Jan Beulich'; +Cc: xen-devel, 'Andrew Cooper'

> From: Paul Durrant <xadimgnik@gmail.com>
> Sent: Friday, March 13, 2020 5:26 PM
> 
> > -----Original Message-----
> > From: Jan Beulich <jbeulich@suse.com>
> > Sent: 13 March 2020 08:10
> > To: Tian, Kevin <kevin.tian@intel.com>
> > Cc: xen-devel@lists.xenproject.org; Andrew Cooper
> <andrew.cooper3@citrix.com>; Paul Durrant
> > <paul@xen.org>
> > Subject: Re: [PATCH v3] IOMMU: make DMA containment of quarantined
> devices optional
> >
> > On 13.03.2020 04:05, Tian, Kevin wrote:
> > >> From: Jan Beulich <jbeulich@suse.com>
> > >> Sent: Tuesday, March 10, 2020 6:31 PM
> > >>
> > >> On 10.03.2020 06:30, Tian, Kevin wrote:
> > >>>> From: Jan Beulich <jbeulich@suse.com>
> > >>>> Sent: Monday, March 9, 2020 7:09 PM
> > >>>>
> > >>>> Containing still in flight DMA was introduced to work around certain
> > >>>> devices / systems hanging hard upon hitting a "not-present" IOMMU
> fault.
> > >>>> Passing through (such) devices (on such systems) is inherently
> insecure
> > >>>> (as guests could easily arrange for IOMMU faults of any kind to occur).
> > >>>> Defaulting to a mode where admins may not even become aware of
> > >> issues
> > >>>> with devices can be considered undesirable. Therefore convert this
> mode
> > >>>> of operation to an optional one, not one enabled by default.
> > >>>
> > >>> Here is another thought. The whole point of quarantine is to contain
> > >>> the device after it is deassigned from untrusted guest.
> > >>
> > >> I'd question the "untrusted" here. Assigning devices to untrusted
> > >> guests is problematic anyway, unless you're the device manufacturer
> > >> and device firmware writer, and hence you can guarantee the device
> > >> to not offer any backdoors or alike. Therefore I view quarantining
> > >
> > > Aren't all guests untrusted from hypervisor p.o.v, which is the reason
> > > why IOMMU was introduced in the first place?
> >
> > "Untrusted" is always meant from the perspective of the host admin.
> >
> > > I may overlook the history of quarantine feature. Based on my study
> > > of quarantine related changes, looks initially Paul pointed out such
> > > problem that a device may have in-fly DMAs to touch dom0 memory
> > > after it is deassigned. Then he introduced the quarantine concept by
> > > putting a quarantined device into dom_io w/o any valid mapping,
> > > with a whitelist approach. You later extended as a default behavior
> > > for all PCI devices. Now Paul found some device which cannot tolerate
> > > read-fault and then came up this scratch-page idea.
> > >
> > > Now I wonder why we are doing such explicit quarantine in the first
> > > place. Shouldn't we always seek resetting the device when it is deassigned
> > > from a guest? 'reset' should cancel/quiescent all in-fly DMAs for most
> > > devices if they implement 'reset' correctly.
> >
> > And the important word here is "should". Paul and colleagues found
> > it may not do so in reality.
> 
> Yeah... we have to live with what we've got. Yes, it's buggy as hell but we
> have to do our best to stop it wedging hosts whilst trying to prevent
> scribbling over critical parts of memory.

'should' is applied to most devices who can gracefully handle the reset
request, and then for exception we come with ad-hoc band-aid.

> 
> >
> > > If doing that way, we don't
> > > need a quarantine option at all, and then the bogus device in Paul's
> > > latest finding could be handled by a standalone option w/o struggling
> > > whether 'full' is a right name vs. 'basic'. or we may introduce a reset
> > > flag when assigning such device to indicate such special requirement,
> > > so a scratch page/dom_io can be linked specifically for such device
> > > post reset, assuming it is not a platform-level bug from Paul's response?
> >
> > Which would imply host admins to know such properties of their
> > devices, and better _without_ first having run into problems.
> >
> 
> It is a device-level bug. We could, I guess, have a per-device quirk to say
> whether it should get a context entry pointing at a scratch page or not.
> 

If all except me think supporting such device in upstream is necessary, 
per-device quirk at least sounds better to me than enforcing a global 
quarantine policy for all devices.

Thanks
Kevin
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-03-17  6:21 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-09 11:09 [Xen-devel] [PATCH v3] IOMMU: make DMA containment of quarantined devices optional Jan Beulich
2020-03-10  3:43 ` Tian, Kevin
2020-03-10 10:26   ` Jan Beulich
2020-03-10 12:30     ` Paul Durrant
2020-03-10 15:04       ` Jan Beulich
2020-03-10 15:13         ` Paul Durrant
2020-03-10 15:44           ` Jan Beulich
2020-03-10 16:05             ` Paul Durrant
2020-03-10 16:46               ` Jan Beulich
2020-03-13  3:22               ` Tian, Kevin
2020-03-13  9:26                 ` Paul Durrant
2020-03-17  6:10                   ` Tian, Kevin
2020-03-13  2:28     ` Tian, Kevin
2020-03-10  5:30 ` Tian, Kevin
2020-03-10 10:31   ` Jan Beulich
2020-03-13  3:05     ` Tian, Kevin
2020-03-13  8:10       ` Jan Beulich
2020-03-13  9:26         ` Paul Durrant
2020-03-17  6:20           ` Tian, Kevin
2020-03-10  8:58 ` Paul Durrant
2020-03-10 10:10   ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.