linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
@ 2018-12-05  2:19 Chao Gao
  2018-12-05  9:32 ` Roger Pau Monné
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Chao Gao @ 2018-12-05  2:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: Chao Gao, Boris Ostrovsky, Juergen Gross, Stefano Stabellini,
	Jia-Ju Bai, xen-devel, Roger Pau Monné,
	Jan Beulich

I find some pass-thru devices don't work any more across guest reboot.
Assigning it to another guest also meets the same issue. And the only
way to make it work again is un-binding and binding it to pciback.
Someone reported this issue one year ago [1]. More detail also can be
found in [2].

The root-cause is Xen's internal MSI-X state isn't reset properly
during reboot or re-assignment. In the above case, Xen set maskall bit
to mask all MSI interrupts after it detected a potential security
issue. Even after device reset, Xen didn't reset its internal maskall
bit. As a result, maskall bit would be set again in next write to
MSI-X message control register.

Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
internal state of a device, we employ it to fix this issue rather than
introducing another dedicated sub-hypercall.

Note that PHYSDEVOPS_release_msix() will fail if the mapping between
the device's msix and pirq has been created. This limitation prevents
us calling this function when detaching a device from a guest during
guest shutdown. Thus it is called right before calling
PHYSDEVOPS_prepare_msix().

[1]: https://lists.xenproject.org/archives/html/xen-devel/2017-09/
     msg02520.html
[2]: https://lists.xen.org/archives/html/xen-devel/2018-11/msg01616.html

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 drivers/xen/xen-pciback/pci_stub.c | 49 ++++++++++++++++++++++++++++++++++++++
 drivers/xen/xen-pciback/pciback.h  |  1 +
 drivers/xen/xen-pciback/xenbus.c   | 10 ++++++++
 3 files changed, 60 insertions(+)

diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index 59661db..f8623d0 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -87,6 +87,55 @@ static struct pcistub_device *pcistub_device_alloc(struct pci_dev *dev)
 	return psdev;
 }
 
+/*
+ * Reset Xen internal MSI-X state by invoking PHYSDEVOP_{release, prepare}_msix.
+ */
+int pcistub_msix_reset(struct pci_dev *dev)
+{
+#ifdef CONFIG_PCI_MSI
+	if (dev->msix_cap) {
+		struct physdev_pci_device ppdev = {
+			.seg = pci_domain_nr(dev->bus),
+			.bus = dev->bus->number,
+			.devfn = dev->devfn
+		};
+		int err;
+		u16 val;
+
+		/*
+		 * Do a write first to flush Xen's internal state to hardware
+		 * such that the following read can infer whether MSI-X maskall
+		 * bit is set by Xen.
+		 */
+		pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &val);
+		pci_write_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, val);
+
+		pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &val);
+		if (!(val & PCI_MSIX_FLAGS_MASKALL))
+			return 0;
+
+		pr_info("Reset MSI-X state for device %04x:%02x:%02x.%d\n",
+			ppdev.seg, ppdev.bus, PCI_SLOT(ppdev.devfn),
+			PCI_FUNC(ppdev.devfn));
+
+		err = HYPERVISOR_physdev_op(PHYSDEVOP_release_msix, &ppdev);
+		if (err) {
+			dev_warn(&dev->dev, "MSI-X release failed (%d)\n",
+				 err);
+			return err;
+		}
+
+		err = HYPERVISOR_physdev_op(PHYSDEVOP_prepare_msix, &ppdev);
+		if (err) {
+			dev_err(&dev->dev, "MSI-X preparation failed (%d)\n",
+				err);
+			return err;
+		}
+	}
+#endif
+	return 0;
+}
+
 /* Don't call this directly as it's called by pcistub_device_put */
 static void pcistub_device_release(struct kref *kref)
 {
diff --git a/drivers/xen/xen-pciback/pciback.h b/drivers/xen/xen-pciback/pciback.h
index 263c059..9046154 100644
--- a/drivers/xen/xen-pciback/pciback.h
+++ b/drivers/xen/xen-pciback/pciback.h
@@ -66,6 +66,7 @@ struct pci_dev *pcistub_get_pci_dev_by_slot(struct xen_pcibk_device *pdev,
 struct pci_dev *pcistub_get_pci_dev(struct xen_pcibk_device *pdev,
 				    struct pci_dev *dev);
 void pcistub_put_pci_dev(struct pci_dev *dev);
+int pcistub_msix_reset(struct pci_dev *dev);
 
 /* Ensure a device is turned off or reset */
 void xen_pcibk_reset_device(struct pci_dev *pdev);
diff --git a/drivers/xen/xen-pciback/xenbus.c b/drivers/xen/xen-pciback/xenbus.c
index 581c4e1..2f71f26 100644
--- a/drivers/xen/xen-pciback/xenbus.c
+++ b/drivers/xen/xen-pciback/xenbus.c
@@ -243,6 +243,16 @@ static int xen_pcibk_export_device(struct xen_pcibk_device *pdev,
 		goto out;
 	}
 
+	/*
+	 * Reset Xen's internal MSI-X state before exposing a device.
+	 *
+	 * In some cases, Xen's internal MSI-X state is not clean, which would
+	 * incur the new guest cannot receive MSIs.
+	 */
+	err = pcistub_msix_reset(dev);
+	if (err)
+		goto out;
+
 	err = xen_pcibk_add_pci_dev(pdev, dev, devid,
 				    xen_pcibk_publish_pci_dev);
 	if (err)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2018-12-05  2:19 [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device Chao Gao
@ 2018-12-05  9:32 ` Roger Pau Monné
  2018-12-05 14:01   ` Boris Ostrovsky
  2018-12-06  2:18   ` Chao Gao
  2018-12-05 16:26 ` Jan Beulich
  2019-09-13 10:02 ` Spassov, Stanislav
  2 siblings, 2 replies; 18+ messages in thread
From: Roger Pau Monné @ 2018-12-05  9:32 UTC (permalink / raw)
  To: Chao Gao
  Cc: linux-kernel, Boris Ostrovsky, Juergen Gross, Stefano Stabellini,
	Jia-Ju Bai, xen-devel, Jan Beulich

On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
> I find some pass-thru devices don't work any more across guest reboot.
> Assigning it to another guest also meets the same issue. And the only
> way to make it work again is un-binding and binding it to pciback.
> Someone reported this issue one year ago [1]. More detail also can be
> found in [2].
> 
> The root-cause is Xen's internal MSI-X state isn't reset properly
> during reboot or re-assignment. In the above case, Xen set maskall bit
> to mask all MSI interrupts after it detected a potential security
> issue. Even after device reset, Xen didn't reset its internal maskall
> bit. As a result, maskall bit would be set again in next write to
> MSI-X message control register.
> 
> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
> internal state of a device, we employ it to fix this issue rather than
> introducing another dedicated sub-hypercall.
> 
> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
> the device's msix and pirq has been created. This limitation prevents
> us calling this function when detaching a device from a guest during
> guest shutdown. Thus it is called right before calling
> PHYSDEVOPS_prepare_msix().

s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
() at the end of the hypercall name since it's not a function.

I'm also wondering why the release can't be done when the device is
detached from the guest (or the guest has been shut down). This makes
me worry about the raciness of the attach/detach procedure: if there's
a state where pciback assumes the device has been detached from the
guest, but there are still pirqs bound, an attempt to attach to
another guest in such state will fail.

> [1]: https://lists.xenproject.org/archives/html/xen-devel/2017-09/
>      msg02520.html
> [2]: https://lists.xen.org/archives/html/xen-devel/2018-11/msg01616.html
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
>  drivers/xen/xen-pciback/pci_stub.c | 49 ++++++++++++++++++++++++++++++++++++++
>  drivers/xen/xen-pciback/pciback.h  |  1 +
>  drivers/xen/xen-pciback/xenbus.c   | 10 ++++++++
>  3 files changed, 60 insertions(+)
> 
> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
> index 59661db..f8623d0 100644
> --- a/drivers/xen/xen-pciback/pci_stub.c
> +++ b/drivers/xen/xen-pciback/pci_stub.c
> @@ -87,6 +87,55 @@ static struct pcistub_device *pcistub_device_alloc(struct pci_dev *dev)
>  	return psdev;
>  }
>  
> +/*
> + * Reset Xen internal MSI-X state by invoking PHYSDEVOP_{release, prepare}_msix.
> + */
> +int pcistub_msix_reset(struct pci_dev *dev)
> +{
> +#ifdef CONFIG_PCI_MSI
> +	if (dev->msix_cap) {
> +		struct physdev_pci_device ppdev = {
> +			.seg = pci_domain_nr(dev->bus),
> +			.bus = dev->bus->number,
> +			.devfn = dev->devfn
> +		};
> +		int err;
> +		u16 val;
> +
> +		/*
> +		 * Do a write first to flush Xen's internal state to hardware
> +		 * such that the following read can infer whether MSI-X maskall
> +		 * bit is set by Xen.
> +		 */
> +		pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &val);
> +		pci_write_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, val);
> +
> +		pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &val);
> +		if (!(val & PCI_MSIX_FLAGS_MASKALL))
> +			return 0;

I would just perform a reset regardless of the maskall value, which
would also allow you to skip the read/write dance that you do above.

ATM we are only concerned about the maskall bit, but there's no reason
why prepare/release couldn't do more stuff in the future.

> +
> +		pr_info("Reset MSI-X state for device %04x:%02x:%02x.%d\n",
> +			ppdev.seg, ppdev.bus, PCI_SLOT(ppdev.devfn),
> +			PCI_FUNC(ppdev.devfn));
> +
> +		err = HYPERVISOR_physdev_op(PHYSDEVOP_release_msix, &ppdev);
> +		if (err) {
> +			dev_warn(&dev->dev, "MSI-X release failed (%d)\n",
> +				 err);

This is a warn, while the message below is an err, any reason for
the difference in log level?

> +			return err;
> +		}
> +
> +		err = HYPERVISOR_physdev_op(PHYSDEVOP_prepare_msix, &ppdev);
> +		if (err) {
> +			dev_err(&dev->dev, "MSI-X preparation failed (%d)\n",
> +				err);
> +			return err;

Thanks, Roger.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2018-12-05  9:32 ` Roger Pau Monné
@ 2018-12-05 14:01   ` Boris Ostrovsky
  2018-12-12  7:06     ` Chao Gao
  2018-12-06  2:18   ` Chao Gao
  1 sibling, 1 reply; 18+ messages in thread
From: Boris Ostrovsky @ 2018-12-05 14:01 UTC (permalink / raw)
  To: Roger Pau Monné, Chao Gao
  Cc: linux-kernel, Juergen Gross, Stefano Stabellini, Jia-Ju Bai,
	xen-devel, Jan Beulich

On 12/5/18 4:32 AM, Roger Pau Monné wrote:
> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>> I find some pass-thru devices don't work any more across guest reboot.
>> Assigning it to another guest also meets the same issue. And the only
>> way to make it work again is un-binding and binding it to pciback.
>> Someone reported this issue one year ago [1]. More detail also can be
>> found in [2].
>>
>> The root-cause is Xen's internal MSI-X state isn't reset properly
>> during reboot or re-assignment. In the above case, Xen set maskall bit
>> to mask all MSI interrupts after it detected a potential security
>> issue. Even after device reset, Xen didn't reset its internal maskall
>> bit. As a result, maskall bit would be set again in next write to
>> MSI-X message control register.
>>
>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>> internal state of a device, we employ it to fix this issue rather than
>> introducing another dedicated sub-hypercall.
>>
>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>> the device's msix and pirq has been created. This limitation prevents
>> us calling this function when detaching a device from a guest during
>> guest shutdown. Thus it is called right before calling
>> PHYSDEVOPS_prepare_msix().
> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
> () at the end of the hypercall name since it's not a function.
>
> I'm also wondering why the release can't be done when the device is
> detached from the guest (or the guest has been shut down). This makes
> me worry about the raciness of the attach/detach procedure: if there's
> a state where pciback assumes the device has been detached from the
> guest, but there are still pirqs bound, an attempt to attach to
> another guest in such state will fail.

I wonder whether this additional reset functionality could be done out
of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
and then do the extra things that are not properly done there.

-boris



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2018-12-05  2:19 [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device Chao Gao
  2018-12-05  9:32 ` Roger Pau Monné
@ 2018-12-05 16:26 ` Jan Beulich
  2019-09-13 10:02 ` Spassov, Stanislav
  2 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2018-12-05 16:26 UTC (permalink / raw)
  To: Chao Gao
  Cc: Roger Pau Monne, Jia-Ju Bai, Stefano Stabellini, xen-devel,
	Boris Ostrovsky, Juergen Gross, linux-kernel

>>> On 05.12.18 at 03:19, <chao.gao@intel.com> wrote:
> --- a/drivers/xen/xen-pciback/pci_stub.c
> +++ b/drivers/xen/xen-pciback/pci_stub.c
> @@ -87,6 +87,55 @@ static struct pcistub_device *pcistub_device_alloc(struct pci_dev *dev)
>  	return psdev;
>  }
>  
> +/*
> + * Reset Xen internal MSI-X state by invoking PHYSDEVOP_{release, prepare}_msix.
> + */
> +int pcistub_msix_reset(struct pci_dev *dev)
> +{
> +#ifdef CONFIG_PCI_MSI
> +	if (dev->msix_cap) {
> +		struct physdev_pci_device ppdev = {
> +			.seg = pci_domain_nr(dev->bus),
> +			.bus = dev->bus->number,
> +			.devfn = dev->devfn
> +		};
> +		int err;
> +		u16 val;
> +
> +		/*
> +		 * Do a write first to flush Xen's internal state to hardware
> +		 * such that the following read can infer whether MSI-X maskall
> +		 * bit is set by Xen.
> +		 */
> +		pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &val);
> +		pci_write_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, val);
> +
> +		pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &val);
> +		if (!(val & PCI_MSIX_FLAGS_MASKALL))
> +			return 0;

I'm agreeing with prior comments: I don't see why you need
this conditional.

> +		pr_info("Reset MSI-X state for device %04x:%02x:%02x.%d\n",
> +			ppdev.seg, ppdev.bus, PCI_SLOT(ppdev.devfn),
> +			PCI_FUNC(ppdev.devfn));
> +
> +		err = HYPERVISOR_physdev_op(PHYSDEVOP_release_msix, &ppdev);
> +		if (err) {
> +			dev_warn(&dev->dev, "MSI-X release failed (%d)\n",
> +				 err);
> +			return err;
> +		}
> +
> +		err = HYPERVISOR_physdev_op(PHYSDEVOP_prepare_msix, &ppdev);
> +		if (err) {
> +			dev_err(&dev->dev, "MSI-X preparation failed (%d)\n",
> +				err);

Please can you make both log messages distinguishable from
the pre-existing ones, to aid diagnosis of possible probelms?

Jan



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2018-12-05  9:32 ` Roger Pau Monné
  2018-12-05 14:01   ` Boris Ostrovsky
@ 2018-12-06  2:18   ` Chao Gao
  1 sibling, 0 replies; 18+ messages in thread
From: Chao Gao @ 2018-12-06  2:18 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: linux-kernel, Boris Ostrovsky, Juergen Gross, Stefano Stabellini,
	Jia-Ju Bai, xen-devel, Jan Beulich

On Wed, Dec 05, 2018 at 10:32:23AM +0100, Roger Pau Monné wrote:
>On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>> I find some pass-thru devices don't work any more across guest reboot.
>> Assigning it to another guest also meets the same issue. And the only
>> way to make it work again is un-binding and binding it to pciback.
>> Someone reported this issue one year ago [1]. More detail also can be
>> found in [2].
>> 
>> The root-cause is Xen's internal MSI-X state isn't reset properly
>> during reboot or re-assignment. In the above case, Xen set maskall bit
>> to mask all MSI interrupts after it detected a potential security
>> issue. Even after device reset, Xen didn't reset its internal maskall
>> bit. As a result, maskall bit would be set again in next write to
>> MSI-X message control register.
>> 
>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>> internal state of a device, we employ it to fix this issue rather than
>> introducing another dedicated sub-hypercall.
>> 
>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>> the device's msix and pirq has been created. This limitation prevents
>> us calling this function when detaching a device from a guest during
>> guest shutdown. Thus it is called right before calling
>> PHYSDEVOPS_prepare_msix().
>
>s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>() at the end of the hypercall name since it's not a function.

Will do.

>
>I'm also wondering why the release can't be done when the device is
>detached from the guest (or the guest has been shut down). This makes
>me worry about the raciness of the attach/detach procedure: if there's
>a state where pciback assumes the device has been detached from the
>guest, but there are still pirqs bound, an attempt to attach to
>another guest in such state will fail.

I think your concern is valid. This is the exact case in which Xen sets
maskall flag. Qemu didn't do msix cleanup (it crashed or guest didn't
request qemu to do this. Anyway, we couldn't rely on qemu), leaving
pirqs bound. pciback doesn't manage pirqs so it doesn't know how to do
cleanup for msix. Then pciback does device reset and disables
memory decoding. After detaching a device from the domain, Xen can do
further cleanup for a closing VM, including freeing pirqs of the VM, but
saw memory decoding disabled. Hence, the maskall flag is set.

>
>> [1]: https://lists.xenproject.org/archives/html/xen-devel/2017-09/
>>      msg02520.html
>> [2]: https://lists.xen.org/archives/html/xen-devel/2018-11/msg01616.html
>> 
>> Signed-off-by: Chao Gao <chao.gao@intel.com>
>> ---
>>  drivers/xen/xen-pciback/pci_stub.c | 49 ++++++++++++++++++++++++++++++++++++++
>>  drivers/xen/xen-pciback/pciback.h  |  1 +
>>  drivers/xen/xen-pciback/xenbus.c   | 10 ++++++++
>>  3 files changed, 60 insertions(+)
>> 
>> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
>> index 59661db..f8623d0 100644
>> --- a/drivers/xen/xen-pciback/pci_stub.c
>> +++ b/drivers/xen/xen-pciback/pci_stub.c
>> @@ -87,6 +87,55 @@ static struct pcistub_device *pcistub_device_alloc(struct pci_dev *dev)
>>  	return psdev;
>>  }
>>  
>> +/*
>> + * Reset Xen internal MSI-X state by invoking PHYSDEVOP_{release, prepare}_msix.
>> + */
>> +int pcistub_msix_reset(struct pci_dev *dev)
>> +{
>> +#ifdef CONFIG_PCI_MSI
>> +	if (dev->msix_cap) {
>> +		struct physdev_pci_device ppdev = {
>> +			.seg = pci_domain_nr(dev->bus),
>> +			.bus = dev->bus->number,
>> +			.devfn = dev->devfn
>> +		};
>> +		int err;
>> +		u16 val;
>> +
>> +		/*
>> +		 * Do a write first to flush Xen's internal state to hardware
>> +		 * such that the following read can infer whether MSI-X maskall
>> +		 * bit is set by Xen.
>> +		 */
>> +		pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &val);
>> +		pci_write_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, val);
>> +
>> +		pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &val);
>> +		if (!(val & PCI_MSIX_FLAGS_MASKALL))
>> +			return 0;
>
>I would just perform a reset regardless of the maskall value, which
>would also allow you to skip the read/write dance that you do above.
>
>ATM we are only concerned about the maskall bit, but there's no reason
>why prepare/release couldn't do more stuff in the future.
>
>> +
>> +		pr_info("Reset MSI-X state for device %04x:%02x:%02x.%d\n",
>> +			ppdev.seg, ppdev.bus, PCI_SLOT(ppdev.devfn),
>> +			PCI_FUNC(ppdev.devfn));
>> +
>> +		err = HYPERVISOR_physdev_op(PHYSDEVOP_release_msix, &ppdev);
>> +		if (err) {
>> +			dev_warn(&dev->dev, "MSI-X release failed (%d)\n",
>> +				 err);
>
>This is a warn, while the message below is an err, any reason for
>the difference in log level?
>

Will fix this. I just copy code snippte from the existing call site.
and also make them distinguishable per Jan's suggestion.

Thanks
Chao

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2018-12-05 14:01   ` Boris Ostrovsky
@ 2018-12-12  7:06     ` Chao Gao
  2018-12-12  8:51       ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Chao Gao @ 2018-12-12  7:06 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Roger Pau Monné,
	linux-kernel, Juergen Gross, Stefano Stabellini, Jia-Ju Bai,
	xen-devel, Jan Beulich

On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>>> I find some pass-thru devices don't work any more across guest reboot.
>>> Assigning it to another guest also meets the same issue. And the only
>>> way to make it work again is un-binding and binding it to pciback.
>>> Someone reported this issue one year ago [1]. More detail also can be
>>> found in [2].
>>>
>>> The root-cause is Xen's internal MSI-X state isn't reset properly
>>> during reboot or re-assignment. In the above case, Xen set maskall bit
>>> to mask all MSI interrupts after it detected a potential security
>>> issue. Even after device reset, Xen didn't reset its internal maskall
>>> bit. As a result, maskall bit would be set again in next write to
>>> MSI-X message control register.
>>>
>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>>> internal state of a device, we employ it to fix this issue rather than
>>> introducing another dedicated sub-hypercall.
>>>
>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>>> the device's msix and pirq has been created. This limitation prevents
>>> us calling this function when detaching a device from a guest during
>>> guest shutdown. Thus it is called right before calling
>>> PHYSDEVOPS_prepare_msix().
>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>> () at the end of the hypercall name since it's not a function.
>>
>> I'm also wondering why the release can't be done when the device is
>> detached from the guest (or the guest has been shut down). This makes
>> me worry about the raciness of the attach/detach procedure: if there's
>> a state where pciback assumes the device has been detached from the
>> guest, but there are still pirqs bound, an attempt to attach to
>> another guest in such state will fail.
>
>I wonder whether this additional reset functionality could be done out
>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>and then do the extra things that are not properly done there.

No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
without error. But ATM, xen expects that no msi is bound to pirq when
doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
at last minute, which happens after device reset in xen_pcibk_xenbus_remove().

Thanks
Chao

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2018-12-12  7:06     ` Chao Gao
@ 2018-12-12  8:51       ` Jan Beulich
  2018-12-12 15:18         ` Chao Gao
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2018-12-12  8:51 UTC (permalink / raw)
  To: Chao Gao
  Cc: Roger Pau Monne, Jia-Ju Bai, Stefano Stabellini, xen-devel,
	Boris Ostrovsky, Juergen Gross, linux-kernel

>>> On 12.12.18 at 08:06, <chao.gao@intel.com> wrote:
> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>>>> I find some pass-thru devices don't work any more across guest reboot.
>>>> Assigning it to another guest also meets the same issue. And the only
>>>> way to make it work again is un-binding and binding it to pciback.
>>>> Someone reported this issue one year ago [1]. More detail also can be
>>>> found in [2].
>>>>
>>>> The root-cause is Xen's internal MSI-X state isn't reset properly
>>>> during reboot or re-assignment. In the above case, Xen set maskall bit
>>>> to mask all MSI interrupts after it detected a potential security
>>>> issue. Even after device reset, Xen didn't reset its internal maskall
>>>> bit. As a result, maskall bit would be set again in next write to
>>>> MSI-X message control register.
>>>>
>>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>>>> internal state of a device, we employ it to fix this issue rather than
>>>> introducing another dedicated sub-hypercall.
>>>>
>>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>>>> the device's msix and pirq has been created. This limitation prevents
>>>> us calling this function when detaching a device from a guest during
>>>> guest shutdown. Thus it is called right before calling
>>>> PHYSDEVOPS_prepare_msix().
>>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>>> () at the end of the hypercall name since it's not a function.
>>>
>>> I'm also wondering why the release can't be done when the device is
>>> detached from the guest (or the guest has been shut down). This makes
>>> me worry about the raciness of the attach/detach procedure: if there's
>>> a state where pciback assumes the device has been detached from the
>>> guest, but there are still pirqs bound, an attempt to attach to
>>> another guest in such state will fail.
>>
>>I wonder whether this additional reset functionality could be done out
>>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>>and then do the extra things that are not properly done there.
> 
> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
> without error. But ATM, xen expects that no msi is bound to pirq when
> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
> at last minute, which happens after device reset in 
> xen_pcibk_xenbus_remove().

But that may need taking care of: I don't think it is a good idea to have
anything left from the prior owning domain when the device gets reset.
I.e. left over IRQ bindings should perhaps be forcibly cleared before
invoking the reset; in fact I'd expect this to happen in the course of
domain destruction, and I'd expect the device reset to come after the
domain was cleaned up. Perhaps simply an ordering issue in the tool
stack?

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2018-12-12  8:51       ` Jan Beulich
@ 2018-12-12 15:18         ` Chao Gao
  2018-12-12 15:21           ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Chao Gao @ 2018-12-12 15:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monne, Jia-Ju Bai, Stefano Stabellini, xen-devel,
	Boris Ostrovsky, Juergen Gross, linux-kernel

On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
>>>> On 12.12.18 at 08:06, <chao.gao@intel.com> wrote:
>> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>>>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>>>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>>>>> I find some pass-thru devices don't work any more across guest reboot.
>>>>> Assigning it to another guest also meets the same issue. And the only
>>>>> way to make it work again is un-binding and binding it to pciback.
>>>>> Someone reported this issue one year ago [1]. More detail also can be
>>>>> found in [2].
>>>>>
>>>>> The root-cause is Xen's internal MSI-X state isn't reset properly
>>>>> during reboot or re-assignment. In the above case, Xen set maskall bit
>>>>> to mask all MSI interrupts after it detected a potential security
>>>>> issue. Even after device reset, Xen didn't reset its internal maskall
>>>>> bit. As a result, maskall bit would be set again in next write to
>>>>> MSI-X message control register.
>>>>>
>>>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>>>>> internal state of a device, we employ it to fix this issue rather than
>>>>> introducing another dedicated sub-hypercall.
>>>>>
>>>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>>>>> the device's msix and pirq has been created. This limitation prevents
>>>>> us calling this function when detaching a device from a guest during
>>>>> guest shutdown. Thus it is called right before calling
>>>>> PHYSDEVOPS_prepare_msix().
>>>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>>>> () at the end of the hypercall name since it's not a function.
>>>>
>>>> I'm also wondering why the release can't be done when the device is
>>>> detached from the guest (or the guest has been shut down). This makes
>>>> me worry about the raciness of the attach/detach procedure: if there's
>>>> a state where pciback assumes the device has been detached from the
>>>> guest, but there are still pirqs bound, an attempt to attach to
>>>> another guest in such state will fail.
>>>
>>>I wonder whether this additional reset functionality could be done out
>>>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>>>and then do the extra things that are not properly done there.
>> 
>> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
>> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
>> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
>> without error. But ATM, xen expects that no msi is bound to pirq when
>> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
>> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
>> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
>> at last minute, which happens after device reset in 
>> xen_pcibk_xenbus_remove().
>
>But that may need taking care of: I don't think it is a good idea to have
>anything left from the prior owning domain when the device gets reset.
>I.e. left over IRQ bindings should perhaps be forcibly cleared before
>invoking the reset;

Agree. How about pciback to track the established IRQ bindings? Then
pciback can clear irq binding before invoking the reset.

>in fact I'd expect this to happen in the course of
>domain destruction, and I'd expect the device reset to come after the
>domain was cleaned up. Perhaps simply an ordering issue in the tool
>stack?

I don't think reversing the sequences of device reset and domain
destruction would be simple. Furthermore, during device hot-unplug,
device reset is done when the owner is alive. So if we use domain
destruction to enforce all irq binding cleared, in theory, it won't be
applicable to hot-unplug case (if qemu's hot-unplug logic is
compromised).

Thanks
Chao

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2018-12-12 15:18         ` Chao Gao
@ 2018-12-12 15:21           ` Jan Beulich
  2018-12-13  3:46             ` Chao Gao
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2018-12-12 15:21 UTC (permalink / raw)
  To: Chao Gao
  Cc: Roger Pau Monne, Jia-Ju Bai, Stefano Stabellini, xen-devel,
	Boris Ostrovsky, Juergen Gross, linux-kernel

>>> On 12.12.18 at 16:18, <chao.gao@intel.com> wrote:
> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
>>>>> On 12.12.18 at 08:06, <chao.gao@intel.com> wrote:
>>> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>>>>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>>>>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>>>>>> I find some pass-thru devices don't work any more across guest reboot.
>>>>>> Assigning it to another guest also meets the same issue. And the only
>>>>>> way to make it work again is un-binding and binding it to pciback.
>>>>>> Someone reported this issue one year ago [1]. More detail also can be
>>>>>> found in [2].
>>>>>>
>>>>>> The root-cause is Xen's internal MSI-X state isn't reset properly
>>>>>> during reboot or re-assignment. In the above case, Xen set maskall bit
>>>>>> to mask all MSI interrupts after it detected a potential security
>>>>>> issue. Even after device reset, Xen didn't reset its internal maskall
>>>>>> bit. As a result, maskall bit would be set again in next write to
>>>>>> MSI-X message control register.
>>>>>>
>>>>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>>>>>> internal state of a device, we employ it to fix this issue rather than
>>>>>> introducing another dedicated sub-hypercall.
>>>>>>
>>>>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>>>>>> the device's msix and pirq has been created. This limitation prevents
>>>>>> us calling this function when detaching a device from a guest during
>>>>>> guest shutdown. Thus it is called right before calling
>>>>>> PHYSDEVOPS_prepare_msix().
>>>>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>>>>> () at the end of the hypercall name since it's not a function.
>>>>>
>>>>> I'm also wondering why the release can't be done when the device is
>>>>> detached from the guest (or the guest has been shut down). This makes
>>>>> me worry about the raciness of the attach/detach procedure: if there's
>>>>> a state where pciback assumes the device has been detached from the
>>>>> guest, but there are still pirqs bound, an attempt to attach to
>>>>> another guest in such state will fail.
>>>>
>>>>I wonder whether this additional reset functionality could be done out
>>>>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>>>>and then do the extra things that are not properly done there.
>>> 
>>> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
>>> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
>>> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
>>> without error. But ATM, xen expects that no msi is bound to pirq when
>>> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
>>> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
>>> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
>>> at last minute, which happens after device reset in 
>>> xen_pcibk_xenbus_remove().
>>
>>But that may need taking care of: I don't think it is a good idea to have
>>anything left from the prior owning domain when the device gets reset.
>>I.e. left over IRQ bindings should perhaps be forcibly cleared before
>>invoking the reset;
> 
> Agree. How about pciback to track the established IRQ bindings? Then
> pciback can clear irq binding before invoking the reset.

How would pciback even know of those mappings, when it's qemu
who establishes (and manages) them?

>>in fact I'd expect this to happen in the course of
>>domain destruction, and I'd expect the device reset to come after the
>>domain was cleaned up. Perhaps simply an ordering issue in the tool
>>stack?
> 
> I don't think reversing the sequences of device reset and domain
> destruction would be simple. Furthermore, during device hot-unplug,
> device reset is done when the owner is alive. So if we use domain
> destruction to enforce all irq binding cleared, in theory, it won't be
> applicable to hot-unplug case (if qemu's hot-unplug logic is
> compromised).

Even in the hot-unplug case the tool stack could issue unbind
requests, behind the back of the possibly compromised qemu,
once neither the guest nor qemu have access to the device
anymore.

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2018-12-12 15:21           ` Jan Beulich
@ 2018-12-13  3:46             ` Chao Gao
  2018-12-13  7:54               ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Chao Gao @ 2018-12-13  3:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monne, Jia-Ju Bai, Stefano Stabellini, xen-devel,
	Boris Ostrovsky, Juergen Gross, linux-kernel

On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich wrote:
>>>> On 12.12.18 at 16:18, <chao.gao@intel.com> wrote:
>> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
>>>>>> On 12.12.18 at 08:06, <chao.gao@intel.com> wrote:
>>>> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>>>>>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>>>>>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>>>>>>> I find some pass-thru devices don't work any more across guest reboot.
>>>>>>> Assigning it to another guest also meets the same issue. And the only
>>>>>>> way to make it work again is un-binding and binding it to pciback.
>>>>>>> Someone reported this issue one year ago [1]. More detail also can be
>>>>>>> found in [2].
>>>>>>>
>>>>>>> The root-cause is Xen's internal MSI-X state isn't reset properly
>>>>>>> during reboot or re-assignment. In the above case, Xen set maskall bit
>>>>>>> to mask all MSI interrupts after it detected a potential security
>>>>>>> issue. Even after device reset, Xen didn't reset its internal maskall
>>>>>>> bit. As a result, maskall bit would be set again in next write to
>>>>>>> MSI-X message control register.
>>>>>>>
>>>>>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>>>>>>> internal state of a device, we employ it to fix this issue rather than
>>>>>>> introducing another dedicated sub-hypercall.
>>>>>>>
>>>>>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>>>>>>> the device's msix and pirq has been created. This limitation prevents
>>>>>>> us calling this function when detaching a device from a guest during
>>>>>>> guest shutdown. Thus it is called right before calling
>>>>>>> PHYSDEVOPS_prepare_msix().
>>>>>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>>>>>> () at the end of the hypercall name since it's not a function.
>>>>>>
>>>>>> I'm also wondering why the release can't be done when the device is
>>>>>> detached from the guest (or the guest has been shut down). This makes
>>>>>> me worry about the raciness of the attach/detach procedure: if there's
>>>>>> a state where pciback assumes the device has been detached from the
>>>>>> guest, but there are still pirqs bound, an attempt to attach to
>>>>>> another guest in such state will fail.
>>>>>
>>>>>I wonder whether this additional reset functionality could be done out
>>>>>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>>>>>and then do the extra things that are not properly done there.
>>>> 
>>>> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
>>>> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
>>>> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
>>>> without error. But ATM, xen expects that no msi is bound to pirq when
>>>> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
>>>> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
>>>> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
>>>> at last minute, which happens after device reset in 
>>>> xen_pcibk_xenbus_remove().
>>>
>>>But that may need taking care of: I don't think it is a good idea to have
>>>anything left from the prior owning domain when the device gets reset.
>>>I.e. left over IRQ bindings should perhaps be forcibly cleared before
>>>invoking the reset;
>> 
>> Agree. How about pciback to track the established IRQ bindings? Then
>> pciback can clear irq binding before invoking the reset.
>
>How would pciback even know of those mappings, when it's qemu
>who establishes (and manages) them?

I meant to expose some interfaces from pciback. And pciback serves
as the proxy of IRQ (un)binding APIs.

>
>>>in fact I'd expect this to happen in the course of
>>>domain destruction, and I'd expect the device reset to come after the
>>>domain was cleaned up. Perhaps simply an ordering issue in the tool
>>>stack?
>> 
>> I don't think reversing the sequences of device reset and domain
>> destruction would be simple. Furthermore, during device hot-unplug,
>> device reset is done when the owner is alive. So if we use domain
>> destruction to enforce all irq binding cleared, in theory, it won't be
>> applicable to hot-unplug case (if qemu's hot-unplug logic is
>> compromised).
>
>Even in the hot-unplug case the tool stack could issue unbind
>requests, behind the back of the possibly compromised qemu,
>once neither the guest nor qemu have access to the device
>anymore.

But currently, tool stack doesn't know the remaining IRQ bindings.
If tool stack can maintaine IRQ binding information of a pass-thru
device (stored in Xenstore?), we can come up with a clean solution
without modifying linux kernel and Xen.

Thanks
Chao

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2018-12-13  3:46             ` Chao Gao
@ 2018-12-13  7:54               ` Jan Beulich
  2018-12-13 13:17                 ` Chao Gao
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2018-12-13  7:54 UTC (permalink / raw)
  To: Chao Gao
  Cc: Roger Pau Monne, Jia-Ju Bai, Stefano Stabellini, xen-devel,
	Boris Ostrovsky, Juergen Gross, linux-kernel

>>> On 13.12.18 at 04:46, <chao.gao@intel.com> wrote:
> On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich wrote:
>>>>> On 12.12.18 at 16:18, <chao.gao@intel.com> wrote:
>>> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
>>>>>>> On 12.12.18 at 08:06, <chao.gao@intel.com> wrote:
>>>>> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>>>>>>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>>>>>>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>>>>>>>> I find some pass-thru devices don't work any more across guest reboot.
>>>>>>>> Assigning it to another guest also meets the same issue. And the only
>>>>>>>> way to make it work again is un-binding and binding it to pciback.
>>>>>>>> Someone reported this issue one year ago [1]. More detail also can be
>>>>>>>> found in [2].
>>>>>>>>
>>>>>>>> The root-cause is Xen's internal MSI-X state isn't reset properly
>>>>>>>> during reboot or re-assignment. In the above case, Xen set maskall bit
>>>>>>>> to mask all MSI interrupts after it detected a potential security
>>>>>>>> issue. Even after device reset, Xen didn't reset its internal maskall
>>>>>>>> bit. As a result, maskall bit would be set again in next write to
>>>>>>>> MSI-X message control register.
>>>>>>>>
>>>>>>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>>>>>>>> internal state of a device, we employ it to fix this issue rather than
>>>>>>>> introducing another dedicated sub-hypercall.
>>>>>>>>
>>>>>>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>>>>>>>> the device's msix and pirq has been created. This limitation prevents
>>>>>>>> us calling this function when detaching a device from a guest during
>>>>>>>> guest shutdown. Thus it is called right before calling
>>>>>>>> PHYSDEVOPS_prepare_msix().
>>>>>>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>>>>>>> () at the end of the hypercall name since it's not a function.
>>>>>>>
>>>>>>> I'm also wondering why the release can't be done when the device is
>>>>>>> detached from the guest (or the guest has been shut down). This makes
>>>>>>> me worry about the raciness of the attach/detach procedure: if there's
>>>>>>> a state where pciback assumes the device has been detached from the
>>>>>>> guest, but there are still pirqs bound, an attempt to attach to
>>>>>>> another guest in such state will fail.
>>>>>>
>>>>>>I wonder whether this additional reset functionality could be done out
>>>>>>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>>>>>>and then do the extra things that are not properly done there.
>>>>> 
>>>>> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
>>>>> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
>>>>> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
>>>>> without error. But ATM, xen expects that no msi is bound to pirq when
>>>>> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
>>>>> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
>>>>> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
>>>>> at last minute, which happens after device reset in 
>>>>> xen_pcibk_xenbus_remove().
>>>>
>>>>But that may need taking care of: I don't think it is a good idea to have
>>>>anything left from the prior owning domain when the device gets reset.
>>>>I.e. left over IRQ bindings should perhaps be forcibly cleared before
>>>>invoking the reset;
>>> 
>>> Agree. How about pciback to track the established IRQ bindings? Then
>>> pciback can clear irq binding before invoking the reset.
>>
>>How would pciback even know of those mappings, when it's qemu
>>who establishes (and manages) them?
> 
> I meant to expose some interfaces from pciback. And pciback serves
> as the proxy of IRQ (un)binding APIs.

If at all possible we should avoid having to change more parties (qemu,
libxc, kernel, hypervisor) than really necessary. Remember that such
a bug fix may want backporting, and making sure affected people have
all relevant components updated is increasingly difficult with their
number growing.

>>>>in fact I'd expect this to happen in the course of
>>>>domain destruction, and I'd expect the device reset to come after the
>>>>domain was cleaned up. Perhaps simply an ordering issue in the tool
>>>>stack?
>>> 
>>> I don't think reversing the sequences of device reset and domain
>>> destruction would be simple. Furthermore, during device hot-unplug,
>>> device reset is done when the owner is alive. So if we use domain
>>> destruction to enforce all irq binding cleared, in theory, it won't be
>>> applicable to hot-unplug case (if qemu's hot-unplug logic is
>>> compromised).
>>
>>Even in the hot-unplug case the tool stack could issue unbind
>>requests, behind the back of the possibly compromised qemu,
>>once neither the guest nor qemu have access to the device
>>anymore.
> 
> But currently, tool stack doesn't know the remaining IRQ bindings.
> If tool stack can maintaine IRQ binding information of a pass-thru
> device (stored in Xenstore?), we can come up with a clean solution
> without modifying linux kernel and Xen.

If there's no way for the tool stack to either find out the bindings
or "blindly" issue unbind requests (accepting them to fail), then a
"wildcard" unbind operation may want adding. Or, perhaps even
better, XEN_DOMCTL_deassign_device could unbind anything left
in place for the specified device. I actually wonder why that's not
already the case.

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2018-12-13  7:54               ` Jan Beulich
@ 2018-12-13 13:17                 ` Chao Gao
  0 siblings, 0 replies; 18+ messages in thread
From: Chao Gao @ 2018-12-13 13:17 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monne, Jia-Ju Bai, Stefano Stabellini, xen-devel,
	Boris Ostrovsky, Juergen Gross, linux-kernel

On Thu, Dec 13, 2018 at 12:54:52AM -0700, Jan Beulich wrote:
>>>> On 13.12.18 at 04:46, <chao.gao@intel.com> wrote:
>> On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich wrote:
>>>>>> On 12.12.18 at 16:18, <chao.gao@intel.com> wrote:
>>>> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
>>>>>>>> On 12.12.18 at 08:06, <chao.gao@intel.com> wrote:
>>>>>> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>>>>>>>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>>>>>>>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>>>>>>>>> I find some pass-thru devices don't work any more across guest reboot.
>>>>>>>>> Assigning it to another guest also meets the same issue. And the only
>>>>>>>>> way to make it work again is un-binding and binding it to pciback.
>>>>>>>>> Someone reported this issue one year ago [1]. More detail also can be
>>>>>>>>> found in [2].
>>>>>>>>>
>>>>>>>>> The root-cause is Xen's internal MSI-X state isn't reset properly
>>>>>>>>> during reboot or re-assignment. In the above case, Xen set maskall bit
>>>>>>>>> to mask all MSI interrupts after it detected a potential security
>>>>>>>>> issue. Even after device reset, Xen didn't reset its internal maskall
>>>>>>>>> bit. As a result, maskall bit would be set again in next write to
>>>>>>>>> MSI-X message control register.
>>>>>>>>>
>>>>>>>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>>>>>>>>> internal state of a device, we employ it to fix this issue rather than
>>>>>>>>> introducing another dedicated sub-hypercall.
>>>>>>>>>
>>>>>>>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>>>>>>>>> the device's msix and pirq has been created. This limitation prevents
>>>>>>>>> us calling this function when detaching a device from a guest during
>>>>>>>>> guest shutdown. Thus it is called right before calling
>>>>>>>>> PHYSDEVOPS_prepare_msix().
>>>>>>>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>>>>>>>> () at the end of the hypercall name since it's not a function.
>>>>>>>>
>>>>>>>> I'm also wondering why the release can't be done when the device is
>>>>>>>> detached from the guest (or the guest has been shut down). This makes
>>>>>>>> me worry about the raciness of the attach/detach procedure: if there's
>>>>>>>> a state where pciback assumes the device has been detached from the
>>>>>>>> guest, but there are still pirqs bound, an attempt to attach to
>>>>>>>> another guest in such state will fail.
>>>>>>>
>>>>>>>I wonder whether this additional reset functionality could be done out
>>>>>>>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>>>>>>>and then do the extra things that are not properly done there.
>>>>>> 
>>>>>> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
>>>>>> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
>>>>>> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
>>>>>> without error. But ATM, xen expects that no msi is bound to pirq when
>>>>>> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
>>>>>> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
>>>>>> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
>>>>>> at last minute, which happens after device reset in 
>>>>>> xen_pcibk_xenbus_remove().
>>>>>
>>>>>But that may need taking care of: I don't think it is a good idea to have
>>>>>anything left from the prior owning domain when the device gets reset.
>>>>>I.e. left over IRQ bindings should perhaps be forcibly cleared before
>>>>>invoking the reset;
>>>> 
>>>> Agree. How about pciback to track the established IRQ bindings? Then
>>>> pciback can clear irq binding before invoking the reset.
>>>
>>>How would pciback even know of those mappings, when it's qemu
>>>who establishes (and manages) them?
>> 
>> I meant to expose some interfaces from pciback. And pciback serves
>> as the proxy of IRQ (un)binding APIs.
>
>If at all possible we should avoid having to change more parties (qemu,
>libxc, kernel, hypervisor) than really necessary. Remember that such
>a bug fix may want backporting, and making sure affected people have
>all relevant components updated is increasingly difficult with their
>number growing.
>
>>>>>in fact I'd expect this to happen in the course of
>>>>>domain destruction, and I'd expect the device reset to come after the
>>>>>domain was cleaned up. Perhaps simply an ordering issue in the tool
>>>>>stack?
>>>> 
>>>> I don't think reversing the sequences of device reset and domain
>>>> destruction would be simple. Furthermore, during device hot-unplug,
>>>> device reset is done when the owner is alive. So if we use domain
>>>> destruction to enforce all irq binding cleared, in theory, it won't be
>>>> applicable to hot-unplug case (if qemu's hot-unplug logic is
>>>> compromised).
>>>
>>>Even in the hot-unplug case the tool stack could issue unbind
>>>requests, behind the back of the possibly compromised qemu,
>>>once neither the guest nor qemu have access to the device
>>>anymore.
>> 
>> But currently, tool stack doesn't know the remaining IRQ bindings.
>> If tool stack can maintaine IRQ binding information of a pass-thru
>> device (stored in Xenstore?), we can come up with a clean solution
>> without modifying linux kernel and Xen.
>
>If there's no way for the tool stack to either find out the bindings
>or "blindly" issue unbind requests (accepting them to fail), then a
>"wildcard" unbind operation may want adding. Or, perhaps even
>better, XEN_DOMCTL_deassign_device could unbind anything left
>in place for the specified device.

Good idea. I will take this advice.

Thanks
Chao

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2018-12-05  2:19 [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device Chao Gao
  2018-12-05  9:32 ` Roger Pau Monné
  2018-12-05 16:26 ` Jan Beulich
@ 2019-09-13 10:02 ` Spassov, Stanislav
  2019-09-13 15:28   ` Chao Gao
  2 siblings, 1 reply; 18+ messages in thread
From: Spassov, Stanislav @ 2019-09-13 10:02 UTC (permalink / raw)
  To: linux-kernel, chao.gao
  Cc: boris.ostrovsky, baijiaju1990, Woodhouse, David, sstabellini,
	jgross, roger.pau, jbeulich, xen-devel

On Thu, Dec 13, 2018 at 07:54, Chao Gao wrote:
>On Thu, Dec 13, 2018 at 12:54:52AM -0700, Jan Beulich wrote:
>>>>> On 13.12.18 at 04:46, <chao.gao@intel.com> wrote:
>>> On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich wrote:
>>>>>>> On 12.12.18 at 16:18, <chao.gao@intel.com> wrote:
>>>>> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
>>>>>>>>> On 12.12.18 at 08:06, <chao.gao@intel.com> wrote:
>>>>>>> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>>>>>>>>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>>>>>>>>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>>>>>>>>>> I find some pass-thru devices don't work any more across guest reboot.
>>>>>>>>>> Assigning it to another guest also meets the same issue. And the only
>>>>>>>>>> way to make it work again is un-binding and binding it to pciback.
>>>>>>>>>> Someone reported this issue one year ago [1]. More detail also can be
>>>>>>>>>> found in [2].
>>>>>>>>>>
>>>>>>>>>> The root-cause is Xen's internal MSI-X state isn't reset properly
>>>>>>>>>> during reboot or re-assignment. In the above case, Xen set maskall bit
>>>>>>>>>> to mask all MSI interrupts after it detected a potential security
>>>>>>>>>> issue. Even after device reset, Xen didn't reset its internal maskall
>>>>>>>>>> bit. As a result, maskall bit would be set again in next write to
>>>>>>>>>> MSI-X message control register.
>>>>>>>>>>
>>>>>>>>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>>>>>>>>>> internal state of a device, we employ it to fix this issue rather than
>>>>>>>>>> introducing another dedicated sub-hypercall.
>>>>>>>>>>
>>>>>>>>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>>>>>>>>>> the device's msix and pirq has been created. This limitation prevents
>>>>>>>>>> us calling this function when detaching a device from a guest during
>>>>>>>>>> guest shutdown. Thus it is called right before calling
>>>>>>>>>> PHYSDEVOPS_prepare_msix().
>>>>>>>>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>>>>>>>>> () at the end of the hypercall name since it's not a function.
>>>>>>>>>
>>>>>>>>> I'm also wondering why the release can't be done when the device is
>>>>>>>>> detached from the guest (or the guest has been shut down). This makes
>>>>>>>>> me worry about the raciness of the attach/detach procedure: if there's
>>>>>>>>> a state where pciback assumes the device has been detached from the
>>>>>>>>> guest, but there are still pirqs bound, an attempt to attach to
>>>>>>>>> another guest in such state will fail.
>>>>>>>>
>>>>>>>>I wonder whether this additional reset functionality could be done out
>>>>>>>>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>>>>>>>>and then do the extra things that are not properly done there.
>>>>>>> 
>>>>>>> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
>>>>>>> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
>>>>>>> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
>>>>>>> without error. But ATM, xen expects that no msi is bound to pirq when
>>>>>>> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
>>>>>>> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
>>>>>>> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
>>>>>>> at last minute, which happens after device reset in 
>>>>>>> xen_pcibk_xenbus_remove().
>>>>>>
>>>>>>But that may need taking care of: I don't think it is a good idea to have
>>>>>>anything left from the prior owning domain when the device gets reset.
>>>>>>I.e. left over IRQ bindings should perhaps be forcibly cleared before
>>>>>>invoking the reset;
>>>>> 
>>>>> Agree. How about pciback to track the established IRQ bindings? Then
>>>>> pciback can clear irq binding before invoking the reset.
>>>>
>>>>How would pciback even know of those mappings, when it's qemu
>>>>who establishes (and manages) them?
>>> 
>>> I meant to expose some interfaces from pciback. And pciback serves
>>> as the proxy of IRQ (un)binding APIs.
>>
>>If at all possible we should avoid having to change more parties (qemu,
>>libxc, kernel, hypervisor) than really necessary. Remember that such
>>a bug fix may want backporting, and making sure affected people have
>>all relevant components updated is increasingly difficult with their
>>number growing.
>>
>>>>>>in fact I'd expect this to happen in the course of
>>>>>>domain destruction, and I'd expect the device reset to come after the
>>>>>>domain was cleaned up. Perhaps simply an ordering issue in the tool
>>>>>>stack?
>>>>> 
>>>>> I don't think reversing the sequences of device reset and domain
>>>>> destruction would be simple. Furthermore, during device hot-unplug,
>>>>> device reset is done when the owner is alive. So if we use domain
>>>>> destruction to enforce all irq binding cleared, in theory, it won't be
>>>>> applicable to hot-unplug case (if qemu's hot-unplug logic is
>>>>> compromised).
>>>>
>>>>Even in the hot-unplug case the tool stack could issue unbind
>>>>requests, behind the back of the possibly compromised qemu,
>>>>once neither the guest nor qemu have access to the device
>>>>anymore.
>>> 
>>> But currently, tool stack doesn't know the remaining IRQ bindings.
>>> If tool stack can maintaine IRQ binding information of a pass-thru
>>> device (stored in Xenstore?), we can come up with a clean solution
>>> without modifying linux kernel and Xen.
>>
>>If there's no way for the tool stack to either find out the bindings
>>or "blindly" issue unbind requests (accepting them to fail), then a
>>"wildcard" unbind operation may want adding. Or, perhaps even
>>better, XEN_DOMCTL_deassign_device could unbind anything left
>>in place for the specified device.
>
>Good idea. I will take this advice.
>
>Thanks
>Chao

I am having the same issue, and cannot find a fix in either xen-pciback or the Xen codebase.
Was a solution ever pushed as a result of this thread?

Best,
Stanislav



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Ralf Herbrich
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2019-09-13 10:02 ` Spassov, Stanislav
@ 2019-09-13 15:28   ` Chao Gao
  2019-09-26 10:13     ` [Xen-devel] " Pasi Kärkkäinen
  0 siblings, 1 reply; 18+ messages in thread
From: Chao Gao @ 2019-09-13 15:28 UTC (permalink / raw)
  To: Spassov, Stanislav
  Cc: linux-kernel, boris.ostrovsky, baijiaju1990, Woodhouse, David,
	sstabellini, jgross, roger.pau, jbeulich, xen-devel

On Fri, Sep 13, 2019 at 10:02:24AM +0000, Spassov, Stanislav wrote:
>On Thu, Dec 13, 2018 at 07:54, Chao Gao wrote:
>>On Thu, Dec 13, 2018 at 12:54:52AM -0700, Jan Beulich wrote:
>>>>>> On 13.12.18 at 04:46, <chao.gao@intel.com> wrote:
>>>> On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich wrote:
>>>>>>>> On 12.12.18 at 16:18, <chao.gao@intel.com> wrote:
>>>>>> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
>>>>>>>>>> On 12.12.18 at 08:06, <chao.gao@intel.com> wrote:
>>>>>>>> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>>>>>>>>>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>>>>>>>>>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>>>>>>>>>>> I find some pass-thru devices don't work any more across guest reboot.
>>>>>>>>>>> Assigning it to another guest also meets the same issue. And the only
>>>>>>>>>>> way to make it work again is un-binding and binding it to pciback.
>>>>>>>>>>> Someone reported this issue one year ago [1]. More detail also can be
>>>>>>>>>>> found in [2].
>>>>>>>>>>>
>>>>>>>>>>> The root-cause is Xen's internal MSI-X state isn't reset properly
>>>>>>>>>>> during reboot or re-assignment. In the above case, Xen set maskall bit
>>>>>>>>>>> to mask all MSI interrupts after it detected a potential security
>>>>>>>>>>> issue. Even after device reset, Xen didn't reset its internal maskall
>>>>>>>>>>> bit. As a result, maskall bit would be set again in next write to
>>>>>>>>>>> MSI-X message control register.
>>>>>>>>>>>
>>>>>>>>>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>>>>>>>>>>> internal state of a device, we employ it to fix this issue rather than
>>>>>>>>>>> introducing another dedicated sub-hypercall.
>>>>>>>>>>>
>>>>>>>>>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>>>>>>>>>>> the device's msix and pirq has been created. This limitation prevents
>>>>>>>>>>> us calling this function when detaching a device from a guest during
>>>>>>>>>>> guest shutdown. Thus it is called right before calling
>>>>>>>>>>> PHYSDEVOPS_prepare_msix().
>>>>>>>>>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>>>>>>>>>> () at the end of the hypercall name since it's not a function.
>>>>>>>>>>
>>>>>>>>>> I'm also wondering why the release can't be done when the device is
>>>>>>>>>> detached from the guest (or the guest has been shut down). This makes
>>>>>>>>>> me worry about the raciness of the attach/detach procedure: if there's
>>>>>>>>>> a state where pciback assumes the device has been detached from the
>>>>>>>>>> guest, but there are still pirqs bound, an attempt to attach to
>>>>>>>>>> another guest in such state will fail.
>>>>>>>>>
>>>>>>>>>I wonder whether this additional reset functionality could be done out
>>>>>>>>>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>>>>>>>>>and then do the extra things that are not properly done there.
>>>>>>>> 
>>>>>>>> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
>>>>>>>> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
>>>>>>>> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
>>>>>>>> without error. But ATM, xen expects that no msi is bound to pirq when
>>>>>>>> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
>>>>>>>> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
>>>>>>>> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
>>>>>>>> at last minute, which happens after device reset in 
>>>>>>>> xen_pcibk_xenbus_remove().
>>>>>>>
>>>>>>>But that may need taking care of: I don't think it is a good idea to have
>>>>>>>anything left from the prior owning domain when the device gets reset.
>>>>>>>I.e. left over IRQ bindings should perhaps be forcibly cleared before
>>>>>>>invoking the reset;
>>>>>> 
>>>>>> Agree. How about pciback to track the established IRQ bindings? Then
>>>>>> pciback can clear irq binding before invoking the reset.
>>>>>
>>>>>How would pciback even know of those mappings, when it's qemu
>>>>>who establishes (and manages) them?
>>>> 
>>>> I meant to expose some interfaces from pciback. And pciback serves
>>>> as the proxy of IRQ (un)binding APIs.
>>>
>>>If at all possible we should avoid having to change more parties (qemu,
>>>libxc, kernel, hypervisor) than really necessary. Remember that such
>>>a bug fix may want backporting, and making sure affected people have
>>>all relevant components updated is increasingly difficult with their
>>>number growing.
>>>
>>>>>>>in fact I'd expect this to happen in the course of
>>>>>>>domain destruction, and I'd expect the device reset to come after the
>>>>>>>domain was cleaned up. Perhaps simply an ordering issue in the tool
>>>>>>>stack?
>>>>>> 
>>>>>> I don't think reversing the sequences of device reset and domain
>>>>>> destruction would be simple. Furthermore, during device hot-unplug,
>>>>>> device reset is done when the owner is alive. So if we use domain
>>>>>> destruction to enforce all irq binding cleared, in theory, it won't be
>>>>>> applicable to hot-unplug case (if qemu's hot-unplug logic is
>>>>>> compromised).
>>>>>
>>>>>Even in the hot-unplug case the tool stack could issue unbind
>>>>>requests, behind the back of the possibly compromised qemu,
>>>>>once neither the guest nor qemu have access to the device
>>>>>anymore.
>>>> 
>>>> But currently, tool stack doesn't know the remaining IRQ bindings.
>>>> If tool stack can maintaine IRQ binding information of a pass-thru
>>>> device (stored in Xenstore?), we can come up with a clean solution
>>>> without modifying linux kernel and Xen.
>>>
>>>If there's no way for the tool stack to either find out the bindings
>>>or "blindly" issue unbind requests (accepting them to fail), then a
>>>"wildcard" unbind operation may want adding. Or, perhaps even
>>>better, XEN_DOMCTL_deassign_device could unbind anything left
>>>in place for the specified device.
>>
>>Good idea. I will take this advice.
>>
>>Thanks
>>Chao
>
>I am having the same issue, and cannot find a fix in either xen-pciback or the Xen codebase.
>Was a solution ever pushed as a result of this thread?
>

I submitted patches [1] to Xen community. But I didn't get it merged.
We made a change in device driver to disable MSI-X during guest OS
shutdown to mitigate the issue. But when guest or qemu was crashed, we
encountered this issue again. I have no plan to get back to these
patches. But if you want to fix the issue completely along what the
patches below did, please go ahead.

[1]: https://lists.xenproject.org/archives/html/xen-devel/2019-01/msg01227.html

Thanks
Chao

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Xen-devel] [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2019-09-13 15:28   ` Chao Gao
@ 2019-09-26 10:13     ` Pasi Kärkkäinen
  2019-09-26 10:54       ` Spassov, Stanislav
  2020-01-17 18:57       ` Rich Persaud
  0 siblings, 2 replies; 18+ messages in thread
From: Pasi Kärkkäinen @ 2019-09-26 10:13 UTC (permalink / raw)
  To: Spassov, Stanislav
  Cc: Chao Gao, jgross, sstabellini, linux-kernel, baijiaju1990,
	jbeulich, xen-devel, boris.ostrovsky, Woodhouse, David,
	roger.pau

Hello Stanislav,

On Fri, Sep 13, 2019 at 11:28:20PM +0800, Chao Gao wrote:
> On Fri, Sep 13, 2019 at 10:02:24AM +0000, Spassov, Stanislav wrote:
> >On Thu, Dec 13, 2018 at 07:54, Chao Gao wrote:
> >>On Thu, Dec 13, 2018 at 12:54:52AM -0700, Jan Beulich wrote:
> >>>>>> On 13.12.18 at 04:46, <chao.gao@intel.com> wrote:
> >>>> On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich wrote:
> >>>>>>>> On 12.12.18 at 16:18, <chao.gao@intel.com> wrote:
> >>>>>> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
> >>>>>>>>>> On 12.12.18 at 08:06, <chao.gao@intel.com> wrote:
> >>>>>>>> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
> >>>>>>>>>On 12/5/18 4:32 AM, Roger Pau Monné wrote:
> >>>>>>>>>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
> >>>>>>>>>>> I find some pass-thru devices don't work any more across guest reboot.
> >>>>>>>>>>> Assigning it to another guest also meets the same issue. And the only
> >>>>>>>>>>> way to make it work again is un-binding and binding it to pciback.
> >>>>>>>>>>> Someone reported this issue one year ago [1]. More detail also can be
> >>>>>>>>>>> found in [2].
> >>>>>>>>>>>
> >>>>>>>>>>> The root-cause is Xen's internal MSI-X state isn't reset properly
> >>>>>>>>>>> during reboot or re-assignment. In the above case, Xen set maskall bit
> >>>>>>>>>>> to mask all MSI interrupts after it detected a potential security
> >>>>>>>>>>> issue. Even after device reset, Xen didn't reset its internal maskall
> >>>>>>>>>>> bit. As a result, maskall bit would be set again in next write to
> >>>>>>>>>>> MSI-X message control register.
> >>>>>>>>>>>
> >>>>>>>>>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
> >>>>>>>>>>> internal state of a device, we employ it to fix this issue rather than
> >>>>>>>>>>> introducing another dedicated sub-hypercall.
> >>>>>>>>>>>
> >>>>>>>>>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
> >>>>>>>>>>> the device's msix and pirq has been created. This limitation prevents
> >>>>>>>>>>> us calling this function when detaching a device from a guest during
> >>>>>>>>>>> guest shutdown. Thus it is called right before calling
> >>>>>>>>>>> PHYSDEVOPS_prepare_msix().
> >>>>>>>>>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
> >>>>>>>>>> () at the end of the hypercall name since it's not a function.
> >>>>>>>>>>
> >>>>>>>>>> I'm also wondering why the release can't be done when the device is
> >>>>>>>>>> detached from the guest (or the guest has been shut down). This makes
> >>>>>>>>>> me worry about the raciness of the attach/detach procedure: if there's
> >>>>>>>>>> a state where pciback assumes the device has been detached from the
> >>>>>>>>>> guest, but there are still pirqs bound, an attempt to attach to
> >>>>>>>>>> another guest in such state will fail.
> >>>>>>>>>
> >>>>>>>>>I wonder whether this additional reset functionality could be done out
> >>>>>>>>>of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
> >>>>>>>>>and then do the extra things that are not properly done there.
> >>>>>>>> 
> >>>>>>>> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
> >>>>>>>> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
> >>>>>>>> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
> >>>>>>>> without error. But ATM, xen expects that no msi is bound to pirq when
> >>>>>>>> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
> >>>>>>>> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
> >>>>>>>> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
> >>>>>>>> at last minute, which happens after device reset in 
> >>>>>>>> xen_pcibk_xenbus_remove().
> >>>>>>>
> >>>>>>>But that may need taking care of: I don't think it is a good idea to have
> >>>>>>>anything left from the prior owning domain when the device gets reset.
> >>>>>>>I.e. left over IRQ bindings should perhaps be forcibly cleared before
> >>>>>>>invoking the reset;
> >>>>>> 
> >>>>>> Agree. How about pciback to track the established IRQ bindings? Then
> >>>>>> pciback can clear irq binding before invoking the reset.
> >>>>>
> >>>>>How would pciback even know of those mappings, when it's qemu
> >>>>>who establishes (and manages) them?
> >>>> 
> >>>> I meant to expose some interfaces from pciback. And pciback serves
> >>>> as the proxy of IRQ (un)binding APIs.
> >>>
> >>>If at all possible we should avoid having to change more parties (qemu,
> >>>libxc, kernel, hypervisor) than really necessary. Remember that such
> >>>a bug fix may want backporting, and making sure affected people have
> >>>all relevant components updated is increasingly difficult with their
> >>>number growing.
> >>>
> >>>>>>>in fact I'd expect this to happen in the course of
> >>>>>>>domain destruction, and I'd expect the device reset to come after the
> >>>>>>>domain was cleaned up. Perhaps simply an ordering issue in the tool
> >>>>>>>stack?
> >>>>>> 
> >>>>>> I don't think reversing the sequences of device reset and domain
> >>>>>> destruction would be simple. Furthermore, during device hot-unplug,
> >>>>>> device reset is done when the owner is alive. So if we use domain
> >>>>>> destruction to enforce all irq binding cleared, in theory, it won't be
> >>>>>> applicable to hot-unplug case (if qemu's hot-unplug logic is
> >>>>>> compromised).
> >>>>>
> >>>>>Even in the hot-unplug case the tool stack could issue unbind
> >>>>>requests, behind the back of the possibly compromised qemu,
> >>>>>once neither the guest nor qemu have access to the device
> >>>>>anymore.
> >>>> 
> >>>> But currently, tool stack doesn't know the remaining IRQ bindings.
> >>>> If tool stack can maintaine IRQ binding information of a pass-thru
> >>>> device (stored in Xenstore?), we can come up with a clean solution
> >>>> without modifying linux kernel and Xen.
> >>>
> >>>If there's no way for the tool stack to either find out the bindings
> >>>or "blindly" issue unbind requests (accepting them to fail), then a
> >>>"wildcard" unbind operation may want adding. Or, perhaps even
> >>>better, XEN_DOMCTL_deassign_device could unbind anything left
> >>>in place for the specified device.
> >>
> >>Good idea. I will take this advice.
> >>
> >>Thanks
> >>Chao
> >
> >I am having the same issue, and cannot find a fix in either xen-pciback or the Xen codebase.
> >Was a solution ever pushed as a result of this thread?
> >
> 
> I submitted patches [1] to Xen community. But I didn't get it merged.
> We made a change in device driver to disable MSI-X during guest OS
> shutdown to mitigate the issue. But when guest or qemu was crashed, we
> encountered this issue again. I have no plan to get back to these
> patches. But if you want to fix the issue completely along what the
> patches below did, please go ahead.
> 
> [1]: https://lists.xenproject.org/archives/html/xen-devel/2019-01/msg01227.html
> 
> Thanks
> Chao
> 

Stanislav: Are you able to continue the work with these patches, to get them merged? 


Thanks,

-- Pasi


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Xen-devel] [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2019-09-26 10:13     ` [Xen-devel] " Pasi Kärkkäinen
@ 2019-09-26 10:54       ` Spassov, Stanislav
  2020-01-17 18:57       ` Rich Persaud
  1 sibling, 0 replies; 18+ messages in thread
From: Spassov, Stanislav @ 2019-09-26 10:54 UTC (permalink / raw)
  To: pasik
  Cc: boris.ostrovsky, linux-kernel, baijiaju1990, Woodhouse, David,
	sstabellini, jgross, roger.pau, chao.gao, jbeulich, xen-devel

Hello Pasi,

Unfortunately, I am not able to continue the work on the Xen patches in
the foreseeable future.

For what it's worth: the xen-pciback workaround from this thread solves
my current issue as confirmed by internal testing.

-- Stanislav

(apologies for ugly footer injected below by company SMTP server
due to local laws)

On Thu, 2019-09-26 at 13:13 +0300, Pasi Kärkkäinen wrote:
> Hello Stanislav,
> 
> On Fri, Sep 13, 2019 at 11:28:20PM +0800, Chao Gao wrote:
> > On Fri, Sep 13, 2019 at 10:02:24AM +0000, Spassov, Stanislav wrote:
> > > On Thu, Dec 13, 2018 at 07:54, Chao Gao wrote:
> > > > On Thu, Dec 13, 2018 at 12:54:52AM -0700, Jan Beulich wrote:
> > > > > > > > On 13.12.18 at 04:46, <chao.gao@intel.com> wrote:
> > > > > > 
> > > > > > On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich
> > > > > > wrote:
> > > > > > > > > > On 12.12.18 at 16:18, <chao.gao@intel.com> wrote:
> > > > > > > > 
> > > > > > > > On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich
> > > > > > > > wrote:
> > > > > > > > > > > > On 12.12.18 at 08:06, <chao.gao@intel.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > 
> > > > > > > > > > On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris
> > > > > > > > > > Ostrovsky wrote:
> > > > > > > > > > > On 12/5/18 4:32 AM, Roger Pau Monné wrote:
> > > > > > > > > > > > On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao
> > > > > > > > > > > > Gao wrote:
> > > > > > > > > > > > > I find some pass-thru devices don't work any
> > > > > > > > > > > > > more across guest reboot.
> > > > > > > > > > > > > Assigning it to another guest also meets the
> > > > > > > > > > > > > same issue. And the only
> > > > > > > > > > > > > way to make it work again is un-binding and
> > > > > > > > > > > > > binding it to pciback.
> > > > > > > > > > > > > Someone reported this issue one year ago [1].
> > > > > > > > > > > > > More detail also can be
> > > > > > > > > > > > > found in [2].
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The root-cause is Xen's internal MSI-X state
> > > > > > > > > > > > > isn't reset properly
> > > > > > > > > > > > > during reboot or re-assignment. In the above
> > > > > > > > > > > > > case, Xen set maskall bit
> > > > > > > > > > > > > to mask all MSI interrupts after it detected
> > > > > > > > > > > > > a potential security
> > > > > > > > > > > > > issue. Even after device reset, Xen didn't
> > > > > > > > > > > > > reset its internal maskall
> > > > > > > > > > > > > bit. As a result, maskall bit would be set
> > > > > > > > > > > > > again in next write to
> > > > > > > > > > > > > MSI-X message control register.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Given that PHYSDEVOPS_prepare_msix() also
> > > > > > > > > > > > > triggers Xen resetting MSI-X
> > > > > > > > > > > > > internal state of a device, we employ it to
> > > > > > > > > > > > > fix this issue rather than
> > > > > > > > > > > > > introducing another dedicated sub-hypercall.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Note that PHYSDEVOPS_release_msix() will fail
> > > > > > > > > > > > > if the mapping between
> > > > > > > > > > > > > the device's msix and pirq has been created.
> > > > > > > > > > > > > This limitation prevents
> > > > > > > > > > > > > us calling this function when detaching a
> > > > > > > > > > > > > device from a guest during
> > > > > > > > > > > > > guest shutdown. Thus it is called right
> > > > > > > > > > > > > before calling
> > > > > > > > > > > > > PHYSDEVOPS_prepare_msix().
> > > > > > > > > > > > 
> > > > > > > > > > > > s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then
> > > > > > > > > > > > I would also drop the
> > > > > > > > > > > > () at the end of the hypercall name since it's
> > > > > > > > > > > > not a function.
> > > > > > > > > > > > 
> > > > > > > > > > > > I'm also wondering why the release can't be
> > > > > > > > > > > > done when the device is
> > > > > > > > > > > > detached from the guest (or the guest has been
> > > > > > > > > > > > shut down). This makes
> > > > > > > > > > > > me worry about the raciness of the
> > > > > > > > > > > > attach/detach procedure: if there's
> > > > > > > > > > > > a state where pciback assumes the device has
> > > > > > > > > > > > been detached from the
> > > > > > > > > > > > guest, but there are still pirqs bound, an
> > > > > > > > > > > > attempt to attach to
> > > > > > > > > > > > another guest in such state will fail.
> > > > > > > > > > > 
> > > > > > > > > > > I wonder whether this additional reset
> > > > > > > > > > > functionality could be done out
> > > > > > > > > > > of xen_pcibk_xenbus_remove(). We first do a (best
> > > > > > > > > > > effort) device reset
> > > > > > > > > > > and then do the extra things that are not
> > > > > > > > > > > properly done there.
> > > > > > > > > > 
> > > > > > > > > > No. It cannot be done in xen_pcibk_xenbus_remove()
> > > > > > > > > > without modifying
> > > > > > > > > > the handler of PHYSDEVOP_release_msix. To do a
> > > > > > > > > > successful Xen internal
> > > > > > > > > > MSI-X state reset, PHYSDEVOP_{release,
> > > > > > > > > > prepare}_msix should be finished
> > > > > > > > > > without error. But ATM, xen expects that no msi is
> > > > > > > > > > bound to pirq when
> > > > > > > > > > doing PHYSDEVOP_release_msix. Otherwise it fails
> > > > > > > > > > with error code -EBUSY.
> > > > > > > > > > However, the expectation isn't guaranteed in
> > > > > > > > > > xen_pcibk_xenbus_remove().
> > > > > > > > > > In some cases, if qemu fails to unmap MSIs, MSIs
> > > > > > > > > > are unmapped by Xen
> > > > > > > > > > at last minute, which happens after device reset
> > > > > > > > > > in 
> > > > > > > > > > xen_pcibk_xenbus_remove().
> > > > > > > > > 
> > > > > > > > > But that may need taking care of: I don't think it is
> > > > > > > > > a good idea to have
> > > > > > > > > anything left from the prior owning domain when the
> > > > > > > > > device gets reset.
> > > > > > > > > I.e. left over IRQ bindings should perhaps be
> > > > > > > > > forcibly cleared before
> > > > > > > > > invoking the reset;
> > > > > > > > 
> > > > > > > > Agree. How about pciback to track the established IRQ
> > > > > > > > bindings? Then
> > > > > > > > pciback can clear irq binding before invoking the
> > > > > > > > reset.
> > > > > > > 
> > > > > > > How would pciback even know of those mappings, when it's
> > > > > > > qemu
> > > > > > > who establishes (and manages) them?
> > > > > > 
> > > > > > I meant to expose some interfaces from pciback. And pciback
> > > > > > serves
> > > > > > as the proxy of IRQ (un)binding APIs.
> > > > > 
> > > > > If at all possible we should avoid having to change more
> > > > > parties (qemu,
> > > > > libxc, kernel, hypervisor) than really necessary. Remember
> > > > > that such
> > > > > a bug fix may want backporting, and making sure affected
> > > > > people have
> > > > > all relevant components updated is increasingly difficult
> > > > > with their
> > > > > number growing.
> > > > > 
> > > > > > > > > in fact I'd expect this to happen in the course of
> > > > > > > > > domain destruction, and I'd expect the device reset
> > > > > > > > > to come after the
> > > > > > > > > domain was cleaned up. Perhaps simply an ordering
> > > > > > > > > issue in the tool
> > > > > > > > > stack?
> > > > > > > > 
> > > > > > > > I don't think reversing the sequences of device reset
> > > > > > > > and domain
> > > > > > > > destruction would be simple. Furthermore, during device
> > > > > > > > hot-unplug,
> > > > > > > > device reset is done when the owner is alive. So if we
> > > > > > > > use domain
> > > > > > > > destruction to enforce all irq binding cleared, in
> > > > > > > > theory, it won't be
> > > > > > > > applicable to hot-unplug case (if qemu's hot-unplug
> > > > > > > > logic is
> > > > > > > > compromised).
> > > > > > > 
> > > > > > > Even in the hot-unplug case the tool stack could issue
> > > > > > > unbind
> > > > > > > requests, behind the back of the possibly compromised
> > > > > > > qemu,
> > > > > > > once neither the guest nor qemu have access to the device
> > > > > > > anymore.
> > > > > > 
> > > > > > But currently, tool stack doesn't know the remaining IRQ
> > > > > > bindings.
> > > > > > If tool stack can maintaine IRQ binding information of a
> > > > > > pass-thru
> > > > > > device (stored in Xenstore?), we can come up with a clean
> > > > > > solution
> > > > > > without modifying linux kernel and Xen.
> > > > > 
> > > > > If there's no way for the tool stack to either find out the
> > > > > bindings
> > > > > or "blindly" issue unbind requests (accepting them to fail),
> > > > > then a
> > > > > "wildcard" unbind operation may want adding. Or, perhaps even
> > > > > better, XEN_DOMCTL_deassign_device could unbind anything left
> > > > > in place for the specified device.
> > > > 
> > > > Good idea. I will take this advice.
> > > > 
> > > > Thanks
> > > > Chao
> > > 
> > > I am having the same issue, and cannot find a fix in either xen-
> > > pciback or the Xen codebase.
> > > Was a solution ever pushed as a result of this thread?
> > > 
> > 
> > I submitted patches [1] to Xen community. But I didn't get it
> > merged.
> > We made a change in device driver to disable MSI-X during guest OS
> > shutdown to mitigate the issue. But when guest or qemu was crashed,
> > we
> > encountered this issue again. I have no plan to get back to these
> > patches. But if you want to fix the issue completely along what the
> > patches below did, please go ahead.
> > 
> > [1]: 
> > https://lists.xenproject.org/archives/html/xen-devel/2019-01/msg01227.html
> > 
> > Thanks
> > Chao
> > 
> 
> Stanislav: Are you able to continue the work with these patches, to
> get them merged? 
> 
> 
> Thanks,
> 
> -- Pasi
> 



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Ralf Herbrich
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Xen-devel] [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2019-09-26 10:13     ` [Xen-devel] " Pasi Kärkkäinen
  2019-09-26 10:54       ` Spassov, Stanislav
@ 2020-01-17 18:57       ` Rich Persaud
  2020-01-18  1:13         ` Chao Gao
  1 sibling, 1 reply; 18+ messages in thread
From: Rich Persaud @ 2020-01-17 18:57 UTC (permalink / raw)
  To: Pasi Kärkkäinen
  Cc: Spassov, Stanislav, jgross, sstabellini, linux-kernel,
	baijiaju1990, jbeulich, xen-devel, boris.ostrovsky, roger.pau,
	Woodhouse, David, Chao Gao, Marek Marczykowski-Górecki,
	Jason Andryuk

On Sep 26, 2019, at 06:17, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> 
> Hello Stanislav,
> 
>> On Fri, Sep 13, 2019 at 11:28:20PM +0800, Chao Gao wrote:
>>> On Fri, Sep 13, 2019 at 10:02:24AM +0000, Spassov, Stanislav wrote:
>>> On Thu, Dec 13, 2018 at 07:54, Chao Gao wrote:
>>>> On Thu, Dec 13, 2018 at 12:54:52AM -0700, Jan Beulich wrote:
>>>>>>>> On 13.12.18 at 04:46, <chao.gao@intel.com> wrote:
>>>>>> On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich wrote:
>>>>>>>>>> On 12.12.18 at 16:18, <chao.gao@intel.com> wrote:
>>>>>>>> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
>>>>>>>>>>>> On 12.12.18 at 08:06, <chao.gao@intel.com> wrote:
>>>>>>>>>> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>>>>>>>>>>> On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>>>>>>>>>>>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>>>>>>>>>>>>> I find some pass-thru devices don't work any more across guest reboot.
>>>>>>>>>>>>> Assigning it to another guest also meets the same issue. And the only
>>>>>>>>>>>>> way to make it work again is un-binding and binding it to pciback.
>>>>>>>>>>>>> Someone reported this issue one year ago [1]. More detail also can be
>>>>>>>>>>>>> found in [2].
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The root-cause is Xen's internal MSI-X state isn't reset properly
>>>>>>>>>>>>> during reboot or re-assignment. In the above case, Xen set maskall bit
>>>>>>>>>>>>> to mask all MSI interrupts after it detected a potential security
>>>>>>>>>>>>> issue. Even after device reset, Xen didn't reset its internal maskall
>>>>>>>>>>>>> bit. As a result, maskall bit would be set again in next write to
>>>>>>>>>>>>> MSI-X message control register.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>>>>>>>>>>>>> internal state of a device, we employ it to fix this issue rather than
>>>>>>>>>>>>> introducing another dedicated sub-hypercall.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>>>>>>>>>>>>> the device's msix and pirq has been created. This limitation prevents
>>>>>>>>>>>>> us calling this function when detaching a device from a guest during
>>>>>>>>>>>>> guest shutdown. Thus it is called right before calling
>>>>>>>>>>>>> PHYSDEVOPS_prepare_msix().
>>>>>>>>>>>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>>>>>>>>>>>> () at the end of the hypercall name since it's not a function.
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm also wondering why the release can't be done when the device is
>>>>>>>>>>>> detached from the guest (or the guest has been shut down). This makes
>>>>>>>>>>>> me worry about the raciness of the attach/detach procedure: if there's
>>>>>>>>>>>> a state where pciback assumes the device has been detached from the
>>>>>>>>>>>> guest, but there are still pirqs bound, an attempt to attach to
>>>>>>>>>>>> another guest in such state will fail.
>>>>>>>>>>> 
>>>>>>>>>>> I wonder whether this additional reset functionality could be done out
>>>>>>>>>>> of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>>>>>>>>>>> and then do the extra things that are not properly done there.
>>>>>>>>>> 
>>>>>>>>>> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
>>>>>>>>>> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
>>>>>>>>>> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
>>>>>>>>>> without error. But ATM, xen expects that no msi is bound to pirq when
>>>>>>>>>> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
>>>>>>>>>> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
>>>>>>>>>> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
>>>>>>>>>> at last minute, which happens after device reset in 
>>>>>>>>>> xen_pcibk_xenbus_remove().
>>>>>>>>> 
>>>>>>>>> But that may need taking care of: I don't think it is a good idea to have
>>>>>>>>> anything left from the prior owning domain when the device gets reset.
>>>>>>>>> I.e. left over IRQ bindings should perhaps be forcibly cleared before
>>>>>>>>> invoking the reset;
>>>>>>>> 
>>>>>>>> Agree. How about pciback to track the established IRQ bindings? Then
>>>>>>>> pciback can clear irq binding before invoking the reset.
>>>>>>> 
>>>>>>> How would pciback even know of those mappings, when it's qemu
>>>>>>> who establishes (and manages) them?
>>>>>> 
>>>>>> I meant to expose some interfaces from pciback. And pciback serves
>>>>>> as the proxy of IRQ (un)binding APIs.
>>>>> 
>>>>> If at all possible we should avoid having to change more parties (qemu,
>>>>> libxc, kernel, hypervisor) than really necessary. Remember that such
>>>>> a bug fix may want backporting, and making sure affected people have
>>>>> all relevant components updated is increasingly difficult with their
>>>>> number growing.
>>>>> 
>>>>>>>>> in fact I'd expect this to happen in the course of
>>>>>>>>> domain destruction, and I'd expect the device reset to come after the
>>>>>>>>> domain was cleaned up. Perhaps simply an ordering issue in the tool
>>>>>>>>> stack?
>>>>>>>> 
>>>>>>>> I don't think reversing the sequences of device reset and domain
>>>>>>>> destruction would be simple. Furthermore, during device hot-unplug,
>>>>>>>> device reset is done when the owner is alive. So if we use domain
>>>>>>>> destruction to enforce all irq binding cleared, in theory, it won't be
>>>>>>>> applicable to hot-unplug case (if qemu's hot-unplug logic is
>>>>>>>> compromised).
>>>>>>> 
>>>>>>> Even in the hot-unplug case the tool stack could issue unbind
>>>>>>> requests, behind the back of the possibly compromised qemu,
>>>>>>> once neither the guest nor qemu have access to the device
>>>>>>> anymore.
>>>>>> 
>>>>>> But currently, tool stack doesn't know the remaining IRQ bindings.
>>>>>> If tool stack can maintaine IRQ binding information of a pass-thru
>>>>>> device (stored in Xenstore?), we can come up with a clean solution
>>>>>> without modifying linux kernel and Xen.
>>>>> 
>>>>> If there's no way for the tool stack to either find out the bindings
>>>>> or "blindly" issue unbind requests (accepting them to fail), then a
>>>>> "wildcard" unbind operation may want adding. Or, perhaps even
>>>>> better, XEN_DOMCTL_deassign_device could unbind anything left
>>>>> in place for the specified device.
>>>> 
>>>> Good idea. I will take this advice.
>>>> 
>>>> Thanks
>>>> Chao
>>> 
>>> I am having the same issue, and cannot find a fix in either xen-pciback or the Xen codebase.
>>> Was a solution ever pushed as a result of this thread?
>>> 
>> 
>> I submitted patches [1] to Xen community. But I didn't get it merged.
>> We made a change in device driver to disable MSI-X during guest OS
>> shutdown to mitigate the issue. But when guest or qemu was crashed, we
>> encountered this issue again. I have no plan to get back to these
>> patches. But if you want to fix the issue completely along what the
>> patches below did, please go ahead.
>> 
>> [1]: https://lists.xenproject.org/archives/html/xen-devel/2019-01/msg01227.html
>> 
>> Thanks
>> Chao
>> 
> 
> Stanislav: Are you able to continue the work with these patches, to get them merged? 

What further work is needed for these patches?  Are they only needed for Intel i210 NIC PCI passthrough, or are other devices affected?

Rich


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Xen-devel] [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device
  2020-01-17 18:57       ` Rich Persaud
@ 2020-01-18  1:13         ` Chao Gao
  0 siblings, 0 replies; 18+ messages in thread
From: Chao Gao @ 2020-01-18  1:13 UTC (permalink / raw)
  To: Rich Persaud
  Cc: Pasi Kärkkäinen, Spassov, Stanislav, jgross,
	sstabellini, linux-kernel, baijiaju1990, jbeulich, xen-devel,
	boris.ostrovsky, roger.pau, Woodhouse, David,
	Marek Marczykowski-Górecki, Jason Andryuk

On Fri, Jan 17, 2020 at 01:57:43PM -0500, Rich Persaud wrote:
>On Sep 26, 2019, at 06:17, Pasi Kärkkäinen <pasik@iki.fi> wrote:
>> 
>> Hello Stanislav,
>> 
>>> On Fri, Sep 13, 2019 at 11:28:20PM +0800, Chao Gao wrote:
>>>> On Fri, Sep 13, 2019 at 10:02:24AM +0000, Spassov, Stanislav wrote:
>>>> On Thu, Dec 13, 2018 at 07:54, Chao Gao wrote:
>>>>> On Thu, Dec 13, 2018 at 12:54:52AM -0700, Jan Beulich wrote:
>>>>>>>>> On 13.12.18 at 04:46, <chao.gao@intel.com> wrote:
>>>>>>> On Wed, Dec 12, 2018 at 08:21:39AM -0700, Jan Beulich wrote:
>>>>>>>>>>> On 12.12.18 at 16:18, <chao.gao@intel.com> wrote:
>>>>>>>>> On Wed, Dec 12, 2018 at 01:51:01AM -0700, Jan Beulich wrote:
>>>>>>>>>>>>> On 12.12.18 at 08:06, <chao.gao@intel.com> wrote:
>>>>>>>>>>> On Wed, Dec 05, 2018 at 09:01:33AM -0500, Boris Ostrovsky wrote:
>>>>>>>>>>>> On 12/5/18 4:32 AM, Roger Pau Monné wrote:
>>>>>>>>>>>>> On Wed, Dec 05, 2018 at 10:19:17AM +0800, Chao Gao wrote:
>>>>>>>>>>>>>> I find some pass-thru devices don't work any more across guest reboot.
>>>>>>>>>>>>>> Assigning it to another guest also meets the same issue. And the only
>>>>>>>>>>>>>> way to make it work again is un-binding and binding it to pciback.
>>>>>>>>>>>>>> Someone reported this issue one year ago [1]. More detail also can be
>>>>>>>>>>>>>> found in [2].
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The root-cause is Xen's internal MSI-X state isn't reset properly
>>>>>>>>>>>>>> during reboot or re-assignment. In the above case, Xen set maskall bit
>>>>>>>>>>>>>> to mask all MSI interrupts after it detected a potential security
>>>>>>>>>>>>>> issue. Even after device reset, Xen didn't reset its internal maskall
>>>>>>>>>>>>>> bit. As a result, maskall bit would be set again in next write to
>>>>>>>>>>>>>> MSI-X message control register.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Given that PHYSDEVOPS_prepare_msix() also triggers Xen resetting MSI-X
>>>>>>>>>>>>>> internal state of a device, we employ it to fix this issue rather than
>>>>>>>>>>>>>> introducing another dedicated sub-hypercall.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Note that PHYSDEVOPS_release_msix() will fail if the mapping between
>>>>>>>>>>>>>> the device's msix and pirq has been created. This limitation prevents
>>>>>>>>>>>>>> us calling this function when detaching a device from a guest during
>>>>>>>>>>>>>> guest shutdown. Thus it is called right before calling
>>>>>>>>>>>>>> PHYSDEVOPS_prepare_msix().
>>>>>>>>>>>>> s/PHYSDEVOPS/PHYSDEVOP/ (no final S). And then I would also drop the
>>>>>>>>>>>>> () at the end of the hypercall name since it's not a function.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I'm also wondering why the release can't be done when the device is
>>>>>>>>>>>>> detached from the guest (or the guest has been shut down). This makes
>>>>>>>>>>>>> me worry about the raciness of the attach/detach procedure: if there's
>>>>>>>>>>>>> a state where pciback assumes the device has been detached from the
>>>>>>>>>>>>> guest, but there are still pirqs bound, an attempt to attach to
>>>>>>>>>>>>> another guest in such state will fail.
>>>>>>>>>>>> 
>>>>>>>>>>>> I wonder whether this additional reset functionality could be done out
>>>>>>>>>>>> of xen_pcibk_xenbus_remove(). We first do a (best effort) device reset
>>>>>>>>>>>> and then do the extra things that are not properly done there.
>>>>>>>>>>> 
>>>>>>>>>>> No. It cannot be done in xen_pcibk_xenbus_remove() without modifying
>>>>>>>>>>> the handler of PHYSDEVOP_release_msix. To do a successful Xen internal
>>>>>>>>>>> MSI-X state reset, PHYSDEVOP_{release, prepare}_msix should be finished
>>>>>>>>>>> without error. But ATM, xen expects that no msi is bound to pirq when
>>>>>>>>>>> doing PHYSDEVOP_release_msix. Otherwise it fails with error code -EBUSY.
>>>>>>>>>>> However, the expectation isn't guaranteed in xen_pcibk_xenbus_remove().
>>>>>>>>>>> In some cases, if qemu fails to unmap MSIs, MSIs are unmapped by Xen
>>>>>>>>>>> at last minute, which happens after device reset in 
>>>>>>>>>>> xen_pcibk_xenbus_remove().
>>>>>>>>>> 
>>>>>>>>>> But that may need taking care of: I don't think it is a good idea to have
>>>>>>>>>> anything left from the prior owning domain when the device gets reset.
>>>>>>>>>> I.e. left over IRQ bindings should perhaps be forcibly cleared before
>>>>>>>>>> invoking the reset;
>>>>>>>>> 
>>>>>>>>> Agree. How about pciback to track the established IRQ bindings? Then
>>>>>>>>> pciback can clear irq binding before invoking the reset.
>>>>>>>> 
>>>>>>>> How would pciback even know of those mappings, when it's qemu
>>>>>>>> who establishes (and manages) them?
>>>>>>> 
>>>>>>> I meant to expose some interfaces from pciback. And pciback serves
>>>>>>> as the proxy of IRQ (un)binding APIs.
>>>>>> 
>>>>>> If at all possible we should avoid having to change more parties (qemu,
>>>>>> libxc, kernel, hypervisor) than really necessary. Remember that such
>>>>>> a bug fix may want backporting, and making sure affected people have
>>>>>> all relevant components updated is increasingly difficult with their
>>>>>> number growing.
>>>>>> 
>>>>>>>>>> in fact I'd expect this to happen in the course of
>>>>>>>>>> domain destruction, and I'd expect the device reset to come after the
>>>>>>>>>> domain was cleaned up. Perhaps simply an ordering issue in the tool
>>>>>>>>>> stack?
>>>>>>>>> 
>>>>>>>>> I don't think reversing the sequences of device reset and domain
>>>>>>>>> destruction would be simple. Furthermore, during device hot-unplug,
>>>>>>>>> device reset is done when the owner is alive. So if we use domain
>>>>>>>>> destruction to enforce all irq binding cleared, in theory, it won't be
>>>>>>>>> applicable to hot-unplug case (if qemu's hot-unplug logic is
>>>>>>>>> compromised).
>>>>>>>> 
>>>>>>>> Even in the hot-unplug case the tool stack could issue unbind
>>>>>>>> requests, behind the back of the possibly compromised qemu,
>>>>>>>> once neither the guest nor qemu have access to the device
>>>>>>>> anymore.
>>>>>>> 
>>>>>>> But currently, tool stack doesn't know the remaining IRQ bindings.
>>>>>>> If tool stack can maintaine IRQ binding information of a pass-thru
>>>>>>> device (stored in Xenstore?), we can come up with a clean solution
>>>>>>> without modifying linux kernel and Xen.
>>>>>> 
>>>>>> If there's no way for the tool stack to either find out the bindings
>>>>>> or "blindly" issue unbind requests (accepting them to fail), then a
>>>>>> "wildcard" unbind operation may want adding. Or, perhaps even
>>>>>> better, XEN_DOMCTL_deassign_device could unbind anything left
>>>>>> in place for the specified device.
>>>>> 
>>>>> Good idea. I will take this advice.
>>>>> 
>>>>> Thanks
>>>>> Chao
>>>> 
>>>> I am having the same issue, and cannot find a fix in either xen-pciback or the Xen codebase.
>>>> Was a solution ever pushed as a result of this thread?
>>>> 
>>> 
>>> I submitted patches [1] to Xen community. But I didn't get it merged.
>>> We made a change in device driver to disable MSI-X during guest OS
>>> shutdown to mitigate the issue. But when guest or qemu was crashed, we
>>> encountered this issue again. I have no plan to get back to these
>>> patches. But if you want to fix the issue completely along what the
>>> patches below did, please go ahead.
>>> 
>>> [1]: https://lists.xenproject.org/archives/html/xen-devel/2019-01/msg01227.html
>>> 
>>> Thanks
>>> Chao
>>> 
>> 
>> Stanislav: Are you able to continue the work with these patches, to get them merged? 
>
>What further work is needed for these patches?  Are they only needed for Intel i210 NIC PCI passthrough, or are other devices affected?

All MSI-X capable devices were affected. This issue is fixed in Xen by Roger's patch
(https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=575e18d54d19eda787f6477a4acd3c50f72751a9).

Thanks
Chao

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-01-18  1:07 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-05  2:19 [PATCH] xen: xen-pciback: Reset MSI-X state when exposing a device Chao Gao
2018-12-05  9:32 ` Roger Pau Monné
2018-12-05 14:01   ` Boris Ostrovsky
2018-12-12  7:06     ` Chao Gao
2018-12-12  8:51       ` Jan Beulich
2018-12-12 15:18         ` Chao Gao
2018-12-12 15:21           ` Jan Beulich
2018-12-13  3:46             ` Chao Gao
2018-12-13  7:54               ` Jan Beulich
2018-12-13 13:17                 ` Chao Gao
2018-12-06  2:18   ` Chao Gao
2018-12-05 16:26 ` Jan Beulich
2019-09-13 10:02 ` Spassov, Stanislav
2019-09-13 15:28   ` Chao Gao
2019-09-26 10:13     ` [Xen-devel] " Pasi Kärkkäinen
2019-09-26 10:54       ` Spassov, Stanislav
2020-01-17 18:57       ` Rich Persaud
2020-01-18  1:13         ` Chao Gao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).