* [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
@ 2017-01-05 19:28 Dan Streetman
  2017-01-07  1:06 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Streetman @ 2017-01-05 19:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Dan Streetman, Bjorn Helgaas, xen-devel, linux-pci, linux-kernel

Do not read a pci device's msi message data to see if a pirq was
previously configured for the device's msi/msix, as the old pirq was
unmapped and may now be in use by another pci device.  The previous
pirq should never be re-used; instead a new pirq should always be
allocated from the hypervisor.

The xen_hvm_setup_msi_irqs() function currently checks the pci device's
msi descriptor message data for each msi/msix vector it sets up, and if
it finds the vector was previously configured with a pirq, and that pirq
is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
from the hypervisor.  However, that pirq was unmapped when the pci device
disabled its msi/msix, and it cannot be re-used; it may have been given
to a different pci device.

This exact situation is happening in a Xen guest where multiple NVMe
controllers (pci devices) are present.  The NVMe driver configures each
pci device's msi/msix twice; first to configure a single vector (to
talk to the controller for its configuration info), and then it disables
that msi/msix and re-configures with all the msi/msix it needs.  When
multiple NVMe controllers are present, this happens concurrently on all
of them, and in the time between controller A calling pci_disable_msix()
and then calling pci_enable_msix_range(), controller B enables its msix
and gets controller A's pirq allocated from the hypervisor.  Then when
controller A re-configures its msix, its first vector tries to re-use
the same pirq that it had before; but that pirq was allocated to
controller B, and thus the Xen event channel for controller A's re-used
pirq fails to map its irq to that pirq; the hypervisor already has the
pirq mapped elsewhere.

Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
---
 arch/x86/pci/xen.c | 23 +++++++----------------
 1 file changed, 7 insertions(+), 16 deletions(-)

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index bedfab9..a00a6c0 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 		return 1;
 
 	for_each_pci_msi_entry(msidesc, dev) {
-		__pci_read_msi_msg(msidesc, &msg);
-		pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
-			((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
-		if (msg.data != XEN_PIRQ_MSI_DATA ||
-		    xen_irq_from_pirq(pirq) < 0) {
-			pirq = xen_allocate_pirq_msi(dev, msidesc);
-			if (pirq < 0) {
-				irq = -ENODEV;
-				goto error;
-			}
-			xen_msi_compose_msg(dev, pirq, &msg);
-			__pci_write_msi_msg(msidesc, &msg);
-			dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
-		} else {
-			dev_dbg(&dev->dev,
-				"xen: msi already bound to pirq=%d\n", pirq);
+		pirq = xen_allocate_pirq_msi(dev, msidesc);
+		if (pirq < 0) {
+			irq = -ENODEV;
+			goto error;
 		}
+		xen_msi_compose_msg(dev, pirq, &msg);
+		__pci_write_msi_msg(msidesc, &msg);
+		dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
 		irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
 					       (type == PCI_CAP_ID_MSI) ? nvec : 1,
 					       (type == PCI_CAP_ID_MSIX) ?
-- 
2.9.3


* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-05 19:28 [PATCH] xen: do not re-use pirq number cached in pci device msi msg data Dan Streetman
@ 2017-01-07  1:06 ` Konrad Rzeszutek Wilk
  2017-01-09 14:59   ` Boris Ostrovsky
  0 siblings, 1 reply; 34+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-01-07  1:06 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Bjorn Helgaas, xen-devel, linux-pci, linux-kernel, boris.ostrovsky

On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
> Do not read a pci device's msi message data to see if a pirq was
> previously configured for the device's msi/msix, as the old pirq was
> unmapped and may now be in use by another pci device.  The previous
> pirq should never be re-used; instead a new pirq should always be
> allocated from the hypervisor.

Won't this cause a starvation problem? That is we will run out of PIRQs
as we are not reusing them?
> 
> The xen_hvm_setup_msi_irqs() function currently checks the pci device's
> msi descriptor message data for each msi/msix vector it sets up, and if
> it finds the vector was previously configured with a pirq, and that pirq
> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
> from the hypervisor.  However, that pirq was unmapped when the pci device
> disabled its msi/msix, and it cannot be re-used; it may have been given
> to a different pci device.

Hm, but this implies that we do keep track of it.


while (true)
do
 rmmod nvme
 modprobe nvme
done

Things go boom without this patch. But with this patch does this
still work? As in we don't run out of PIRQs?

Thanks.
> 
> This exact situation is happening in a Xen guest where multiple NVMe
> controllers (pci devices) are present.  The NVMe driver configures each
> pci device's msi/msix twice; first to configure a single vector (to
> talk to the controller for its configuration info), and then it disables
> that msi/msix and re-configures with all the msi/msix it needs.  When
> multiple NVMe controllers are present, this happens concurrently on all
> of them, and in the time between controller A calling pci_disable_msix()
> and then calling pci_enable_msix_range(), controller B enables its msix
> and gets controller A's pirq allocated from the hypervisor.  Then when
> controller A re-configures its msix, its first vector tries to re-use
> the same pirq that it had before; but that pirq was allocated to
> controller B, and thus the Xen event channel for controller A's re-used
> pirq fails to map its irq to that pirq; the hypervisor already has the
> pirq mapped elsewhere.
> 
> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
> ---
>  arch/x86/pci/xen.c | 23 +++++++----------------
>  1 file changed, 7 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
> index bedfab9..a00a6c0 100644
> --- a/arch/x86/pci/xen.c
> +++ b/arch/x86/pci/xen.c
> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>  		return 1;
>  
>  	for_each_pci_msi_entry(msidesc, dev) {
> -		__pci_read_msi_msg(msidesc, &msg);
> -		pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
> -			((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> -		if (msg.data != XEN_PIRQ_MSI_DATA ||
> -		    xen_irq_from_pirq(pirq) < 0) {
> -			pirq = xen_allocate_pirq_msi(dev, msidesc);
> -			if (pirq < 0) {
> -				irq = -ENODEV;
> -				goto error;
> -			}
> -			xen_msi_compose_msg(dev, pirq, &msg);
> -			__pci_write_msi_msg(msidesc, &msg);
> -			dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
> -		} else {
> -			dev_dbg(&dev->dev,
> -				"xen: msi already bound to pirq=%d\n", pirq);
> +		pirq = xen_allocate_pirq_msi(dev, msidesc);
> +		if (pirq < 0) {
> +			irq = -ENODEV;
> +			goto error;
>  		}
> +		xen_msi_compose_msg(dev, pirq, &msg);
> +		__pci_write_msi_msg(msidesc, &msg);
> +		dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>  		irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
>  					       (type == PCI_CAP_ID_MSI) ? nvec : 1,
>  					       (type == PCI_CAP_ID_MSIX) ?
> -- 
> 2.9.3
> 


* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-07  1:06 ` Konrad Rzeszutek Wilk
@ 2017-01-09 14:59   ` Boris Ostrovsky
  2017-01-09 15:42     ` [Xen-devel] " Dan Streetman
  0 siblings, 1 reply; 34+ messages in thread
From: Boris Ostrovsky @ 2017-01-09 14:59 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Dan Streetman
  Cc: Bjorn Helgaas, xen-devel, linux-pci, linux-kernel

On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
>> Do not read a pci device's msi message data to see if a pirq was
>> previously configured for the device's msi/msix, as the old pirq was
>> unmapped and may now be in use by another pci device.  The previous
>> pirq should never be re-used; instead a new pirq should always be
>> allocated from the hypervisor.
> Won't this cause a starvation problem? That is we will run out of PIRQs
> as we are not reusing them?

Don't we free the pirq when we unmap it?

-boris

>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's
>> msi descriptor message data for each msi/msix vector it sets up, and if
>> it finds the vector was previously configured with a pirq, and that pirq
>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
>> from the hypervisor.  However, that pirq was unmapped when the pci device
>> disabled its msi/msix, and it cannot be re-used; it may have been given
>> to a different pci device.
> Hm, but this implies that we do keep track of it.
>
>
> while (true)
> do
>  rmmod nvme
>  modprobe nvme
> done
>
> Things go boom without this patch. But with this patch does this
> still work? As in we don't run out of PIRQs?
>
> Thanks.
>> This exact situation is happening in a Xen guest where multiple NVMe
>> controllers (pci devices) are present.  The NVMe driver configures each
>> pci device's msi/msix twice; first to configure a single vector (to
>> talk to the controller for its configuration info), and then it disables
>> that msi/msix and re-configures with all the msi/msix it needs.  When
>> multiple NVMe controllers are present, this happens concurrently on all
>> of them, and in the time between controller A calling pci_disable_msix()
>> and then calling pci_enable_msix_range(), controller B enables its msix
>> and gets controller A's pirq allocated from the hypervisor.  Then when
>> controller A re-configures its msix, its first vector tries to re-use
>> the same pirq that it had before; but that pirq was allocated to
>> controller B, and thus the Xen event channel for controller A's re-used
>> pirq fails to map its irq to that pirq; the hypervisor already has the
>> pirq mapped elsewhere.
>>
>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
>> ---
>>  arch/x86/pci/xen.c | 23 +++++++----------------
>>  1 file changed, 7 insertions(+), 16 deletions(-)
>>
>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
>> index bedfab9..a00a6c0 100644
>> --- a/arch/x86/pci/xen.c
>> +++ b/arch/x86/pci/xen.c
>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>>  		return 1;
>>  
>>  	for_each_pci_msi_entry(msidesc, dev) {
>> -		__pci_read_msi_msg(msidesc, &msg);
>> -		pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
>> -			((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
>> -		if (msg.data != XEN_PIRQ_MSI_DATA ||
>> -		    xen_irq_from_pirq(pirq) < 0) {
>> -			pirq = xen_allocate_pirq_msi(dev, msidesc);
>> -			if (pirq < 0) {
>> -				irq = -ENODEV;
>> -				goto error;
>> -			}
>> -			xen_msi_compose_msg(dev, pirq, &msg);
>> -			__pci_write_msi_msg(msidesc, &msg);
>> -			dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>> -		} else {
>> -			dev_dbg(&dev->dev,
>> -				"xen: msi already bound to pirq=%d\n", pirq);
>> +		pirq = xen_allocate_pirq_msi(dev, msidesc);
>> +		if (pirq < 0) {
>> +			irq = -ENODEV;
>> +			goto error;
>>  		}
>> +		xen_msi_compose_msg(dev, pirq, &msg);
>> +		__pci_write_msi_msg(msidesc, &msg);
>> +		dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>>  		irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
>>  					       (type == PCI_CAP_ID_MSI) ? nvec : 1,
>>  					       (type == PCI_CAP_ID_MSIX) ?
>> -- 
>> 2.9.3
>>


* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-09 14:59   ` Boris Ostrovsky
@ 2017-01-09 15:42     ` Dan Streetman
  2017-01-09 15:59       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Streetman @ 2017-01-09 15:42 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Konrad Rzeszutek Wilk, Dan Streetman, Bjorn Helgaas, xen-devel,
	linux-kernel, linux-pci

On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
>> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
>>> Do not read a pci device's msi message data to see if a pirq was
>>> previously configured for the device's msi/msix, as the old pirq was
>>> unmapped and may now be in use by another pci device.  The previous
>>> pirq should never be re-used; instead a new pirq should always be
>>> allocated from the hypervisor.
>> Won't this cause a starvation problem? That is we will run out of PIRQs
>> as we are not reusing them?
>
> Don't we free the pirq when we unmap it?

I think this is actually a bit worse than I initially thought.  After
looking a bit closer, I think there's an asymmetry with pirq
allocation:

tl;dr:

pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq
allocated, and reserved in the hypervisor

request_irq() -> an event channel is opened for the specific pirq, and
maps the pirq with the hypervisor

free_irq() -> the event channel is closed, and the pirq is unmapped,
but that unmap function also frees the pirq!  The hypervisor can/will
give it away to the next call to get_free_pirq.  However, the pci
msi/msix data area still contains the pirq number, and the next call
to request_irq() will re-use the pirq.

pci_disable_msix() -> this has no effect on the pirq in the hypervisor
(it's already been freed), and it also doesn't clear anything from the
msi/msix data area, so the pirq is still cached there.


It seems like the hypervisor needs to be fixed to *not* unmap the pirq
when the event channel is closed - or at least, not to change it to
IRQ_UNBOUND state?  And, the pci_disable_msix call should eventually
call into something in the Xen guest kernel that actually does the
pirq unmapping, and clear it from the msi data area (i.e.
pci_write_msi_msg)
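
To make that concrete, the guest-side piece I have in mind is roughly
the following - purely a sketch, not compiled, and the helper name is
made up:

/*
 * Hypothetical helper (name invented) for the pci_disable_msix() path:
 * explicitly hand the pirq back to the hypervisor and wipe the stale
 * pirq number cached in the device's MSI message, so a later re-enable
 * can't pick it up again.
 * Uses <xen/interface/physdev.h>, <asm/xen/hypercall.h>, <linux/msi.h>.
 */
static void xen_hvm_release_msi_pirq_sketch(struct msi_desc *msidesc,
					    int pirq)
{
	struct physdev_unmap_pirq unmap = {
		.domid = DOMID_SELF,
		.pirq  = pirq,
	};
	struct msi_msg msg = { };	/* all zeroes */

	if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap))
		pr_warn("xen: failed to unmap pirq %d\n", pirq);

	/* clear the cached pirq from the MSI data area */
	__pci_write_msi_msg(msidesc, &msg);
}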

Alternately, if the hypervisor doesn't change, then the call into the
hypervisor to actually allocate a pirq needs to move from the
pci_enable_msix_range() call to the request_irq() call?  So that when
the msi/msix range is enabled, it doesn't actually reserve any pirq's
for any of the vectors; each request_irq/free_irq pair do the pirq
allocate-and-map/unmap...


longer details:

The chain of function calls starts in the initial call to configure
the msi vectors, which eventually calls __pci_enable_msix_range (or
msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which
either tries to re-use any cached pirq in the MSI data area, or (for
the first time setup) allocates a new pirq from the hypervisor via
PHYSDEVOP_get_free_pirq.  That pirq is then reserved from the
hypervisor's perspective, but it's not mapped to anything in the guest
kernel.
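
In code, that first step boils down to roughly this (my simplified
paraphrase of xen_allocate_pirq_msi(), so treat the details as
approximate):

/* uses <xen/interface/physdev.h> and <asm/xen/hypercall.h> */
static int allocate_pirq_sketch(void)
{
	struct physdev_get_free_pirq op_get_free_pirq = {
		.type = MAP_PIRQ_TYPE_MSI,
	};
	int rc;

	/* ask Xen to reserve a free pirq for this MSI vector */
	rc = HYPERVISOR_physdev_op(PHYSDEVOP_get_free_pirq,
				   &op_get_free_pirq);

	/* on success the pirq is reserved, but not yet mapped to anything */
	return rc ? -1 : op_get_free_pirq.pirq;
}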

Then, the driver calls request_irq to actually start using the irq,
which calls __setup_irq to irq_startup to startup_pirq.  The
startup_pirq call actually creates the evtchn and binds it to the
previously allocated pirq via EVTCHNOP_bind_pirq.
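
The bind step is roughly (again a simplified paraphrase, this time of
startup_pirq() in drivers/xen/events/events_base.c):

/* uses <xen/interface/event_channel.h> and <asm/xen/hypercall.h> */
static int bind_pirq_sketch(int pirq, bool shareable)
{
	struct evtchn_bind_pirq bind_pirq = {
		.pirq  = pirq,
		.flags = shareable ? BIND_PIRQ__WILL_SHARE : 0,
	};

	if (HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind_pirq))
		return -1;

	/* bind_pirq.port is the event channel now bound to our pirq */
	return bind_pirq.port;
}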

At this point, the pirq is bound to a guest kernel evtchn (and irq)
and is in use.  But then, when the driver doesn't want it anymore, it
calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that
function closes the evtchn via EVTCHNOP_close.
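
And the close that triggers the hypervisor-side free is essentially
just (paraphrasing xen_evtchn_close()):

/* uses <xen/interface/event_channel.h> and <asm/xen/hypercall.h> */
static void close_evtchn_sketch(evtchn_port_t port)
{
	struct evtchn_close close = {
		.port = port,
	};

	/* this EVTCHNOP_close is where the pirq ends up released too */
	if (HYPERVISOR_event_channel_op(EVTCHNOP_close, &close) != 0)
		BUG();
}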

Inside the hypervisor, in xen/common/event_channel.c in
evtchn_close(), if the channel is type ECS_PIRQ (which our pirq
channel is) then it unmaps the pirq mapping via
unmap_domain_pirq_emuirq.  This unmaps the pirq, but also puts it back
to state IRQ_UNBOUND, which makes it available for the hypervisor to
give away to anyone requesting a new pirq!





>
> -boris
>
>>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's
>>> msi descriptor message data for each msi/msix vector it sets up, and if
>>> it finds the vector was previously configured with a pirq, and that pirq
>>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
>>> from the hypervisor.  However, that pirq was unmapped when the pci device
>>> disabled its msi/msix, and it cannot be re-used; it may have been given
>>> to a different pci device.
>> Hm, but this implies that we do keep track of it.
>>
>>
>> while (true)
>> do
>>  rmmod nvme
>>  modprobe nvme
>> done
>>
>> Things go boom without this patch. But with this patch does this
>> still work? As in we don't run out of PIRQs?
>>
>> Thanks.
>>> This exact situation is happening in a Xen guest where multiple NVMe
>>> controllers (pci devices) are present.  The NVMe driver configures each
>>> pci device's msi/msix twice; first to configure a single vector (to
>>> talk to the controller for its configuration info), and then it disables
>>> that msi/msix and re-configures with all the msi/msix it needs.  When
>>> multiple NVMe controllers are present, this happens concurrently on all
>>> of them, and in the time between controller A calling pci_disable_msix()
>>> and then calling pci_enable_msix_range(), controller B enables its msix
>>> and gets controller A's pirq allocated from the hypervisor.  Then when
>>> controller A re-configures its msix, its first vector tries to re-use
>>> the same pirq that it had before; but that pirq was allocated to
>>> controller B, and thus the Xen event channel for controller A's re-used
>>> pirq fails to map its irq to that pirq; the hypervisor already has the
>>> pirq mapped elsewhere.
>>>
>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
>>> ---
>>>  arch/x86/pci/xen.c | 23 +++++++----------------
>>>  1 file changed, 7 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
>>> index bedfab9..a00a6c0 100644
>>> --- a/arch/x86/pci/xen.c
>>> +++ b/arch/x86/pci/xen.c
>>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>>>              return 1;
>>>
>>>      for_each_pci_msi_entry(msidesc, dev) {
>>> -            __pci_read_msi_msg(msidesc, &msg);
>>> -            pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
>>> -                    ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
>>> -            if (msg.data != XEN_PIRQ_MSI_DATA ||
>>> -                xen_irq_from_pirq(pirq) < 0) {
>>> -                    pirq = xen_allocate_pirq_msi(dev, msidesc);
>>> -                    if (pirq < 0) {
>>> -                            irq = -ENODEV;
>>> -                            goto error;
>>> -                    }
>>> -                    xen_msi_compose_msg(dev, pirq, &msg);
>>> -                    __pci_write_msi_msg(msidesc, &msg);
>>> -                    dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>>> -            } else {
>>> -                    dev_dbg(&dev->dev,
>>> -                            "xen: msi already bound to pirq=%d\n", pirq);
>>> +            pirq = xen_allocate_pirq_msi(dev, msidesc);
>>> +            if (pirq < 0) {
>>> +                    irq = -ENODEV;
>>> +                    goto error;
>>>              }
>>> +            xen_msi_compose_msg(dev, pirq, &msg);
>>> +            __pci_write_msi_msg(msidesc, &msg);
>>> +            dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>>>              irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
>>>                                             (type == PCI_CAP_ID_MSI) ? nvec : 1,
>>>                                             (type == PCI_CAP_ID_MSIX) ?
>>> --
>>> 2.9.3
>>>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel


* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-09 15:42     ` [Xen-devel] " Dan Streetman
@ 2017-01-09 15:59       ` Konrad Rzeszutek Wilk
  2017-01-09 19:30         ` Stefano Stabellini
  0 siblings, 1 reply; 34+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-01-09 15:59 UTC (permalink / raw)
  To: Dan Streetman, stefano
  Cc: Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel,
	linux-kernel, linux-pci

On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
> On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
> > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
> >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
> >>> Do not read a pci device's msi message data to see if a pirq was
> >>> previously configured for the device's msi/msix, as the old pirq was
> >>> unmapped and may now be in use by another pci device.  The previous
> >>> pirq should never be re-used; instead a new pirq should always be
> >>> allocated from the hypervisor.
> >> Won't this cause a starvation problem? That is we will run out of PIRQs
> >> as we are not reusing them?
> >
> > Don't we free the pirq when we unmap it?
> 
> I think this is actually a bit worse than I initially thought.  After
> looking a bit closer, and I think there's an asymmetry with pirq
> allocation:

Lets include Stefano,

Thank you for digging in this! This has quite the deja-vu
feeling as I believe I hit this at some point in the past and
posted some possible ways of fixing this. But sadly I can't
find the thread.
> 
> tl;dr:
> 
> pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq
> allocated, and reserved in the hypervisor
> 
> request_irq() -> an event channel is opened for the specific pirq, and
> maps the pirq with the hypervisor
> 
> free_irq() -> the event channel is closed, and the pirq is unmapped,
> but that unmap function also frees the pirq!  The hypervisor can/will
> give it away to the next call to get_free_pirq.  However, the pci
> msi/msix data area still contains the pirq number, and the next call
> to request_irq() will re-use the pirq.
> 
> pci_disable_msix() -> this has no effect on the pirq in the hypervisor
> (it's already been freed), and it also doesn't clear anything from the
> msi/msix data area, so the pirq is still cached there.
> 
> 
> It seems like the hypervisor needs to be fixed to *not* unmap the pirq
> when the event channel is closed - or at least, not to change it to
> IRQ_UNBOUND state?  And, the pci_disable_msix call should eventually
> call into something in the Xen guest kernel that actually does the
> pirq unmapping, and clear it from the msi data area (i.e.
> pci_write_msi_msg)

The problem is that Xen changes have sailed a long long time ago.
> 
> Alternately, if the hypervisor doesn't change, then the call into the
> hypervisor to actually allocate a pirq needs to move from the
> pci_enable_msix_range() call to the request_irq() call?  So that when
> the msi/msix range is enabled, it doesn't actually reserve any pirq's
> for any of the vectors; each request_irq/free_irq pair do the pirq
> allocate-and-map/unmap...


Or a third one: We keep a pirq->device lookup and inhibit free_irq()
from actually calling evtchn_close() until the pci_disable_msix() is done?

> 
> 
> longer details:
> 
> The chain of function calls starts in the initial call to configure
> the msi vectors, which eventually calls __pci_enable_msix_range (or
> msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which
> either tries to re-use any cached pirq in the MSI data area, or (for
> the first time setup) allocates a new pirq from the hypervisor via
> PHYSDEVOP_get_free_pirq.  That pirq is then reserved from the
> hypervisor's perspective, but it's not mapped to anything in the guest
> kernel.
> 
> Then, the driver calls request_irq to actually start using the irq,
> which calls __setup_irq to irq_startup to startup_pirq.  The
> startup_pirq call actually creates the evtchn and binds it to the
> previously allocated pirq via EVTCHNOP_bind_pirq.
> 
> At this point, the pirq is bound to a guest kernel evtchn (and irq)
> and is in use.  But then, when the driver doesn't want it anymore, it
> calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that
> function closes the evtchn via EVTCHNOP_close.
> 
> Inside the hypervisor, in xen/common/event_channel.c in
> evtchn_close(), if the channel is type ECS_PIRQ (which our pirq
> channel is) then it unmaps the pirq mapping via
> unmap_domain_pirq_emuirq.  This unmaps the pirq, but also puts it back
> to state IRQ_UNBOUND, which makes it available for the hypervisor to
> give away to anyone requesting a new pirq!
> 
> 
> 
> 
> 
> >
> > -boris
> >
> >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's
> >>> msi descriptor message data for each msi/msix vector it sets up, and if
> >>> it finds the vector was previously configured with a pirq, and that pirq
> >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
> >>> from the hypervisor.  However, that pirq was unmapped when the pci device
> >>> disabled its msi/msix, and it cannot be re-used; it may have been given
> >>> to a different pci device.
> >> Hm, but this implies that we do keep track of it.
> >>
> >>
> >> while (true)
> >> do
> >>  rmmod nvme
> >>  modprobe nvme
> >> done
> >>
> >> Things go boom without this patch. But with this patch does this
> >> still work? As in we don't run out of PIRQs?
> >>
> >> Thanks.
> >>> This exact situation is happening in a Xen guest where multiple NVMe
> >>> controllers (pci devices) are present.  The NVMe driver configures each
> >>> pci device's msi/msix twice; first to configure a single vector (to
> >>> talk to the controller for its configuration info), and then it disables
> >>> that msi/msix and re-configures with all the msi/msix it needs.  When
> >>> multiple NVMe controllers are present, this happens concurrently on all
> >>> of them, and in the time between controller A calling pci_disable_msix()
> >>> and then calling pci_enable_msix_range(), controller B enables its msix
> >>> and gets controller A's pirq allocated from the hypervisor.  Then when
> >>> controller A re-configures its msix, its first vector tries to re-use
> >>> the same pirq that it had before; but that pirq was allocated to
> >>> controller B, and thus the Xen event channel for controller A's re-used
> >>> pirq fails to map its irq to that pirq; the hypervisor already has the
> >>> pirq mapped elsewhere.
> >>>
> >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
> >>> ---
> >>>  arch/x86/pci/xen.c | 23 +++++++----------------
> >>>  1 file changed, 7 insertions(+), 16 deletions(-)
> >>>
> >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
> >>> index bedfab9..a00a6c0 100644
> >>> --- a/arch/x86/pci/xen.c
> >>> +++ b/arch/x86/pci/xen.c
> >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> >>>              return 1;
> >>>
> >>>      for_each_pci_msi_entry(msidesc, dev) {
> >>> -            __pci_read_msi_msg(msidesc, &msg);
> >>> -            pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
> >>> -                    ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> >>> -            if (msg.data != XEN_PIRQ_MSI_DATA ||
> >>> -                xen_irq_from_pirq(pirq) < 0) {
> >>> -                    pirq = xen_allocate_pirq_msi(dev, msidesc);
> >>> -                    if (pirq < 0) {
> >>> -                            irq = -ENODEV;
> >>> -                            goto error;
> >>> -                    }
> >>> -                    xen_msi_compose_msg(dev, pirq, &msg);
> >>> -                    __pci_write_msi_msg(msidesc, &msg);
> >>> -                    dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
> >>> -            } else {
> >>> -                    dev_dbg(&dev->dev,
> >>> -                            "xen: msi already bound to pirq=%d\n", pirq);
> >>> +            pirq = xen_allocate_pirq_msi(dev, msidesc);
> >>> +            if (pirq < 0) {
> >>> +                    irq = -ENODEV;
> >>> +                    goto error;
> >>>              }
> >>> +            xen_msi_compose_msg(dev, pirq, &msg);
> >>> +            __pci_write_msi_msg(msidesc, &msg);
> >>> +            dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
> >>>              irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
> >>>                                             (type == PCI_CAP_ID_MSI) ? nvec : 1,
> >>>                                             (type == PCI_CAP_ID_MSIX) ?
> >>> --
> >>> 2.9.3
> >>>
> >
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > https://lists.xen.org/xen-devel


* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-09 15:59       ` Konrad Rzeszutek Wilk
@ 2017-01-09 19:30         ` Stefano Stabellini
  2017-01-10 15:57           ` Dan Streetman
  0 siblings, 1 reply; 34+ messages in thread
From: Stefano Stabellini @ 2017-01-09 19:30 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Dan Streetman, stefano, Boris Ostrovsky, Dan Streetman,
	Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
> > <boris.ostrovsky@oracle.com> wrote:
> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
> > >>> Do not read a pci device's msi message data to see if a pirq was
> > >>> previously configured for the device's msi/msix, as the old pirq was
> > >>> unmapped and may now be in use by another pci device.  The previous
> > >>> pirq should never be re-used; instead a new pirq should always be
> > >>> allocated from the hypervisor.
> > >> Won't this cause a starvation problem? That is we will run out of PIRQs
> > >> as we are not reusing them?
> > >
> > > Don't we free the pirq when we unmap it?
> > 
> > I think this is actually a bit worse than I initially thought.  After
> > looking a bit closer, and I think there's an asymmetry with pirq
> > allocation:
> 
> Lets include Stefano,
> 
> Thank you for digging in this! This has quite the deja-vu
> feeling as I believe I hit this at some point in the past and
> posted some possible ways of fixing this. But sadly I can't
> find the thread.

This issue seems to be caused by:

commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Date:   Wed Dec 1 14:51:44 2010 +0000

    xen: fix MSI setup and teardown for PV on HVM guests

which was a fix to a bug:

    This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
    trying to enable the same MSI for the second time: the old MSI to pirq
    mapping is still valid at this point but xen_hvm_setup_msi_irqs would
    try to assign a new pirq anyway.
    A simple way to reproduce this bug is to assign an MSI capable network
    card to a PV on HVM guest, if the user brings down the corresponding
    ethernet interface and up again, Linux would fail to enable MSIs on the
    device.

I don't remember any of the details. From the description of this bug,
it seems that Xen changed behavior in the past few years: before it used
to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
old MSI to pirq mapping is still valid at this point", the pirq couldn't
have been completely freed, then reassigned to somebody else the way it
is described in this email.

I think we should identify the changeset or Xen version that introduced
the new behavior. If it is old enough, we might be able to just revert
af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
behavior conditional on the Xen version.
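
If it comes to making it conditional, a minimal sketch of what I mean
(untested, and the 4.1 threshold below is only a placeholder until we
identify the actual changeset):

/*
 * Placeholder: only trust a cached pirq on hypervisors old enough to
 * keep the pirq-MSI mapping across an event channel close.  The
 * version threshold here is made up and needs replacing once we know
 * which Xen release changed the behaviour.
 * Uses <xen/interface/version.h> and <asm/xen/hypercall.h>.
 */
static bool xen_keeps_pirq_mapping_sketch(void)
{
	unsigned int version = HYPERVISOR_xen_version(XENVER_version, NULL);
	unsigned int major = version >> 16, minor = version & 0xffff;

	return major < 4 || (major == 4 && minor < 1);
}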


> > tl;dr:
> > 
> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq
> > allocated, and reserved in the hypervisor
> > 
> > request_irq() -> an event channel is opened for the specific pirq, and
> > maps the pirq with the hypervisor
> > 
> > free_irq() -> the event channel is closed, and the pirq is unmapped,
> > but that unmap function also frees the pirq!  The hypervisor can/will
> > give it away to the next call to get_free_pirq.  However, the pci
> > msi/msix data area still contains the pirq number, and the next call
> > to request_irq() will re-use the pirq.
> > 
> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor
> > (it's already been freed), and it also doesn't clear anything from the
> > msi/msix data area, so the pirq is still cached there.
> > 
> > 
> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq
> > when the event channel is closed - or at least, not to change it to
> > IRQ_UNBOUND state?  And, the pci_disable_msix call should eventually
> > call into something in the Xen guest kernel that actually does the
> > pirq unmapping, and clear it from the msi data area (i.e.
> > pci_write_msi_msg)
> 
> The problem is that Xen changes have sailed a long long time ago.
> > 
> > Alternately, if the hypervisor doesn't change, then the call into the
> > hypervisor to actually allocate a pirq needs to move from the
> > pci_enable_msix_range() call to the request_irq() call?  So that when
> > the msi/msix range is enabled, it doesn't actually reserve any pirq's
> > for any of the vectors; each request_irq/free_irq pair do the pirq
> > allocate-and-map/unmap...
> 
> 
> Or a third one: We keep an pirq->device lookup and inhibit free_irq()
> from actually calling evtchn_close() until the pci_disable_msix() is done?

I think that's a reasonable alternative: we mask the evtchn, but do not
call xen_evtchn_close in shutdown_pirq for PV on HVM guests.
Something like (not compiled, untested):

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 137bd0e..3174923 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data)
 		return;
 
 	mask_evtchn(evtchn);
-	xen_evtchn_close(evtchn);
+	if (!xen_hvm_domain())
+		xen_evtchn_close(evtchn);
 	xen_irq_info_cleanup(info);
 }
 

We want to keep the pirq allocated to the device - not closing the
evtchn seems like the right thing to do. I suggest we test for the
original bug too: enable/disable the network interface of an MSI capable
network card.


> > longer details:
> > 
> > The chain of function calls starts in the initial call to configure
> > the msi vectors, which eventually calls __pci_enable_msix_range (or
> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which
> > either tries to re-use any cached pirq in the MSI data area, or (for
> > the first time setup) allocates a new pirq from the hypervisor via
> > PHYSDEVOP_get_free_pirq.  That pirq is then reserved from the
> > hypervisor's perspective, but it's not mapped to anything in the guest
> > kernel.
> > 
> > Then, the driver calls request_irq to actually start using the irq,
> > which calls __setup_irq to irq_startup to startup_pirq.  The
> > startup_pirq call actually creates the evtchn and binds it to the
> > previously allocated pirq via EVTCHNOP_bind_pirq.
> > 
> > At this point, the pirq is bound to a guest kernel evtchn (and irq)
> > and is in use.  But then, when the driver doesn't want it anymore, it
> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that
> > function closes the evtchn via EVTCHNOP_close.
> > 
> > Inside the hypervisor, in xen/common/event_channel.c in
> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq
> > channel is) then it unmaps the pirq mapping via
> > unmap_domain_pirq_emuirq.  This unmaps the pirq, but also puts it back
> > to state IRQ_UNBOUND, which makes it available for the hypervisor to
> > give away to anyone requesting a new pirq!
> > 
> > 
> > 
> > 
> > 
> > >
> > > -boris
> > >
> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's
> > >>> msi descriptor message data for each msi/msix vector it sets up, and if
> > >>> it finds the vector was previously configured with a pirq, and that pirq
> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
> > >>> from the hypervisor.  However, that pirq was unmapped when the pci device
> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given
> > >>> to a different pci device.
> > >> Hm, but this implies that we do keep track of it.
> > >>
> > >>
> > >> while (true)
> > >> do
> > >>  rmmod nvme
> > >>  modprobe nvme
> > >> done
> > >>
> > >> Things go boom without this patch. But with this patch does this
> > >> still work? As in we don't run out of PIRQs?
> > >>
> > >> Thanks.
> > >>> This exact situation is happening in a Xen guest where multiple NVMe
> > >>> controllers (pci devices) are present.  The NVMe driver configures each
> > >>> pci device's msi/msix twice; first to configure a single vector (to
> > >>> talk to the controller for its configuration info), and then it disables
> > >>> that msi/msix and re-configures with all the msi/msix it needs.  When
> > >>> multiple NVMe controllers are present, this happens concurrently on all
> > >>> of them, and in the time between controller A calling pci_disable_msix()
> > >>> and then calling pci_enable_msix_range(), controller B enables its msix
> > >>> and gets controller A's pirq allocated from the hypervisor.  Then when
> > >>> controller A re-configures its msix, its first vector tries to re-use
> > >>> the same pirq that it had before; but that pirq was allocated to
> > >>> controller B, and thus the Xen event channel for controller A's re-used
> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the
> > >>> pirq mapped elsewhere.
> > >>>
> > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
> > >>> ---
> > >>>  arch/x86/pci/xen.c | 23 +++++++----------------
> > >>>  1 file changed, 7 insertions(+), 16 deletions(-)
> > >>>
> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
> > >>> index bedfab9..a00a6c0 100644
> > >>> --- a/arch/x86/pci/xen.c
> > >>> +++ b/arch/x86/pci/xen.c
> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > >>>              return 1;
> > >>>
> > >>>      for_each_pci_msi_entry(msidesc, dev) {
> > >>> -            __pci_read_msi_msg(msidesc, &msg);
> > >>> -            pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
> > >>> -                    ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> > >>> -            if (msg.data != XEN_PIRQ_MSI_DATA ||
> > >>> -                xen_irq_from_pirq(pirq) < 0) {
> > >>> -                    pirq = xen_allocate_pirq_msi(dev, msidesc);
> > >>> -                    if (pirq < 0) {
> > >>> -                            irq = -ENODEV;
> > >>> -                            goto error;
> > >>> -                    }
> > >>> -                    xen_msi_compose_msg(dev, pirq, &msg);
> > >>> -                    __pci_write_msi_msg(msidesc, &msg);
> > >>> -                    dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
> > >>> -            } else {
> > >>> -                    dev_dbg(&dev->dev,
> > >>> -                            "xen: msi already bound to pirq=%d\n", pirq);
> > >>> +            pirq = xen_allocate_pirq_msi(dev, msidesc);
> > >>> +            if (pirq < 0) {
> > >>> +                    irq = -ENODEV;
> > >>> +                    goto error;
> > >>>              }
> > >>> +            xen_msi_compose_msg(dev, pirq, &msg);
> > >>> +            __pci_write_msi_msg(msidesc, &msg);
> > >>> +            dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
> > >>>              irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
> > >>>                                             (type == PCI_CAP_ID_MSI) ? nvec : 1,
> > >>>                                             (type == PCI_CAP_ID_MSIX) ?
> > >>> --
> > >>> 2.9.3
> > >>>
> > >
> > >
> > > _______________________________________________
> > > Xen-devel mailing list
> > > Xen-devel@lists.xen.org
> > > https://lists.xen.org/xen-devel
> 


* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-09 19:30         ` Stefano Stabellini
@ 2017-01-10 15:57           ` Dan Streetman
  2017-01-10 18:41             ` Dan Streetman
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Streetman @ 2017-01-10 15:57 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Konrad Rzeszutek Wilk, stefano, Boris Ostrovsky, Dan Streetman,
	Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
<sstabellini@kernel.org> wrote:
> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
>> > <boris.ostrovsky@oracle.com> wrote:
>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
>> > >>> Do not read a pci device's msi message data to see if a pirq was
>> > >>> previously configured for the device's msi/msix, as the old pirq was
>> > >>> unmapped and may now be in use by another pci device.  The previous
>> > >>> pirq should never be re-used; instead a new pirq should always be
>> > >>> allocated from the hypervisor.
>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs
>> > >> as we are not reusing them?
>> > >
>> > > Don't we free the pirq when we unmap it?
>> >
>> > I think this is actually a bit worse than I initially thought.  After
>> > looking a bit closer, and I think there's an asymmetry with pirq
>> > allocation:
>>
>> Lets include Stefano,
>>
>> Thank you for digging in this! This has quite the deja-vu
>> feeling as I believe I hit this at some point in the past and
>> posted some possible ways of fixing this. But sadly I can't
>> find the thread.
>
> This issue seems to be caused by:
>
> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> Date:   Wed Dec 1 14:51:44 2010 +0000
>
>     xen: fix MSI setup and teardown for PV on HVM guests
>
> which was a fix to a bug:
>
>     This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
>     trying to enable the same MSI for the second time: the old MSI to pirq
>     mapping is still valid at this point but xen_hvm_setup_msi_irqs would
>     try to assign a new pirq anyway.
>     A simple way to reproduce this bug is to assign an MSI capable network
>     card to a PV on HVM guest, if the user brings down the corresponding
>     ethernet interface and up again, Linux would fail to enable MSIs on the
>     device.
>
> I don't remember any of the details. From the description of this bug,
> it seems that Xen changed behavior in the past few years: before it used
> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
> old MSI to pirq mapping is still valid at this point", the pirq couldn't
> have been completely freed, then reassigned to somebody else the way it
> is described in this email.
>
> I think we should indentify the changeset or Xen version that introduced
> the new behavior. If it is old enough, we might be able to just revert
> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
> behavior conditional to the Xen version.

Are PT devices the only MSI-capable devices available in a Xen guest?
That's where I'm seeing this problem, with PT NVMe devices.

I can say that on the Xen guest with NVMe PT devices I'm testing on,
with the patch from this thread (which essentially reverts your commit
above) as well as some added debug to see the pirq numbers, cycles of
'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
hypervisor provides essentially the same pirqs for each modprobe,
since they were freed by the rmmod.

>
>
>> > tl;dr:
>> >
>> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq
>> > allocated, and reserved in the hypervisor
>> >
>> > request_irq() -> an event channel is opened for the specific pirq, and
>> > maps the pirq with the hypervisor
>> >
>> > free_irq() -> the event channel is closed, and the pirq is unmapped,
>> > but that unmap function also frees the pirq!  The hypervisor can/will
>> > give it away to the next call to get_free_pirq.  However, the pci
>> > msi/msix data area still contains the pirq number, and the next call
>> > to request_irq() will re-use the pirq.
>> >
>> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor
>> > (it's already been freed), and it also doesn't clear anything from the
>> > msi/msix data area, so the pirq is still cached there.
>> >
>> >
>> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq
>> > when the event channel is closed - or at least, not to change it to
>> > IRQ_UNBOUND state?  And, the pci_disable_msix call should eventually
>> > call into something in the Xen guest kernel that actually does the
>> > pirq unmapping, and clear it from the msi data area (i.e.
>> > pci_write_msi_msg)
>>
>> The problem is that Xen changes have sailed a long long time ago.
>> >
>> > Alternately, if the hypervisor doesn't change, then the call into the
>> > hypervisor to actually allocate a pirq needs to move from the
>> > pci_enable_msix_range() call to the request_irq() call?  So that when
>> > the msi/msix range is enabled, it doesn't actually reserve any pirq's
>> > for any of the vectors; each request_irq/free_irq pair do the pirq
>> > allocate-and-map/unmap...
>>
>>
>> Or a third one: We keep an pirq->device lookup and inhibit free_irq()
>> from actually calling evtchn_close() until the pci_disable_msix() is done?
>
> I think that's a reasonable alternative: we mask the evtchn, but do not
> call xen_evtchn_close in shutdown_pirq for PV on HVM guests.
> Something like (not compiled, untested):
>
> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
> index 137bd0e..3174923 100644
> --- a/drivers/xen/events/events_base.c
> +++ b/drivers/xen/events/events_base.c
> @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data)
>                 return;
>
>         mask_evtchn(evtchn);
> -       xen_evtchn_close(evtchn);
> +       if (!xen_hvm_domain())
> +               xen_evtchn_close(evtchn);

wouldn't we need to also add code to the pci_disable_msix path that
would actually close the evtchn?  Would this leave the evtchn around
forever?
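
i.e. something roughly like this on the teardown side - purely a sketch
of what I mean, name invented, not compiled, and assuming we can look
up the evtchn for the irq (evtchn_from_irq() or similar):

/*
 * Hypothetical: if shutdown_pirq() stops closing the event channel for
 * HVM guests, the MSI disable/teardown path would need to close it
 * instead, otherwise the masked evtchn is leaked.
 */
static void hvm_close_msi_evtchn_sketch(unsigned int irq)
{
	struct evtchn_close close = {
		.port = evtchn_from_irq(irq),
	};

	if (close.port)
		HYPERVISOR_event_channel_op(EVTCHNOP_close, &close);
}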

>         xen_irq_info_cleanup(info);
>  }
>
>
> We want to keep the pirq allocated to the device - not closing the
> evtchn seems like the right thing to do. I suggest we test for the
> original bug too: enable/disable the network interface of an MSI capable
> network card.

I don't have a Xen hypervisor setup myself, I'm just using AWS, but
I'll try to test this on an instance with a SRIOV/PT nic.

>
>
>> > longer details:
>> >
>> > The chain of function calls starts in the initial call to configure
>> > the msi vectors, which eventually calls __pci_enable_msix_range (or
>> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which
>> > either tries to re-use any cached pirq in the MSI data area, or (for
>> > the first time setup) allocates a new pirq from the hypervisor via
>> > PHYSDEVOP_get_free_pirq.  That pirq is then reserved from the
>> > hypervisor's perspective, but it's not mapped to anything in the guest
>> > kernel.
>> >
>> > Then, the driver calls request_irq to actually start using the irq,
>> > which calls __setup_irq to irq_startup to startup_pirq.  The
>> > startup_pirq call actually creates the evtchn and binds it to the
>> > previously allocated pirq via EVTCHNOP_bind_pirq.
>> >
>> > At this point, the pirq is bound to a guest kernel evtchn (and irq)
>> > and is in use.  But then, when the driver doesn't want it anymore, it
>> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that
>> > function closes the evtchn via EVTCHNOP_close.
>> >
>> > Inside the hypervisor, in xen/common/event_channel.c in
>> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq
>> > channel is) then it unmaps the pirq mapping via
>> > unmap_domain_pirq_emuirq.  This unmaps the pirq, but also puts it back
>> > to state IRQ_UNBOUND, which makes it available for the hypervisor to
>> > give away to anyone requesting a new pirq!
>> >
>> >
>> >
>> >
>> >
>> > >
>> > > -boris
>> > >
>> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's
>> > >>> msi descriptor message data for each msi/msix vector it sets up, and if
>> > >>> it finds the vector was previously configured with a pirq, and that pirq
>> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
>> > >>> from the hypervisor.  However, that pirq was unmapped when the pci device
>> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given
>> > >>> to a different pci device.
>> > >> Hm, but this implies that we do keep track of it.
>> > >>
>> > >>
>> > >> while (true)
>> > >> do
>> > >>  rmmod nvme
>> > >>  modprobe nvme
>> > >> done
>> > >>
>> > >> Things go boom without this patch. But with this patch does this
>> > >> still work? As in we don't run out of PIRQs?
>> > >>
>> > >> Thanks.
>> > >>> This exact situation is happening in a Xen guest where multiple NVMe
>> > >>> controllers (pci devices) are present.  The NVMe driver configures each
>> > >>> pci device's msi/msix twice; first to configure a single vector (to
>> > >>> talk to the controller for its configuration info), and then it disables
>> > >>> that msi/msix and re-configures with all the msi/msix it needs.  When
>> > >>> multiple NVMe controllers are present, this happens concurrently on all
>> > >>> of them, and in the time between controller A calling pci_disable_msix()
>> > >>> and then calling pci_enable_msix_range(), controller B enables its msix
>> > >>> and gets controller A's pirq allocated from the hypervisor.  Then when
>> > >>> controller A re-configures its msix, its first vector tries to re-use
>> > >>> the same pirq that it had before; but that pirq was allocated to
>> > >>> controller B, and thus the Xen event channel for controller A's re-used
>> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the
>> > >>> pirq mapped elsewhere.
>> > >>>
>> > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
>> > >>> ---
>> > >>>  arch/x86/pci/xen.c | 23 +++++++----------------
>> > >>>  1 file changed, 7 insertions(+), 16 deletions(-)
>> > >>>
>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
>> > >>> index bedfab9..a00a6c0 100644
>> > >>> --- a/arch/x86/pci/xen.c
>> > >>> +++ b/arch/x86/pci/xen.c
>> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>> > >>>              return 1;
>> > >>>
>> > >>>      for_each_pci_msi_entry(msidesc, dev) {
>> > >>> -            __pci_read_msi_msg(msidesc, &msg);
>> > >>> -            pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
>> > >>> -                    ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
>> > >>> -            if (msg.data != XEN_PIRQ_MSI_DATA ||
>> > >>> -                xen_irq_from_pirq(pirq) < 0) {
>> > >>> -                    pirq = xen_allocate_pirq_msi(dev, msidesc);
>> > >>> -                    if (pirq < 0) {
>> > >>> -                            irq = -ENODEV;
>> > >>> -                            goto error;
>> > >>> -                    }
>> > >>> -                    xen_msi_compose_msg(dev, pirq, &msg);
>> > >>> -                    __pci_write_msi_msg(msidesc, &msg);
>> > >>> -                    dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>> > >>> -            } else {
>> > >>> -                    dev_dbg(&dev->dev,
>> > >>> -                            "xen: msi already bound to pirq=%d\n", pirq);
>> > >>> +            pirq = xen_allocate_pirq_msi(dev, msidesc);
>> > >>> +            if (pirq < 0) {
>> > >>> +                    irq = -ENODEV;
>> > >>> +                    goto error;
>> > >>>              }
>> > >>> +            xen_msi_compose_msg(dev, pirq, &msg);
>> > >>> +            __pci_write_msi_msg(msidesc, &msg);
>> > >>> +            dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>> > >>>              irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
>> > >>>                                             (type == PCI_CAP_ID_MSI) ? nvec : 1,
>> > >>>                                             (type == PCI_CAP_ID_MSIX) ?
>> > >>> --
>> > >>> 2.9.3
>> > >>>
>> > >
>> > >
>> > > _______________________________________________
>> > > Xen-devel mailing list
>> > > Xen-devel@lists.xen.org
>> > > https://lists.xen.org/xen-devel
>>


* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-10 15:57           ` Dan Streetman
@ 2017-01-10 18:41             ` Dan Streetman
  2017-01-10 19:03               ` Stefano Stabellini
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Streetman @ 2017-01-10 18:41 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, Dan Streetman,
	Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote:
> On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
> <sstabellini@kernel.org> wrote:
>> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
>>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
>>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
>>> > <boris.ostrovsky@oracle.com> wrote:
>>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
>>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
>>> > >>> Do not read a pci device's msi message data to see if a pirq was
>>> > >>> previously configured for the device's msi/msix, as the old pirq was
>>> > >>> unmapped and may now be in use by another pci device.  The previous
>>> > >>> pirq should never be re-used; instead a new pirq should always be
>>> > >>> allocated from the hypervisor.
>>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs
>>> > >> as we are not reusing them?
>>> > >
>>> > > Don't we free the pirq when we unmap it?
>>> >
>>> > I think this is actually a bit worse than I initially thought.  After
>>> > looking a bit closer, and I think there's an asymmetry with pirq
>>> > allocation:
>>>
>>> Lets include Stefano,
>>>
>>> Thank you for digging in this! This has quite the deja-vu
>>> feeling as I believe I hit this at some point in the past and
>>> posted some possible ways of fixing this. But sadly I can't
>>> find the thread.
>>
>> This issue seems to be caused by:
>>
>> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
>> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>> Date:   Wed Dec 1 14:51:44 2010 +0000
>>
>>     xen: fix MSI setup and teardown for PV on HVM guests
>>
>> which was a fix to a bug:
>>
>>     This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
>>     trying to enable the same MSI for the second time: the old MSI to pirq
>>     mapping is still valid at this point but xen_hvm_setup_msi_irqs would
>>     try to assign a new pirq anyway.
>>     A simple way to reproduce this bug is to assign an MSI capable network
>>     card to a PV on HVM guest, if the user brings down the corresponding
>>     ethernet interface and up again, Linux would fail to enable MSIs on the
>>     device.
>>
>> I don't remember any of the details. From the description of this bug,
>> it seems that Xen changed behavior in the past few years: before it used
>> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
>> old MSI to pirq mapping is still valid at this point", the pirq couldn't
>> have been completely freed, then reassigned to somebody else the way it
>> is described in this email.
>>
>> I think we should indentify the changeset or Xen version that introduced
>> the new behavior. If it is old enough, we might be able to just revert
>> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
>> behavior conditional to the Xen version.
>
> Are PT devices the only MSI-capable devices available in a Xen guest?
> That's where I'm seeing this problem, with PT NVMe devices.
>
> I can say that on the Xen guest with NVMe PT devices I'm testing on,
> with the patch from this thread (which essentially reverts your commit
> above) as well as some added debug to see the pirq numbers, cycles of
> 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
> hypervisor provides essentially the same pirqs for each modprobe,
> since they were freed by the rmmod.
>
>>
>>
>>> > tl;dr:
>>> >
>>> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq
>>> > allocated, and reserved in the hypervisor
>>> >
>>> > request_irq() -> an event channel is opened for the specific pirq, and
>>> > maps the pirq with the hypervisor
>>> >
>>> > free_irq() -> the event channel is closed, and the pirq is unmapped,
>>> > but that unmap function also frees the pirq!  The hypervisor can/will
>>> > give it away to the next call to get_free_pirq.  However, the pci
>>> > msi/msix data area still contains the pirq number, and the next call
>>> > to request_irq() will re-use the pirq.
>>> >
>>> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor
>>> > (it's already been freed), and it also doesn't clear anything from the
>>> > msi/msix data area, so the pirq is still cached there.
>>> >
>>> >
>>> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq
>>> > when the event channel is closed - or at least, not to change it to
>>> > IRQ_UNBOUND state?  And, the pci_disable_msix call should eventually
>>> > call into something in the Xen guest kernel that actually does the
>>> > pirq unmapping, and clear it from the msi data area (i.e.
>>> > pci_write_msi_msg)
>>>
>>> The problem is that Xen changes have sailed a long long time ago.
>>> >
>>> > Alternately, if the hypervisor doesn't change, then the call into the
>>> > hypervisor to actually allocate a pirq needs to move from the
>>> > pci_enable_msix_range() call to the request_irq() call?  So that when
>>> > the msi/msix range is enabled, it doesn't actually reserve any pirq's
>>> > for any of the vectors; each request_irq/free_irq pair do the pirq
>>> > allocate-and-map/unmap...
>>>
>>>
>>> Or a third one: We keep an pirq->device lookup and inhibit free_irq()
>>> from actually calling evtchn_close() until the pci_disable_msix() is done?
>>
>> I think that's a reasonable alternative: we mask the evtchn, but do not
>> call xen_evtchn_close in shutdown_pirq for PV on HVM guests.
>> Something like (not compiled, untested):
>>
>> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
>> index 137bd0e..3174923 100644
>> --- a/drivers/xen/events/events_base.c
>> +++ b/drivers/xen/events/events_base.c
>> @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data)
>>                 return;
>>
>>         mask_evtchn(evtchn);
>> -       xen_evtchn_close(evtchn);
>> +       if (!xen_hvm_domain())
>> +               xen_evtchn_close(evtchn);
>
> wouldn't we need to also add code to the pci_disable_msix path that
> would actually close the evtchn?  Would this leave the evtchn around
> forever?
>
>>         xen_irq_info_cleanup(info);
>>  }
>>
>>
>> We want to keep the pirq allocated to the device - not closing the
>> evtchn seems like the right thing to do. I suggest we test for the
>> original bug too: enable/disable the network interface of an MSI capable
>> network card.
>
> I don't have a Xen hypervisor setup myself, I'm just using AWS, but
> I'll try to test this on an instance with a SRIOV/PT nic.

I started an instance with SRIOV nics, with the latest upstream kernel
plus this patch and debug to show the pirqs.  Taking an interface down
and back up uses the same pirq, because the driver only does
free_irq/request_irq, which just closes/reopens the evtchn using the
same pirq (which is cached in the struct irq_info) - it doesn't
disable/reenable the MSIX vectors.  Doing a complete rmmod/modprobe of
the driver does disable and reenable the MSIX vectors, but the
hypervisor provides the same pirqs from the get_free_pirq() call that
were used before (since nothing else is asking for free pirqs).
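
For reference, the pirq is cached in the per-irq bookkeeping kept by the
events code; this is a sketch from memory of
drivers/xen/events/events_internal.h, so the exact field layout may
differ:

    /* Sketch only: the per-irq info the Xen events code keeps.
     * startup_pirq()/shutdown_pirq() keep reusing u.pirq.pirq across
     * free_irq()/request_irq() cycles, which is why ifdown/ifup gets
     * the same pirq without asking the hypervisor again. */
    struct irq_info {
            enum xen_irq_type type;     /* IRQT_PIRQ for these MSIs */
            unsigned irq;               /* Linux irq number */
            unsigned int evtchn;        /* event channel, 0 when closed */
            unsigned short cpu;         /* cpu the evtchn is bound to */
            union {
                    struct {
                            unsigned short pirq;   /* Xen pirq number */
                            unsigned short gsi;
                            unsigned char vector;
                            unsigned char flags;
                            uint16_t domid;
                    } pirq;
                    /* ... virq/ipi variants ... */
            } u;
    };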

>
>>
>>
>>> > longer details:
>>> >
>>> > The chain of function calls starts in the initial call to configure
>>> > the msi vectors, which eventually calls __pci_enable_msix_range (or
>>> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which
>>> > either tries to re-use any cached pirq in the MSI data area, or (for
>>> > the first time setup) allocates a new pirq from the hypervisor via
>>> > PHYSDEVOP_get_free_pirq.  That pirq is then reserved from the
>>> > hypervisor's perspective, but it's not mapped to anything in the guest
>>> > kernel.
>>> >
>>> > Then, the driver calls request_irq to actually start using the irq,
>>> > which calls __setup_irq to irq_startup to startup_pirq.  The
>>> > startup_pirq call actually creates the evtchn and binds it to the
>>> > previously allocated pirq via EVTCHNOP_bind_pirq.
>>> >
>>> > At this point, the pirq is bound to a guest kernel evtchn (and irq)
>>> > and is in use.  But then, when the driver doesn't want it anymore, it
>>> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that
>>> > function closes the evtchn via EVTCHNOP_close.
>>> >
>>> > Inside the hypervisor, in xen/common/event_channel.c in
>>> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq
>>> > channel is) then it unmaps the pirq mapping via
>>> > unmap_domain_pirq_emuirq.  This unmaps the pirq, but also puts it back
>>> > to state IRQ_UNBOUND, which makes it available for the hypervisor to
>>> > give away to anyone requesting a new pirq!
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > >
>>> > > -boris
>>> > >
>>> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's
>>> > >>> msi descriptor message data for each msi/msix vector it sets up, and if
>>> > >>> it finds the vector was previously configured with a pirq, and that pirq
>>> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
>>> > >>> from the hypervisor.  However, that pirq was unmapped when the pci device
>>> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given
>>> > >>> to a different pci device.
>>> > >> Hm, but this implies that we do keep track of it.
>>> > >>
>>> > >>
>>> > >> while (true)
>>> > >> do
>>> > >>  rmmod nvme
>>> > >>  modprobe nvme
>>> > >> done
>>> > >>
>>> > >> Things go boom without this patch. But with this patch does this
>>> > >> still work? As in we don't run out of PIRQs?
>>> > >>
>>> > >> Thanks.
>>> > >>> This exact situation is happening in a Xen guest where multiple NVMe
>>> > >>> controllers (pci devices) are present.  The NVMe driver configures each
>>> > >>> pci device's msi/msix twice; first to configure a single vector (to
>>> > >>> talk to the controller for its configuration info), and then it disables
>>> > >>> that msi/msix and re-configures with all the msi/msix it needs.  When
>>> > >>> multiple NVMe controllers are present, this happens concurrently on all
>>> > >>> of them, and in the time between controller A calling pci_disable_msix()
>>> > >>> and then calling pci_enable_msix_range(), controller B enables its msix
>>> > >>> and gets controller A's pirq allocated from the hypervisor.  Then when
>>> > >>> controller A re-configures its msix, its first vector tries to re-use
>>> > >>> the same pirq that it had before; but that pirq was allocated to
>>> > >>> controller B, and thus the Xen event channel for controller A's re-used
>>> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the
>>> > >>> pirq mapped elsewhere.
>>> > >>>
>>> > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
>>> > >>> ---
>>> > >>>  arch/x86/pci/xen.c | 23 +++++++----------------
>>> > >>>  1 file changed, 7 insertions(+), 16 deletions(-)
>>> > >>>
>>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
>>> > >>> index bedfab9..a00a6c0 100644
>>> > >>> --- a/arch/x86/pci/xen.c
>>> > >>> +++ b/arch/x86/pci/xen.c
>>> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>>> > >>>              return 1;
>>> > >>>
>>> > >>>      for_each_pci_msi_entry(msidesc, dev) {
>>> > >>> -            __pci_read_msi_msg(msidesc, &msg);
>>> > >>> -            pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
>>> > >>> -                    ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
>>> > >>> -            if (msg.data != XEN_PIRQ_MSI_DATA ||
>>> > >>> -                xen_irq_from_pirq(pirq) < 0) {
>>> > >>> -                    pirq = xen_allocate_pirq_msi(dev, msidesc);
>>> > >>> -                    if (pirq < 0) {
>>> > >>> -                            irq = -ENODEV;
>>> > >>> -                            goto error;
>>> > >>> -                    }
>>> > >>> -                    xen_msi_compose_msg(dev, pirq, &msg);
>>> > >>> -                    __pci_write_msi_msg(msidesc, &msg);
>>> > >>> -                    dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>>> > >>> -            } else {
>>> > >>> -                    dev_dbg(&dev->dev,
>>> > >>> -                            "xen: msi already bound to pirq=%d\n", pirq);
>>> > >>> +            pirq = xen_allocate_pirq_msi(dev, msidesc);
>>> > >>> +            if (pirq < 0) {
>>> > >>> +                    irq = -ENODEV;
>>> > >>> +                    goto error;
>>> > >>>              }
>>> > >>> +            xen_msi_compose_msg(dev, pirq, &msg);
>>> > >>> +            __pci_write_msi_msg(msidesc, &msg);
>>> > >>> +            dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>>> > >>>              irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
>>> > >>>                                             (type == PCI_CAP_ID_MSI) ? nvec : 1,
>>> > >>>                                             (type == PCI_CAP_ID_MSIX) ?
>>> > >>> --
>>> > >>> 2.9.3
>>> > >>>
>>> > >
>>> > >
>>> > > _______________________________________________
>>> > > Xen-devel mailing list
>>> > > Xen-devel@lists.xen.org
>>> > > https://lists.xen.org/xen-devel
>>>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-10 18:41             ` Dan Streetman
@ 2017-01-10 19:03               ` Stefano Stabellini
  2017-01-10 21:32                 ` Dan Streetman
  0 siblings, 1 reply; 34+ messages in thread
From: Stefano Stabellini @ 2017-01-10 19:03 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Stefano Stabellini, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On Tue, 10 Jan 2017, Dan Streetman wrote:
> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote:
> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
> > <sstabellini@kernel.org> wrote:
> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
> >>> > <boris.ostrovsky@oracle.com> wrote:
> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
> >>> > >>> Do not read a pci device's msi message data to see if a pirq was
> >>> > >>> previously configured for the device's msi/msix, as the old pirq was
> >>> > >>> unmapped and may now be in use by another pci device.  The previous
> >>> > >>> pirq should never be re-used; instead a new pirq should always be
> >>> > >>> allocated from the hypervisor.
> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs
> >>> > >> as we are not reusing them?
> >>> > >
> >>> > > Don't we free the pirq when we unmap it?
> >>> >
> >>> > I think this is actually a bit worse than I initially thought.  After
> >>> > looking a bit closer, and I think there's an asymmetry with pirq
> >>> > allocation:
> >>>
> >>> Lets include Stefano,
> >>>
> >>> Thank you for digging in this! This has quite the deja-vu
> >>> feeling as I believe I hit this at some point in the past and
> >>> posted some possible ways of fixing this. But sadly I can't
> >>> find the thread.
> >>
> >> This issue seems to be caused by:
> >>
> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> >> Date:   Wed Dec 1 14:51:44 2010 +0000
> >>
> >>     xen: fix MSI setup and teardown for PV on HVM guests
> >>
> >> which was a fix to a bug:
> >>
> >>     This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
> >>     trying to enable the same MSI for the second time: the old MSI to pirq
> >>     mapping is still valid at this point but xen_hvm_setup_msi_irqs would
> >>     try to assign a new pirq anyway.
> >>     A simple way to reproduce this bug is to assign an MSI capable network
> >>     card to a PV on HVM guest, if the user brings down the corresponding
> >>     ethernet interface and up again, Linux would fail to enable MSIs on the
> >>     device.
> >>
> >> I don't remember any of the details. From the description of this bug,
> >> it seems that Xen changed behavior in the past few years: before it used
> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't
> >> have been completely freed, then reassigned to somebody else the way it
> >> is described in this email.
> >>
> >> I think we should indentify the changeset or Xen version that introduced
> >> the new behavior. If it is old enough, we might be able to just revert
> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
> >> behavior conditional to the Xen version.
> >
> > Are PT devices the only MSI-capable devices available in a Xen guest?
> > That's where I'm seeing this problem, with PT NVMe devices.

They are the main ones. It is possible to have emulated virtio devices
with emulated MSIs, for example virtio-net. They are not in the
Xen Project CI loop, though, so I wouldn't be surprised if they are
broken too.


> > I can say that on the Xen guest with NVMe PT devices I'm testing on,
> > with the patch from this thread (which essentially reverts your commit
> > above) as well as some added debug to see the pirq numbers, cycles of
> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
> > hypervisor provides essentially the same pirqs for each modprobe,
> > since they were freed by the rmmod.

I am fine with reverting the old patch, but we need to understand what
caused the change in behavior first. Maybe somebody else with a Xen PCI
passthrough setup at hand can help with that.

In the Xen code I can still see:

    case ECS_PIRQ: {
        struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);

        if ( !pirq )
            break;
        if ( !is_hvm_domain(d1) )
            pirq_guest_unbind(d1, pirq);

which means that pirq_guest_unbind should only be called on evtchn_close
if the guest is not an HVM guest.


> >>> > tl;dr:
> >>> >
> >>> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq
> >>> > allocated, and reserved in the hypervisor
> >>> >
> >>> > request_irq() -> an event channel is opened for the specific pirq, and
> >>> > maps the pirq with the hypervisor
> >>> >
> >>> > free_irq() -> the event channel is closed, and the pirq is unmapped,
> >>> > but that unmap function also frees the pirq!  The hypervisor can/will
> >>> > give it away to the next call to get_free_pirq.  However, the pci
> >>> > msi/msix data area still contains the pirq number, and the next call
> >>> > to request_irq() will re-use the pirq.
> >>> >
> >>> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor
> >>> > (it's already been freed), and it also doesn't clear anything from the
> >>> > msi/msix data area, so the pirq is still cached there.
> >>> >
> >>> >
> >>> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq
> >>> > when the event channel is closed - or at least, not to change it to
> >>> > IRQ_UNBOUND state?  And, the pci_disable_msix call should eventually
> >>> > call into something in the Xen guest kernel that actually does the
> >>> > pirq unmapping, and clear it from the msi data area (i.e.
> >>> > pci_write_msi_msg)
> >>>
> >>> The problem is that Xen changes have sailed a long long time ago.
> >>> >
> >>> > Alternately, if the hypervisor doesn't change, then the call into the
> >>> > hypervisor to actually allocate a pirq needs to move from the
> >>> > pci_enable_msix_range() call to the request_irq() call?  So that when
> >>> > the msi/msix range is enabled, it doesn't actually reserve any pirq's
> >>> > for any of the vectors; each request_irq/free_irq pair do the pirq
> >>> > allocate-and-map/unmap...
> >>>
> >>>
> >>> Or a third one: We keep an pirq->device lookup and inhibit free_irq()
> >>> from actually calling evtchn_close() until the pci_disable_msix() is done?
> >>
> >> I think that's a reasonable alternative: we mask the evtchn, but do not
> >> call xen_evtchn_close in shutdown_pirq for PV on HVM guests.
> >> Something like (not compiled, untested):
> >>
> >> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
> >> index 137bd0e..3174923 100644
> >> --- a/drivers/xen/events/events_base.c
> >> +++ b/drivers/xen/events/events_base.c
> >> @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data)
> >>                 return;
> >>
> >>         mask_evtchn(evtchn);
> >> -       xen_evtchn_close(evtchn);
> >> +       if (!xen_hvm_domain())
> >> +               xen_evtchn_close(evtchn);
> >
> > wouldn't we need to also add code to the pci_disable_msix path that
> > would actually close the evtchn?  Would this leave the evtchn around
> > forever?
> >
> >>         xen_irq_info_cleanup(info);
> >>  }
> >>
> >>
> >> We want to keep the pirq allocated to the device - not closing the
> >> evtchn seems like the right thing to do. I suggest we test for the
> >> original bug too: enable/disable the network interface of an MSI capable
> >> network card.
> >
> > I don't have a Xen hypervisor setup myself, I'm just using AWS, but
> > I'll try to test this on an instance with a SRIOV/PT nic.
> 
> I started an instance with SRIOV nics, with the latest upstream kernel
> plus this patch and debug to show the pirqs.  Taking an interface down
> and back up uses the same pirq, because the driver only does
> free_irq/request_irq, which just closes/reopens the evtchn using the
> same pirq (which is cached in the struct irq_info) - it doesn't
> disable/reenable the MSIX vectors.  Doing a complete rmmod/modprobe of
> the driver does disable and reenable the MSIX vectors, but the
> hypervisor provides the same pirqs from the get_free_pirq() call that
> were used before (since nothing else is asking for free pirqs).

Good! Thanks for testing, it's very helpful. I believe it should work
even if the hypervisor returned a different pirq though.


> >>> > longer details:
> >>> >
> >>> > The chain of function calls starts in the initial call to configure
> >>> > the msi vectors, which eventually calls __pci_enable_msix_range (or
> >>> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which
> >>> > either tries to re-use any cached pirq in the MSI data area, or (for
> >>> > the first time setup) allocates a new pirq from the hypervisor via
> >>> > PHYSDEVOP_get_free_pirq.  That pirq is then reserved from the
> >>> > hypervisor's perspective, but it's not mapped to anything in the guest
> >>> > kernel.
> >>> >
> >>> > Then, the driver calls request_irq to actually start using the irq,
> >>> > which calls __setup_irq to irq_startup to startup_pirq.  The
> >>> > startup_pirq call actually creates the evtchn and binds it to the
> >>> > previously allocated pirq via EVTCHNOP_bind_pirq.
> >>> >
> >>> > At this point, the pirq is bound to a guest kernel evtchn (and irq)
> >>> > and is in use.  But then, when the driver doesn't want it anymore, it
> >>> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that
> >>> > function closes the evtchn via EVTCHNOP_close.
> >>> >
> >>> > Inside the hypervisor, in xen/common/event_channel.c in
> >>> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq
> >>> > channel is) then it unmaps the pirq mapping via
> >>> > unmap_domain_pirq_emuirq.  This unmaps the pirq, but also puts it back
> >>> > to state IRQ_UNBOUND, which makes it available for the hypervisor to
> >>> > give away to anyone requesting a new pirq!
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > >
> >>> > > -boris
> >>> > >
> >>> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's
> >>> > >>> msi descriptor message data for each msi/msix vector it sets up, and if
> >>> > >>> it finds the vector was previously configured with a pirq, and that pirq
> >>> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
> >>> > >>> from the hypervisor.  However, that pirq was unmapped when the pci device
> >>> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given
> >>> > >>> to a different pci device.
> >>> > >> Hm, but this implies that we do keep track of it.
> >>> > >>
> >>> > >>
> >>> > >> while (true)
> >>> > >> do
> >>> > >>  rmmod nvme
> >>> > >>  modprobe nvme
> >>> > >> done
> >>> > >>
> >>> > >> Things go boom without this patch. But with this patch does this
> >>> > >> still work? As in we don't run out of PIRQs?
> >>> > >>
> >>> > >> Thanks.
> >>> > >>> This exact situation is happening in a Xen guest where multiple NVMe
> >>> > >>> controllers (pci devices) are present.  The NVMe driver configures each
> >>> > >>> pci device's msi/msix twice; first to configure a single vector (to
> >>> > >>> talk to the controller for its configuration info), and then it disables
> >>> > >>> that msi/msix and re-configures with all the msi/msix it needs.  When
> >>> > >>> multiple NVMe controllers are present, this happens concurrently on all
> >>> > >>> of them, and in the time between controller A calling pci_disable_msix()
> >>> > >>> and then calling pci_enable_msix_range(), controller B enables its msix
> >>> > >>> and gets controller A's pirq allocated from the hypervisor.  Then when
> >>> > >>> controller A re-configures its msix, its first vector tries to re-use
> >>> > >>> the same pirq that it had before; but that pirq was allocated to
> >>> > >>> controller B, and thus the Xen event channel for controller A's re-used
> >>> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the
> >>> > >>> pirq mapped elsewhere.
> >>> > >>>
> >>> > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
> >>> > >>> ---
> >>> > >>>  arch/x86/pci/xen.c | 23 +++++++----------------
> >>> > >>>  1 file changed, 7 insertions(+), 16 deletions(-)
> >>> > >>>
> >>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
> >>> > >>> index bedfab9..a00a6c0 100644
> >>> > >>> --- a/arch/x86/pci/xen.c
> >>> > >>> +++ b/arch/x86/pci/xen.c
> >>> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> >>> > >>>              return 1;
> >>> > >>>
> >>> > >>>      for_each_pci_msi_entry(msidesc, dev) {
> >>> > >>> -            __pci_read_msi_msg(msidesc, &msg);
> >>> > >>> -            pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
> >>> > >>> -                    ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> >>> > >>> -            if (msg.data != XEN_PIRQ_MSI_DATA ||
> >>> > >>> -                xen_irq_from_pirq(pirq) < 0) {
> >>> > >>> -                    pirq = xen_allocate_pirq_msi(dev, msidesc);
> >>> > >>> -                    if (pirq < 0) {
> >>> > >>> -                            irq = -ENODEV;
> >>> > >>> -                            goto error;
> >>> > >>> -                    }
> >>> > >>> -                    xen_msi_compose_msg(dev, pirq, &msg);
> >>> > >>> -                    __pci_write_msi_msg(msidesc, &msg);
> >>> > >>> -                    dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
> >>> > >>> -            } else {
> >>> > >>> -                    dev_dbg(&dev->dev,
> >>> > >>> -                            "xen: msi already bound to pirq=%d\n", pirq);
> >>> > >>> +            pirq = xen_allocate_pirq_msi(dev, msidesc);
> >>> > >>> +            if (pirq < 0) {
> >>> > >>> +                    irq = -ENODEV;
> >>> > >>> +                    goto error;
> >>> > >>>              }
> >>> > >>> +            xen_msi_compose_msg(dev, pirq, &msg);
> >>> > >>> +            __pci_write_msi_msg(msidesc, &msg);
> >>> > >>> +            dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
> >>> > >>>              irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
> >>> > >>>                                             (type == PCI_CAP_ID_MSI) ? nvec : 1,
> >>> > >>>                                             (type == PCI_CAP_ID_MSIX) ?
> >>> > >>> --
> >>> > >>> 2.9.3
> >>> > >>>
> >>> > >
> >>> > >
> >>> > > _______________________________________________
> >>> > > Xen-devel mailing list
> >>> > > Xen-devel@lists.xen.org
> >>> > > https://lists.xen.org/xen-devel
> >>>
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-10 19:03               ` Stefano Stabellini
@ 2017-01-10 21:32                 ` Dan Streetman
  2017-01-10 23:28                   ` Boris Ostrovsky
  2017-01-11  1:25                   ` Stefano Stabellini
  0 siblings, 2 replies; 34+ messages in thread
From: Dan Streetman @ 2017-01-10 21:32 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, Dan Streetman,
	Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini
<sstabellini@kernel.org> wrote:
> On Tue, 10 Jan 2017, Dan Streetman wrote:
>> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote:
>> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
>> > <sstabellini@kernel.org> wrote:
>> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
>> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
>> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
>> >>> > <boris.ostrovsky@oracle.com> wrote:
>> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
>> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
>> >>> > >>> Do not read a pci device's msi message data to see if a pirq was
>> >>> > >>> previously configured for the device's msi/msix, as the old pirq was
>> >>> > >>> unmapped and may now be in use by another pci device.  The previous
>> >>> > >>> pirq should never be re-used; instead a new pirq should always be
>> >>> > >>> allocated from the hypervisor.
>> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs
>> >>> > >> as we are not reusing them?
>> >>> > >
>> >>> > > Don't we free the pirq when we unmap it?
>> >>> >
>> >>> > I think this is actually a bit worse than I initially thought.  After
>> >>> > looking a bit closer, and I think there's an asymmetry with pirq
>> >>> > allocation:
>> >>>
>> >>> Lets include Stefano,
>> >>>
>> >>> Thank you for digging in this! This has quite the deja-vu
>> >>> feeling as I believe I hit this at some point in the past and
>> >>> posted some possible ways of fixing this. But sadly I can't
>> >>> find the thread.
>> >>
>> >> This issue seems to be caused by:
>> >>
>> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
>> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>> >> Date:   Wed Dec 1 14:51:44 2010 +0000
>> >>
>> >>     xen: fix MSI setup and teardown for PV on HVM guests
>> >>
>> >> which was a fix to a bug:
>> >>
>> >>     This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
>> >>     trying to enable the same MSI for the second time: the old MSI to pirq
>> >>     mapping is still valid at this point but xen_hvm_setup_msi_irqs would
>> >>     try to assign a new pirq anyway.
>> >>     A simple way to reproduce this bug is to assign an MSI capable network
>> >>     card to a PV on HVM guest, if the user brings down the corresponding
>> >>     ethernet interface and up again, Linux would fail to enable MSIs on the
>> >>     device.
>> >>
>> >> I don't remember any of the details. From the description of this bug,
>> >> it seems that Xen changed behavior in the past few years: before it used
>> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
>> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't
>> >> have been completely freed, then reassigned to somebody else the way it
>> >> is described in this email.
>> >>
>> >> I think we should indentify the changeset or Xen version that introduced
>> >> the new behavior. If it is old enough, we might be able to just revert
>> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
>> >> behavior conditional to the Xen version.
>> >
>> > Are PT devices the only MSI-capable devices available in a Xen guest?
>> > That's where I'm seeing this problem, with PT NVMe devices.
>
> They are the main ones. It is possible to have emulated virtio devices
> with emulated MSIs, for example virtio-net. Althought they are not in
> the Xen Project CI-loop, so I wouldn't be surprised if they are broken
> too.
>
>
>> > I can say that on the Xen guest with NVMe PT devices I'm testing on,
>> > with the patch from this thread (which essentially reverts your commit
>> > above) as well as some added debug to see the pirq numbers, cycles of
>> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
>> > hypervisor provides essentially the same pirqs for each modprobe,
>> > since they were freed by the rmmod.
>
> I am fine with reverting the old patch, but we need to understand what
> caused the change in behavior first. Maybe somebody else with a Xen PCI
> passthrough setup at hand can help with that.
>
> In the Xen code I can still see:
>
>     case ECS_PIRQ: {
>         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
>
>         if ( !pirq )
>             break;
>         if ( !is_hvm_domain(d1) )
>             pirq_guest_unbind(d1, pirq);
>
> which means that pirq_guest_unbind should only be called on evtchn_close
> if the guest is not an HVM guest.

I tried an experiment to call get_free_pirq on both sides of an
evtchn_close hcall, using two SRIOV nics.  When I rmmod/modprobe, I
see something interesting: each nic uses 3 MSIs, and it looks like
when they shut down, each nic's 3 pirqs do not become available until
after the nic is done shutting down.  So it seems like closing the
evtchn isn't what makes the pirq free.
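
The shutdown_pirq side of the debug looked roughly like this (a sketch,
not the real patch; the printk text is reconstructed from the log lines
below, and the "creating evtchn" lines come from a similar printk in the
bind path):

    /* Hypothetical helper: probe which pirq the hypervisor would hand
     * out next, using the same PHYSDEVOP_get_free_pirq call that
     * xen_allocate_pirq_msi() uses. */
    static int dbg_get_free_pirq(void)
    {
            struct physdev_get_free_pirq op = { .type = MAP_PIRQ_TYPE_MSI };

            if (HYPERVISOR_physdev_op(PHYSDEVOP_get_free_pirq, &op))
                    return -1;
            return op.pirq;
    }

    static void shutdown_pirq(struct irq_data *data)
    {
            unsigned int evtchn = evtchn_from_irq(data->irq);
            struct irq_info *info = info_for_irq(data->irq);

            if (!VALID_EVTCHN(evtchn))
                    return;

            /* probe before the close, and log which pirq we are closing */
            pr_info("shutdown_pirq: get_free_pirq returned pirq %d\n",
                    dbg_get_free_pirq());
            pr_info("shutdown_pirq: closing evtchn for pirq %d irq %d\n",
                    pirq_from_irq(data->irq), data->irq);

            mask_evtchn(evtchn);
            xen_evtchn_close(evtchn);

            /* probe again after the close */
            pr_info("shutdown_pirq: get_free_pirq returned pirq %d\n",
                    dbg_get_free_pirq());

            xen_irq_info_cleanup(info);
    }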

[3697700.390188] xen:events: creating evtchn using pirq 101 irq 290
[3697700.390214] xen:events: creating evtchn using pirq 100 irq 291
[3697700.390228] xen:events: creating evtchn using pirq 99 irq 292
[3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps
[3697700.406167] xen:events: creating evtchn using pirq 98 irq 293
[3697700.406222] xen:events: creating evtchn using pirq 97 irq 294
[3697700.406259] xen:events: creating evtchn using pirq 96 irq 295
[3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps

nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98.

[3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1,
xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() ==
0, xen_pvh_domain() == 0

just to be sure, a bit of dbg so I know what domain this is :-)

[3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93
[3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295
[3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92
[3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91
[3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294
[3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90
[3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89
[3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293
[3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88

nic 4 is shut down first, and closes its evtchns for pirqs 96-98; but
none of those become available to get_free_pirq.

[3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98

aha, now nic 4 has fully finished shutting down, and nic 3 has started
shutdown; we see those pirqs from nic 4 are now available.  So it must
not be evtchn closing that frees the pirqs.

[3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292
[3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97
[3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96
[3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291
[3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87
[3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86
[3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290
[3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85


So, since pci_disable_msix eventually calls xen_teardown_msi_irq(),
which calls xen_destroy_irq(), I moved the dbg to xen_destroy_irq()
(and recompiled/rebooted) and found that the pirqs have already been
freed before it is called:
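
Roughly like this (again a sketch, reusing the dbg_get_free_pirq()
helper from the sketch above; the rest of xen_destroy_irq() is
unchanged):

    int xen_destroy_irq(int irq)
    {
            struct irq_info *info = info_for_irq(irq);

            /* probe the free-pirq pool and log which pirq/irq we destroy */
            pr_info("xen_destroy_irq: get_free_pirq returned pirq %d\n",
                    dbg_get_free_pirq());
            pr_info("xen_destroy_irq: pirq %d irq %d\n",
                    info->u.pirq.pirq, irq);

            /* ... original xen_destroy_irq() body ... */
            return 0;
    }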

[3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295
[3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294
[3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293
[3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100
[3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293
[3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99
[3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98
[3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294
[3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97
[3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96
[3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295
[3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95


I'm still following through the kernel code, but it's not immediately
obvious what exactly is telling the hypervisor to free the pirqs; any
idea?

From the hypervisor code, it seems that the pirq is "available" via
is_free_pirq():
    return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
        pirq->arch.hvm.emuirq == IRQ_UNBOUND));

when the evtchn is closed, it does:
        if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
            unmap_domain_pirq_emuirq(d1, pirq->pirq);

and that call to unmap_domain_pirq_emuirq does:
        info->arch.hvm.emuirq = IRQ_UNBOUND;

So, the only thing left is to clear pirq->arch.irq, but the only place
I can find that does that is clear_domain_irq_pirq(), which is only
called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not
seeing where either of those would be called when all the kernel is
doing is disabling a pci device.







>
>
>> >>> > tl;dr:
>> >>> >
>> >>> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq
>> >>> > allocated, and reserved in the hypervisor
>> >>> >
>> >>> > request_irq() -> an event channel is opened for the specific pirq, and
>> >>> > maps the pirq with the hypervisor
>> >>> >
>> >>> > free_irq() -> the event channel is closed, and the pirq is unmapped,
>> >>> > but that unmap function also frees the pirq!  The hypervisor can/will
>> >>> > give it away to the next call to get_free_pirq.  However, the pci
>> >>> > msi/msix data area still contains the pirq number, and the next call
>> >>> > to request_irq() will re-use the pirq.
>> >>> >
>> >>> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor
>> >>> > (it's already been freed), and it also doesn't clear anything from the
>> >>> > msi/msix data area, so the pirq is still cached there.
>> >>> >
>> >>> >
>> >>> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq
>> >>> > when the event channel is closed - or at least, not to change it to
>> >>> > IRQ_UNBOUND state?  And, the pci_disable_msix call should eventually
>> >>> > call into something in the Xen guest kernel that actually does the
>> >>> > pirq unmapping, and clear it from the msi data area (i.e.
>> >>> > pci_write_msi_msg)
>> >>>
>> >>> The problem is that Xen changes have sailed a long long time ago.
>> >>> >
>> >>> > Alternately, if the hypervisor doesn't change, then the call into the
>> >>> > hypervisor to actually allocate a pirq needs to move from the
>> >>> > pci_enable_msix_range() call to the request_irq() call?  So that when
>> >>> > the msi/msix range is enabled, it doesn't actually reserve any pirq's
>> >>> > for any of the vectors; each request_irq/free_irq pair do the pirq
>> >>> > allocate-and-map/unmap...
>> >>>
>> >>>
>> >>> Or a third one: We keep an pirq->device lookup and inhibit free_irq()
>> >>> from actually calling evtchn_close() until the pci_disable_msix() is done?
>> >>
>> >> I think that's a reasonable alternative: we mask the evtchn, but do not
>> >> call xen_evtchn_close in shutdown_pirq for PV on HVM guests.
>> >> Something like (not compiled, untested):
>> >>
>> >> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
>> >> index 137bd0e..3174923 100644
>> >> --- a/drivers/xen/events/events_base.c
>> >> +++ b/drivers/xen/events/events_base.c
>> >> @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data)
>> >>                 return;
>> >>
>> >>         mask_evtchn(evtchn);
>> >> -       xen_evtchn_close(evtchn);
>> >> +       if (!xen_hvm_domain())
>> >> +               xen_evtchn_close(evtchn);
>> >
>> > wouldn't we need to also add code to the pci_disable_msix path that
>> > would actually close the evtchn?  Would this leave the evtchn around
>> > forever?
>> >
>> >>         xen_irq_info_cleanup(info);
>> >>  }
>> >>
>> >>
>> >> We want to keep the pirq allocated to the device - not closing the
>> >> evtchn seems like the right thing to do. I suggest we test for the
>> >> original bug too: enable/disable the network interface of an MSI capable
>> >> network card.
>> >
>> > I don't have a Xen hypervisor setup myself, I'm just using AWS, but
>> > I'll try to test this on an instance with a SRIOV/PT nic.
>>
>> I started an instance with SRIOV nics, with the latest upstream kernel
>> plus this patch and debug to show the pirqs.  Taking an interface down
>> and back up uses the same pirq, because the driver only does
>> free_irq/request_irq, which just closes/reopens the evtchn using the
>> same pirq (which is cached in the struct irq_info) - it doesn't
>> disable/reenable the MSIX vectors.  Doing a complete rmmod/modprobe of
>> the driver does disable and reenable the MSIX vectors, but the
>> hypervisor provides the same pirqs from the get_free_pirq() call that
>> were used before (since nothing else is asking for free pirqs).
>
> Good! Thanks for testing, it's very helpful. I believe it should work
> even if the hypervisor returned a different pirq though.
>
>
>> >>> > longer details:
>> >>> >
>> >>> > The chain of function calls starts in the initial call to configure
>> >>> > the msi vectors, which eventually calls __pci_enable_msix_range (or
>> >>> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which
>> >>> > either tries to re-use any cached pirq in the MSI data area, or (for
>> >>> > the first time setup) allocates a new pirq from the hypervisor via
>> >>> > PHYSDEVOP_get_free_pirq.  That pirq is then reserved from the
>> >>> > hypervisor's perspective, but it's not mapped to anything in the guest
>> >>> > kernel.
>> >>> >
>> >>> > Then, the driver calls request_irq to actually start using the irq,
>> >>> > which calls __setup_irq to irq_startup to startup_pirq.  The
>> >>> > startup_pirq call actually creates the evtchn and binds it to the
>> >>> > previously allocated pirq via EVTCHNOP_bind_pirq.
>> >>> >
>> >>> > At this point, the pirq is bound to a guest kernel evtchn (and irq)
>> >>> > and is in use.  But then, when the driver doesn't want it anymore, it
>> >>> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that
>> >>> > function closes the evtchn via EVTCHNOP_close.
>> >>> >
>> >>> > Inside the hypervisor, in xen/common/event_channel.c in
>> >>> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq
>> >>> > channel is) then it unmaps the pirq mapping via
>> >>> > unmap_domain_pirq_emuirq.  This unmaps the pirq, but also puts it back
>> >>> > to state IRQ_UNBOUND, which makes it available for the hypervisor to
>> >>> > give away to anyone requesting a new pirq!
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > >
>> >>> > > -boris
>> >>> > >
>> >>> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's
>> >>> > >>> msi descriptor message data for each msi/msix vector it sets up, and if
>> >>> > >>> it finds the vector was previously configured with a pirq, and that pirq
>> >>> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
>> >>> > >>> from the hypervisor.  However, that pirq was unmapped when the pci device
>> >>> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given
>> >>> > >>> to a different pci device.
>> >>> > >> Hm, but this implies that we do keep track of it.
>> >>> > >>
>> >>> > >>
>> >>> > >> while (true)
>> >>> > >> do
>> >>> > >>  rmmod nvme
>> >>> > >>  modprobe nvme
>> >>> > >> done
>> >>> > >>
>> >>> > >> Things go boom without this patch. But with this patch does this
>> >>> > >> still work? As in we don't run out of PIRQs?
>> >>> > >>
>> >>> > >> Thanks.
>> >>> > >>> This exact situation is happening in a Xen guest where multiple NVMe
>> >>> > >>> controllers (pci devices) are present.  The NVMe driver configures each
>> >>> > >>> pci device's msi/msix twice; first to configure a single vector (to
>> >>> > >>> talk to the controller for its configuration info), and then it disables
>> >>> > >>> that msi/msix and re-configures with all the msi/msix it needs.  When
>> >>> > >>> multiple NVMe controllers are present, this happens concurrently on all
>> >>> > >>> of them, and in the time between controller A calling pci_disable_msix()
>> >>> > >>> and then calling pci_enable_msix_range(), controller B enables its msix
>> >>> > >>> and gets controller A's pirq allocated from the hypervisor.  Then when
>> >>> > >>> controller A re-configures its msix, its first vector tries to re-use
>> >>> > >>> the same pirq that it had before; but that pirq was allocated to
>> >>> > >>> controller B, and thus the Xen event channel for controller A's re-used
>> >>> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the
>> >>> > >>> pirq mapped elsewhere.
>> >>> > >>>
>> >>> > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
>> >>> > >>> ---
>> >>> > >>>  arch/x86/pci/xen.c | 23 +++++++----------------
>> >>> > >>>  1 file changed, 7 insertions(+), 16 deletions(-)
>> >>> > >>>
>> >>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
>> >>> > >>> index bedfab9..a00a6c0 100644
>> >>> > >>> --- a/arch/x86/pci/xen.c
>> >>> > >>> +++ b/arch/x86/pci/xen.c
>> >>> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>> >>> > >>>              return 1;
>> >>> > >>>
>> >>> > >>>      for_each_pci_msi_entry(msidesc, dev) {
>> >>> > >>> -            __pci_read_msi_msg(msidesc, &msg);
>> >>> > >>> -            pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
>> >>> > >>> -                    ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
>> >>> > >>> -            if (msg.data != XEN_PIRQ_MSI_DATA ||
>> >>> > >>> -                xen_irq_from_pirq(pirq) < 0) {
>> >>> > >>> -                    pirq = xen_allocate_pirq_msi(dev, msidesc);
>> >>> > >>> -                    if (pirq < 0) {
>> >>> > >>> -                            irq = -ENODEV;
>> >>> > >>> -                            goto error;
>> >>> > >>> -                    }
>> >>> > >>> -                    xen_msi_compose_msg(dev, pirq, &msg);
>> >>> > >>> -                    __pci_write_msi_msg(msidesc, &msg);
>> >>> > >>> -                    dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>> >>> > >>> -            } else {
>> >>> > >>> -                    dev_dbg(&dev->dev,
>> >>> > >>> -                            "xen: msi already bound to pirq=%d\n", pirq);
>> >>> > >>> +            pirq = xen_allocate_pirq_msi(dev, msidesc);
>> >>> > >>> +            if (pirq < 0) {
>> >>> > >>> +                    irq = -ENODEV;
>> >>> > >>> +                    goto error;
>> >>> > >>>              }
>> >>> > >>> +            xen_msi_compose_msg(dev, pirq, &msg);
>> >>> > >>> +            __pci_write_msi_msg(msidesc, &msg);
>> >>> > >>> +            dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>> >>> > >>>              irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
>> >>> > >>>                                             (type == PCI_CAP_ID_MSI) ? nvec : 1,
>> >>> > >>>                                             (type == PCI_CAP_ID_MSIX) ?
>> >>> > >>> --
>> >>> > >>> 2.9.3
>> >>> > >>>
>> >>> > >
>> >>> > >
>> >>> > > _______________________________________________
>> >>> > > Xen-devel mailing list
>> >>> > > Xen-devel@lists.xen.org
>> >>> > > https://lists.xen.org/xen-devel
>> >>>
>>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-10 21:32                 ` Dan Streetman
@ 2017-01-10 23:28                   ` Boris Ostrovsky
  2017-01-11  1:25                   ` Stefano Stabellini
  1 sibling, 0 replies; 34+ messages in thread
From: Boris Ostrovsky @ 2017-01-10 23:28 UTC (permalink / raw)
  To: Dan Streetman, Stefano Stabellini
  Cc: Konrad Rzeszutek Wilk, Dan Streetman, Bjorn Helgaas, xen-devel,
	linux-kernel, linux-pci


>> In the Xen code I can still see:
>>
>>     case ECS_PIRQ: {
>>         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
>>
>>         if ( !pirq )
>>             break;
>>         if ( !is_hvm_domain(d1) )
>>             pirq_guest_unbind(d1, pirq);
>>
>> which means that pirq_guest_unbind should only be called on evtchn_close
>> if the guest is not an HVM guest.

A few lines further down we have:

        pirq->evtchn = 0;
        pirq_cleanup_check(pirq, d1);
        unlink_pirq_port(chn1, d1->vcpu[chn1->notify_vcpu_id]);
#ifdef CONFIG_X86
        if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
            unmap_domain_pirq_emuirq(d1, pirq->pirq);
#endif

which is where we free up the pirq.
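
Putting the two fragments together, the ECS_PIRQ arm of evtchn_close()
is roughly (abridged from memory of xen/common/event_channel.c; only
the lines quoted above are verbatim):

    case ECS_PIRQ: {
        struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);

        if ( !pirq )
            break;
        if ( !is_hvm_domain(d1) )
            pirq_guest_unbind(d1, pirq);
        /* ... */
        pirq->evtchn = 0;
        pirq_cleanup_check(pirq, d1);
        unlink_pirq_port(chn1, d1->vcpu[chn1->notify_vcpu_id]);
#ifdef CONFIG_X86
        if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
            unmap_domain_pirq_emuirq(d1, pirq->pirq);
#endif
        break;
    }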

-boris

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-10 21:32                 ` Dan Streetman
  2017-01-10 23:28                   ` Boris Ostrovsky
@ 2017-01-11  1:25                   ` Stefano Stabellini
  2017-01-11 15:26                     ` Dan Streetman
  1 sibling, 1 reply; 34+ messages in thread
From: Stefano Stabellini @ 2017-01-11  1:25 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Stefano Stabellini, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On Tue, 10 Jan 2017, Dan Streetman wrote:
> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini
> <sstabellini@kernel.org> wrote:
> > On Tue, 10 Jan 2017, Dan Streetman wrote:
> >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote:
> >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
> >> > <sstabellini@kernel.org> wrote:
> >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
> >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
> >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
> >> >>> > <boris.ostrovsky@oracle.com> wrote:
> >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
> >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
> >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was
> >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was
> >> >>> > >>> unmapped and may now be in use by another pci device.  The previous
> >> >>> > >>> pirq should never be re-used; instead a new pirq should always be
> >> >>> > >>> allocated from the hypervisor.
> >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs
> >> >>> > >> as we are not reusing them?
> >> >>> > >
> >> >>> > > Don't we free the pirq when we unmap it?
> >> >>> >
> >> >>> > I think this is actually a bit worse than I initially thought.  After
> >> >>> > looking a bit closer, and I think there's an asymmetry with pirq
> >> >>> > allocation:
> >> >>>
> >> >>> Lets include Stefano,
> >> >>>
> >> >>> Thank you for digging in this! This has quite the deja-vu
> >> >>> feeling as I believe I hit this at some point in the past and
> >> >>> posted some possible ways of fixing this. But sadly I can't
> >> >>> find the thread.
> >> >>
> >> >> This issue seems to be caused by:
> >> >>
> >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
> >> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> >> >> Date:   Wed Dec 1 14:51:44 2010 +0000
> >> >>
> >> >>     xen: fix MSI setup and teardown for PV on HVM guests
> >> >>
> >> >> which was a fix to a bug:
> >> >>
> >> >>     This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
> >> >>     trying to enable the same MSI for the second time: the old MSI to pirq
> >> >>     mapping is still valid at this point but xen_hvm_setup_msi_irqs would
> >> >>     try to assign a new pirq anyway.
> >> >>     A simple way to reproduce this bug is to assign an MSI capable network
> >> >>     card to a PV on HVM guest, if the user brings down the corresponding
> >> >>     ethernet interface and up again, Linux would fail to enable MSIs on the
> >> >>     device.
> >> >>
> >> >> I don't remember any of the details. From the description of this bug,
> >> >> it seems that Xen changed behavior in the past few years: before it used
> >> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
> >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't
> >> >> have been completely freed, then reassigned to somebody else the way it
> >> >> is described in this email.
> >> >>
> >> >> I think we should indentify the changeset or Xen version that introduced
> >> >> the new behavior. If it is old enough, we might be able to just revert
> >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
> >> >> behavior conditional to the Xen version.
> >> >
> >> > Are PT devices the only MSI-capable devices available in a Xen guest?
> >> > That's where I'm seeing this problem, with PT NVMe devices.
> >
> > They are the main ones. It is possible to have emulated virtio devices
> > with emulated MSIs, for example virtio-net. Althought they are not in
> > the Xen Project CI-loop, so I wouldn't be surprised if they are broken
> > too.
> >
> >
> >> > I can say that on the Xen guest with NVMe PT devices I'm testing on,
> >> > with the patch from this thread (which essentially reverts your commit
> >> > above) as well as some added debug to see the pirq numbers, cycles of
> >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
> >> > hypervisor provides essentially the same pirqs for each modprobe,
> >> > since they were freed by the rmmod.
> >
> > I am fine with reverting the old patch, but we need to understand what
> > caused the change in behavior first. Maybe somebody else with a Xen PCI
> > passthrough setup at hand can help with that.
> >
> > In the Xen code I can still see:
> >
> >     case ECS_PIRQ: {
> >         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
> >
> >         if ( !pirq )
> >             break;
> >         if ( !is_hvm_domain(d1) )
> >             pirq_guest_unbind(d1, pirq);
> >
> > which means that pirq_guest_unbind should only be called on evtchn_close
> > if the guest is not an HVM guest.
> 
> I tried an experiment to call get_free_pirq on both sides of a
> evtchn_close hcall, using two SRIOV nics.  When I rmmod/modprobe, I
> see something interesting; each nic uses 3 MSIs, and it looks like
> when they shut down, each nic's 3 pirqs are not available until after
> the nic is done shutting down, so it seems like closing the evtchn
> isn't what makes the pirq free.
> 
> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290
> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291
> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292
> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps
> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293
> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294
> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295
> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps
> 
> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98.
> 
> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1,
> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() ==
> 0, xen_pvh_domain() == 0
> 
> just to be sure, a bit of dbg so I know what domain this is :-)
> 
> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93
> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295
> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92
> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91
> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294
> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90
> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89
> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293
> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88
> 
> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but
> none of those become available for get_free_pirq.
> 
> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98
> 
> aha, now nic 4 has fully finished shutting down, and nic 3 has started
> shutdown; we see those pirqs from nic 4 are now available.  So it must
> not be evtchn closing that frees the pirqs.
> 
> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292
> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97
> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96
> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291
> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87
> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86
> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290
> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85
> 
> 
> so, since pci_disable_msix eventually calls xen_teardown_msi_irq()
> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq()
> (and recompiled/rebooted) and found the pirqs have already been freed
> before that is called:
> 
> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295
> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294
> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293
> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100
> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293
> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99
> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98
> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294
> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97
> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96
> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295
> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95
> 
> 
> I'm still following thru the kernel code, but it's not immediately
> obvious what exactly is telling the hypervisor to free the pirqs; any
> idea?
> 
> From the hypervisor code, it seems that the pirq is "available" via
> is_free_pirq():
>     return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
>         pirq->arch.hvm.emuirq == IRQ_UNBOUND));
> 
> when the evtchn is closed, it does:
>         if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
>             unmap_domain_pirq_emuirq(d1, pirq->pirq);
> 
> and that call to unmap_domain_pirq_emuirq does:
>         info->arch.hvm.emuirq = IRQ_UNBOUND;
> 
> so, the only thing left is to clear pirq->arch.irq, but the only place
> I can find that does that is clear_domain_irq_pirq(), which is only
> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not
> seeing where either of those would be called when all the kernel is
> doing is disabling a pci device.

Thanks for the info. I think I know what causes the pirq to be unmapped:
when Linux disables msi or msix on the device, using the regular pci
config space based method, QEMU (which emulates the PCI config space)
tells Xen to unmap the pirq.

If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs
to be called a second time without msis being disabled first, then I
think we can revert the patch.
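
To illustrate what that config space based method amounts to on the guest
side, here is a minimal sketch (the function name is made up and this is not
the actual kernel code; the real path is pci_disable_msix()).  The point is
that disabling MSI-X is just a config space write clearing the enable bit in
the MSI-X Message Control register, and in an HVM guest that write is trapped
by QEMU:

    #include <linux/pci.h>

    /*
     * Sketch only: disabling MSI-X boils down to clearing the enable bit
     * in the MSI-X Message Control register.  QEMU intercepts this config
     * space write and, per the theory above, tells Xen to unmap the pirqs
     * backing the vectors.
     */
    static void sketch_msix_disable(struct pci_dev *dev)
    {
            u16 ctrl;

            if (!dev->msix_cap)
                    return;

            pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &ctrl);
            ctrl &= ~PCI_MSIX_FLAGS_ENABLE;
            /* This is the write QEMU sees and turns into a pirq unmap. */
            pci_write_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, ctrl);
    }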

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-11  1:25                   ` Stefano Stabellini
@ 2017-01-11 15:26                     ` Dan Streetman
  2017-01-11 18:46                       ` Stefano Stabellini
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Streetman @ 2017-01-11 15:26 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, Dan Streetman,
	Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini
<sstabellini@kernel.org> wrote:
> On Tue, 10 Jan 2017, Dan Streetman wrote:
>> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini
>> <sstabellini@kernel.org> wrote:
>> > On Tue, 10 Jan 2017, Dan Streetman wrote:
>> >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote:
>> >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
>> >> > <sstabellini@kernel.org> wrote:
>> >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
>> >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
>> >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
>> >> >>> > <boris.ostrovsky@oracle.com> wrote:
>> >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
>> >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
>> >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was
>> >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was
>> >> >>> > >>> unmapped and may now be in use by another pci device.  The previous
>> >> >>> > >>> pirq should never be re-used; instead a new pirq should always be
>> >> >>> > >>> allocated from the hypervisor.
>> >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs
>> >> >>> > >> as we are not reusing them?
>> >> >>> > >
>> >> >>> > > Don't we free the pirq when we unmap it?
>> >> >>> >
>> >> >>> > I think this is actually a bit worse than I initially thought.  After
>> >> >>> > looking a bit closer, and I think there's an asymmetry with pirq
>> >> >>> > allocation:
>> >> >>>
>> >> >>> Let's include Stefano,
>> >> >>>
>> >> >>> Thank you for digging in this! This has quite the deja-vu
>> >> >>> feeling as I believe I hit this at some point in the past and
>> >> >>> posted some possible ways of fixing this. But sadly I can't
>> >> >>> find the thread.
>> >> >>
>> >> >> This issue seems to be caused by:
>> >> >>
>> >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
>> >> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>> >> >> Date:   Wed Dec 1 14:51:44 2010 +0000
>> >> >>
>> >> >>     xen: fix MSI setup and teardown for PV on HVM guests
>> >> >>
>> >> >> which was a fix to a bug:
>> >> >>
>> >> >>     This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
>> >> >>     trying to enable the same MSI for the second time: the old MSI to pirq
>> >> >>     mapping is still valid at this point but xen_hvm_setup_msi_irqs would
>> >> >>     try to assign a new pirq anyway.
>> >> >>     A simple way to reproduce this bug is to assign an MSI capable network
>> >> >>     card to a PV on HVM guest, if the user brings down the corresponding
>> >> >>     ethernet interface and up again, Linux would fail to enable MSIs on the
>> >> >>     device.
>> >> >>
>> >> >> I don't remember any of the details. From the description of this bug,
>> >> >> it seems that Xen changed behavior in the past few years: before it used
>> >> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
>> >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't
>> >> >> have been completely freed, then reassigned to somebody else the way it
>> >> >> is described in this email.
>> >> >>
>> >> >> I think we should identify the changeset or Xen version that introduced
>> >> >> the new behavior. If it is old enough, we might be able to just revert
>> >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
>> >> >> behavior conditional to the Xen version.
>> >> >
>> >> > Are PT devices the only MSI-capable devices available in a Xen guest?
>> >> > That's where I'm seeing this problem, with PT NVMe devices.
>> >
>> > They are the main ones. It is possible to have emulated virtio devices
>> > with emulated MSIs, for example virtio-net. Although they are not in
>> > the Xen Project CI-loop, so I wouldn't be surprised if they are broken
>> > too.
>> >
>> >
>> >> > I can say that on the Xen guest with NVMe PT devices I'm testing on,
>> >> > with the patch from this thread (which essentially reverts your commit
>> >> > above) as well as some added debug to see the pirq numbers, cycles of
>> >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
>> >> > hypervisor provides essentially the same pirqs for each modprobe,
>> >> > since they were freed by the rmmod.
>> >
>> > I am fine with reverting the old patch, but we need to understand what
>> > caused the change in behavior first. Maybe somebody else with a Xen PCI
>> > passthrough setup at hand can help with that.
>> >
>> > In the Xen code I can still see:
>> >
>> >     case ECS_PIRQ: {
>> >         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
>> >
>> >         if ( !pirq )
>> >             break;
>> >         if ( !is_hvm_domain(d1) )
>> >             pirq_guest_unbind(d1, pirq);
>> >
>> > which means that pirq_guest_unbind should only be called on evtchn_close
>> > if the guest is not an HVM guest.
>>
>> I tried an experiment to call get_free_pirq on both sides of an
>> evtchn_close hcall, using two SRIOV nics.  When I rmmod/modprobe, I
>> see something interesting; each nic uses 3 MSIs, and it looks like
>> when they shut down, each nic's 3 pirqs are not available until after
>> the nic is done shutting down, so it seems like closing the evtchn
>> isn't what makes the pirq free.
>>
>> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290
>> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291
>> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292
>> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps
>> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293
>> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294
>> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295
>> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps
>>
>> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98.
>>
>> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1,
>> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() ==
>> 0, xen_pvh_domain() == 0
>>
>> just to be sure, a bit of dbg so I know what domain this is :-)
>>
>> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93
>> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295
>> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92
>> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91
>> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294
>> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90
>> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89
>> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293
>> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88
>>
>> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but
>> none of those become available for get_free_pirq.
>>
>> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98
>>
>> aha, now nic 4 has fully finished shutting down, and nic 3 has started
>> shutdown; we see those pirqs from nic 4 are now available.  So it must
>> not be evtchn closing that frees the pirqs.
>>
>> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292
>> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97
>> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96
>> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291
>> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87
>> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86
>> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290
>> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85
>>
>>
>> so, since pci_disable_msix eventually calls xen_teardown_msi_irq()
>> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq()
>> (and recompiled/rebooted) and found the pirqs have already been freed
>> before that is called:
>>
>> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295
>> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294
>> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293
>> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100
>> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293
>> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99
>> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98
>> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294
>> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97
>> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96
>> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295
>> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95
>>
>>
>> I'm still following thru the kernel code, but it's not immediately
>> obvious what exactly is telling the hypervisor to free the pirqs; any
>> idea?
>>
>> From the hypervisor code, it seems that the pirq is "available" via
>> is_free_pirq():
>>     return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
>>         pirq->arch.hvm.emuirq == IRQ_UNBOUND));
>>
>> when the evtchn is closed, it does:
>>         if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
>>             unmap_domain_pirq_emuirq(d1, pirq->pirq);
>>
>> and that call to unmap_domain_pirq_emuirq does:
>>         info->arch.hvm.emuirq = IRQ_UNBOUND;
>>
>> so, the only thing left is to clear pirq->arch.irq, but the only place
>> I can find that does that is clear_domain_irq_pirq(), which is only
>> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not
>> seeing where either of those would be called when all the kernel is
>> doing is disabling a pci device.
>
> Thanks for the info. I think I know what causes the pirq to be unmapped:
> when Linux disables msi or msix on the device, using the regular pci
> config space based method, QEMU (which emulates the PCI config space)
> tells Xen to unmap the pirq.

aha, via a XEN_DOMCTL_unbind_pt_irq, maybe?  Well that makes more sense then.

>
> If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs
> to be called a second time without msis being disabled first, then I
> think we can revert the patch.

It doesn't seem possible to call it twice from a correctly-behaved
driver, but in case of a driver bug that does try to enable msi/msix
multiple times without disabling, __pci_enable_msix() only does
WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does
WARN_ON(!!dev->msi_enabled); they both will continue.  Maybe that
should be changed to warn and also return error, to prevent
re-configuring msi/msix if already configured?  Or, maybe the warning
is enough - the worst thing that reverting the patch does is use extra
pirqs, right?
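
As a sketch only of what I mean by "warn and also return error" (the helper
name is made up; this is not the actual drivers/pci/msi.c code, and the
return value is picked arbitrarily for the sketch):

    #include <linux/pci.h>

    /*
     * Sketch: warn *and* bail out if msi/msix is already enabled, instead
     * of only warning and continuing the way __pci_enable_msix() and
     * __pci_enable_msi_range() do today.
     */
    static int sketch_check_not_enabled(struct pci_dev *dev, int type)
    {
            if (type == PCI_CAP_ID_MSIX && WARN_ON_ONCE(dev->msix_enabled))
                    return -EBUSY;
            if (type == PCI_CAP_ID_MSI && WARN_ON_ONCE(dev->msi_enabled))
                    return -EBUSY;
            return 0;
    }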

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-11 15:26                     ` Dan Streetman
@ 2017-01-11 18:46                       ` Stefano Stabellini
  2017-01-11 23:25                         ` Dan Streetman
  0 siblings, 1 reply; 34+ messages in thread
From: Stefano Stabellini @ 2017-01-11 18:46 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Stefano Stabellini, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On Wed, 11 Jan 2017, Dan Streetman wrote:
> On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini
> <sstabellini@kernel.org> wrote:
> > On Tue, 10 Jan 2017, Dan Streetman wrote:
> >> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini
> >> <sstabellini@kernel.org> wrote:
> >> > On Tue, 10 Jan 2017, Dan Streetman wrote:
> >> >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote:
> >> >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
> >> >> > <sstabellini@kernel.org> wrote:
> >> >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
> >> >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
> >> >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
> >> >> >>> > <boris.ostrovsky@oracle.com> wrote:
> >> >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
> >> >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
> >> >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was
> >> >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was
> >> >> >>> > >>> unmapped and may now be in use by another pci device.  The previous
> >> >> >>> > >>> pirq should never be re-used; instead a new pirq should always be
> >> >> >>> > >>> allocated from the hypervisor.
> >> >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs
> >> >> >>> > >> as we are not reusing them?
> >> >> >>> > >
> >> >> >>> > > Don't we free the pirq when we unmap it?
> >> >> >>> >
> >> >> >>> > I think this is actually a bit worse than I initially thought.  After
> >> >> >>> > looking a bit closer, and I think there's an asymmetry with pirq
> >> >> >>> > allocation:
> >> >> >>>
> >> >> >>> Let's include Stefano,
> >> >> >>>
> >> >> >>> Thank you for digging in this! This has quite the deja-vu
> >> >> >>> feeling as I believe I hit this at some point in the past and
> >> >> >>> posted some possible ways of fixing this. But sadly I can't
> >> >> >>> find the thread.
> >> >> >>
> >> >> >> This issue seems to be caused by:
> >> >> >>
> >> >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
> >> >> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> >> >> >> Date:   Wed Dec 1 14:51:44 2010 +0000
> >> >> >>
> >> >> >>     xen: fix MSI setup and teardown for PV on HVM guests
> >> >> >>
> >> >> >> which was a fix to a bug:
> >> >> >>
> >> >> >>     This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
> >> >> >>     trying to enable the same MSI for the second time: the old MSI to pirq
> >> >> >>     mapping is still valid at this point but xen_hvm_setup_msi_irqs would
> >> >> >>     try to assign a new pirq anyway.
> >> >> >>     A simple way to reproduce this bug is to assign an MSI capable network
> >> >> >>     card to a PV on HVM guest, if the user brings down the corresponding
> >> >> >>     ethernet interface and up again, Linux would fail to enable MSIs on the
> >> >> >>     device.
> >> >> >>
> >> >> >> I don't remember any of the details. From the description of this bug,
> >> >> >> it seems that Xen changed behavior in the past few years: before it used
> >> >> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
> >> >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't
> >> >> >> have been completely freed, then reassigned to somebody else the way it
> >> >> >> is described in this email.
> >> >> >>
> >> >> >> I think we should identify the changeset or Xen version that introduced
> >> >> >> the new behavior. If it is old enough, we might be able to just revert
> >> >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
> >> >> >> behavior conditional to the Xen version.
> >> >> >
> >> >> > Are PT devices the only MSI-capable devices available in a Xen guest?
> >> >> > That's where I'm seeing this problem, with PT NVMe devices.
> >> >
> >> > They are the main ones. It is possible to have emulated virtio devices
> >> > with emulated MSIs, for example virtio-net. Although they are not in
> >> > the Xen Project CI-loop, so I wouldn't be surprised if they are broken
> >> > too.
> >> >
> >> >
> >> >> > I can say that on the Xen guest with NVMe PT devices I'm testing on,
> >> >> > with the patch from this thread (which essentially reverts your commit
> >> >> > above) as well as some added debug to see the pirq numbers, cycles of
> >> >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
> >> >> > hypervisor provides essentially the same pirqs for each modprobe,
> >> >> > since they were freed by the rmmod.
> >> >
> >> > I am fine with reverting the old patch, but we need to understand what
> >> > caused the change in behavior first. Maybe somebody else with a Xen PCI
> >> > passthrough setup at hand can help with that.
> >> >
> >> > In the Xen code I can still see:
> >> >
> >> >     case ECS_PIRQ: {
> >> >         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
> >> >
> >> >         if ( !pirq )
> >> >             break;
> >> >         if ( !is_hvm_domain(d1) )
> >> >             pirq_guest_unbind(d1, pirq);
> >> >
> >> > which means that pirq_guest_unbind should only be called on evtchn_close
> >> > if the guest is not an HVM guest.
> >>
> >> I tried an experiment to call get_free_pirq on both sides of an
> >> evtchn_close hcall, using two SRIOV nics.  When I rmmod/modprobe, I
> >> see something interesting; each nic uses 3 MSIs, and it looks like
> >> when they shut down, each nic's 3 pirqs are not available until after
> >> the nic is done shutting down, so it seems like closing the evtchn
> >> isn't what makes the pirq free.
> >>
> >> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290
> >> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291
> >> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292
> >> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps
> >> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293
> >> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294
> >> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295
> >> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps
> >>
> >> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98.
> >>
> >> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1,
> >> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() ==
> >> 0, xen_pvh_domain() == 0
> >>
> >> just to be sure, a bit of dbg so I know what domain this is :-)
> >>
> >> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93
> >> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295
> >> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92
> >> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91
> >> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294
> >> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90
> >> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89
> >> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293
> >> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88
> >>
> >> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but
> >> none of those become available for get_free_pirq.
> >>
> >> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98
> >>
> >> aha, now nic 4 has fully finished shutting down, and nic 3 has started
> >> shutdown; we see those pirqs from nic 4 are now available.  So it must
> >> not be evtchn closing that frees the pirqs.
> >>
> >> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292
> >> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97
> >> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96
> >> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291
> >> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87
> >> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86
> >> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290
> >> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85
> >>
> >>
> >> so, since pci_disable_msix eventually calls xen_teardown_msi_irq()
> >> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq()
> >> (and recompiled/rebooted) and found the pirqs have already been freed
> >> before that is called:
> >>
> >> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295
> >> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294
> >> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293
> >> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100
> >> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293
> >> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99
> >> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98
> >> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294
> >> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97
> >> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96
> >> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295
> >> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95
> >>
> >>
> >> I'm still following thru the kernel code, but it's not immediately
> >> obvious what exactly is telling the hypervisor to free the pirqs; any
> >> idea?
> >>
> >> From the hypervisor code, it seems that the pirq is "available" via
> >> is_free_pirq():
> >>     return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
> >>         pirq->arch.hvm.emuirq == IRQ_UNBOUND));
> >>
> >> when the evtchn is closed, it does:
> >>         if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
> >>             unmap_domain_pirq_emuirq(d1, pirq->pirq);
> >>
> >> and that call to unmap_domain_pirq_emuirq does:
> >>         info->arch.hvm.emuirq = IRQ_UNBOUND;
> >>
> >> so, the only thing left is to clear pirq->arch.irq, but the only place
> >> I can find that does that is clear_domain_irq_pirq(), which is only
> >> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not
> >> seeing where either of those would be called when all the kernel is
> >> doing is disabling a pci device.
> >
> > Thanks for the info. I think I know what causes the pirq to be unmapped:
> > when Linux disables msi or msix on the device, using the regular pci
> > config space based method, QEMU (which emulates the PCI config space)
> > tells Xen to unmap the pirq.
> 
> aha, via a XEN_DOMCTL_unbind_pt_irq, maybe?  Well that makes more sense then.
> 
> >
> > If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs
> > to be called a second time without msis being disabled first, then I
> > think we can revert the patch.
> 
> It doesn't seem possible to call it twice from a correctly-behaved
> driver, but in case of a driver bug that does try to enable msi/msix
> multiple times without disabling, __pci_enable_msix() only does
> WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does
> WARN_ON(!!dev->msi_enabled); they both will continue.  Maybe that
> should be changed to warn and also return error, to prevent
> re-configuring msi/msix if already configured?  Or, maybe the warning
> is enough - the worst thing that reverting the patch does is use extra
> pirqs, right?

I think the warning is enough.  Can you confirm that with
af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also

ifconfig eth0 down; ifconfig eth0 up

still work as expected, no warnings?


It looks like the patch that changed hypervisor (QEMU actually) behavior
is:

commit c976437c7dba9c7444fb41df45468968aaa326ad
Author: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Date:   Wed May 7 13:41:48 2014 +0000

    qemu-xen: free all the pirqs for msi/msix when driver unload

From this commit onward, QEMU started unmapping pirqs when MSIs are
disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8,
4.7, 4.6, 4.5. The newest release without the commit is Xen 4.4.

If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on
all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4
and older. Given that Xen 4.4 is out of support, I think we should go
ahead with it.  Opinions?
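
For completeness, if we wanted to keep the old re-use path for pre-4.5 Xen
instead of a plain revert, the best a guest could probably do is a version
check along these lines, sketch only (and only a rough proxy, because the
change in behavior is in qemu-xen rather than in the hypervisor itself;
header location and availability of the helper are assumptions):

    #include <asm/xen/hypervisor.h> /* xen_running_on_version_or_later() */

    /*
     * Sketch: only trust a pirq cached in the MSI message data on Xen
     * older than 4.5, where qemu-xen did not yet unmap pirqs when MSIs
     * are disabled.
     */
    static bool sketch_can_reuse_cached_pirq(void)
    {
            return !xen_running_on_version_or_later(4, 5);
    }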

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-11 18:46                       ` Stefano Stabellini
@ 2017-01-11 23:25                         ` Dan Streetman
  2017-01-13 18:31                           ` Dan Streetman
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Streetman @ 2017-01-11 23:25 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, Dan Streetman,
	Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On Wed, Jan 11, 2017 at 1:46 PM, Stefano Stabellini
<sstabellini@kernel.org> wrote:
> On Wed, 11 Jan 2017, Dan Streetman wrote:
>> On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini
>> <sstabellini@kernel.org> wrote:
>> > On Tue, 10 Jan 2017, Dan Streetman wrote:
>> >> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini
>> >> <sstabellini@kernel.org> wrote:
>> >> > On Tue, 10 Jan 2017, Dan Streetman wrote:
>> >> >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote:
>> >> >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
>> >> >> > <sstabellini@kernel.org> wrote:
>> >> >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
>> >> >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
>> >> >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
>> >> >> >>> > <boris.ostrovsky@oracle.com> wrote:
>> >> >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
>> >> >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
>> >> >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was
>> >> >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was
>> >> >> >>> > >>> unmapped and may now be in use by another pci device.  The previous
>> >> >> >>> > >>> pirq should never be re-used; instead a new pirq should always be
>> >> >> >>> > >>> allocated from the hypervisor.
>> >> >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs
>> >> >> >>> > >> as we are not reusing them?
>> >> >> >>> > >
>> >> >> >>> > > Don't we free the pirq when we unmap it?
>> >> >> >>> >
>> >> >> >>> > I think this is actually a bit worse than I initially thought.  After
>> >> >> >>> > looking a bit closer, and I think there's an asymmetry with pirq
>> >> >> >>> > allocation:
>> >> >> >>>
>> >> >> >>> Let's include Stefano,
>> >> >> >>>
>> >> >> >>> Thank you for digging in this! This has quite the deja-vu
>> >> >> >>> feeling as I believe I hit this at some point in the past and
>> >> >> >>> posted some possible ways of fixing this. But sadly I can't
>> >> >> >>> find the thread.
>> >> >> >>
>> >> >> >> This issue seems to be caused by:
>> >> >> >>
>> >> >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
>> >> >> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>> >> >> >> Date:   Wed Dec 1 14:51:44 2010 +0000
>> >> >> >>
>> >> >> >>     xen: fix MSI setup and teardown for PV on HVM guests
>> >> >> >>
>> >> >> >> which was a fix to a bug:
>> >> >> >>
>> >> >> >>     This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
>> >> >> >>     trying to enable the same MSI for the second time: the old MSI to pirq
>> >> >> >>     mapping is still valid at this point but xen_hvm_setup_msi_irqs would
>> >> >> >>     try to assign a new pirq anyway.
>> >> >> >>     A simple way to reproduce this bug is to assign an MSI capable network
>> >> >> >>     card to a PV on HVM guest, if the user brings down the corresponding
>> >> >> >>     ethernet interface and up again, Linux would fail to enable MSIs on the
>> >> >> >>     device.
>> >> >> >>
>> >> >> >> I don't remember any of the details. From the description of this bug,
>> >> >> >> it seems that Xen changed behavior in the past few years: before it used
>> >> >> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
>> >> >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't
>> >> >> >> have been completely freed, then reassigned to somebody else the way it
>> >> >> >> is described in this email.
>> >> >> >>
>> >> >> >> I think we should identify the changeset or Xen version that introduced
>> >> >> >> the new behavior. If it is old enough, we might be able to just revert
>> >> >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
>> >> >> >> behavior conditional to the Xen version.
>> >> >> >
>> >> >> > Are PT devices the only MSI-capable devices available in a Xen guest?
>> >> >> > That's where I'm seeing this problem, with PT NVMe devices.
>> >> >
>> >> > They are the main ones. It is possible to have emulated virtio devices
>> >> > with emulated MSIs, for example virtio-net. Although they are not in
>> >> > the Xen Project CI-loop, so I wouldn't be surprised if they are broken
>> >> > too.
>> >> >
>> >> >
>> >> >> > I can say that on the Xen guest with NVMe PT devices I'm testing on,
>> >> >> > with the patch from this thread (which essentially reverts your commit
>> >> >> > above) as well as some added debug to see the pirq numbers, cycles of
>> >> >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
>> >> >> > hypervisor provides essentially the same pirqs for each modprobe,
>> >> >> > since they were freed by the rmmod.
>> >> >
>> >> > I am fine with reverting the old patch, but we need to understand what
>> >> > caused the change in behavior first. Maybe somebody else with a Xen PCI
>> >> > passthrough setup at hand can help with that.
>> >> >
>> >> > In the Xen code I can still see:
>> >> >
>> >> >     case ECS_PIRQ: {
>> >> >         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
>> >> >
>> >> >         if ( !pirq )
>> >> >             break;
>> >> >         if ( !is_hvm_domain(d1) )
>> >> >             pirq_guest_unbind(d1, pirq);
>> >> >
>> >> > which means that pirq_guest_unbind should only be called on evtchn_close
>> >> > if the guest is not an HVM guest.
>> >>
>> >> I tried an experiment to call get_free_pirq on both sides of an
>> >> evtchn_close hcall, using two SRIOV nics.  When I rmmod/modprobe, I
>> >> see something interesting; each nic uses 3 MSIs, and it looks like
>> >> when they shut down, each nic's 3 pirqs are not available until after
>> >> the nic is done shutting down, so it seems like closing the evtchn
>> >> isn't what makes the pirq free.
>> >>
>> >> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290
>> >> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291
>> >> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292
>> >> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps
>> >> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293
>> >> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294
>> >> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295
>> >> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps
>> >>
>> >> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98.
>> >>
>> >> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1,
>> >> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() ==
>> >> 0, xen_pvh_domain() == 0
>> >>
>> >> just to be sure, a bit of dbg so I know what domain this is :-)
>> >>
>> >> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93
>> >> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295
>> >> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92
>> >> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91
>> >> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294
>> >> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90
>> >> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89
>> >> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293
>> >> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88
>> >>
>> >> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but
>> >> none of those become available for get_free_pirq.
>> >>
>> >> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98
>> >>
>> >> aha, now nic 4 has fully finished shutting down, and nic 3 has started
>> >> shutdown; we see those pirqs from nic 4 are now available.  So it must
>> >> not be evtchn closing that frees the pirqs.
>> >>
>> >> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292
>> >> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97
>> >> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96
>> >> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291
>> >> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87
>> >> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86
>> >> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290
>> >> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85
>> >>
>> >>
>> >> so, since pci_disable_msix eventually calls xen_teardown_msi_irq()
>> >> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq()
>> >> (and recompiled/rebooted) and found the pirqs have already been freed
>> >> before that is called:
>> >>
>> >> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295
>> >> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294
>> >> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293
>> >> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100
>> >> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293
>> >> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99
>> >> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98
>> >> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294
>> >> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97
>> >> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96
>> >> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295
>> >> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95
>> >>
>> >>
>> >> I'm still following thru the kernel code, but it's not immediately
>> >> obvious what exactly is telling the hypervisor to free the pirqs; any
>> >> idea?
>> >>
>> >> From the hypervisor code, it seems that the pirq is "available" via
>> >> is_free_pirq():
>> >>     return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
>> >>         pirq->arch.hvm.emuirq == IRQ_UNBOUND));
>> >>
>> >> when the evtchn is closed, it does:
>> >>         if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
>> >>             unmap_domain_pirq_emuirq(d1, pirq->pirq);
>> >>
>> >> and that call to unmap_domain_pirq_emuirq does:
>> >>         info->arch.hvm.emuirq = IRQ_UNBOUND;
>> >>
>> >> so, the only thing left is to clear pirq->arch.irq, but the only place
>> >> I can find that does that is clear_domain_irq_pirq(), which is only
>> >> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not
>> >> seeing where either of those would be called when all the kernel is
>> >> doing is disabling a pci device.
>> >
>> > Thanks for the info. I think I know what causes the pirq to be unmapped:
>> > when Linux disables msi or msix on the device, using the regular pci
>> > config space based method, QEMU (which emulates the PCI config space)
>> > tells Xen to unmap the pirq.
>>
>> aha, via a XEN_DOMCTL_unbind_pt_irq, maybe?  Well that makes more sense then.
>>
>> >
>> > If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs
>> > to be called a second time without msis being disabled first, then I
>> > think we can revert the patch.
>>
>> It doesn't seem possible to call it twice from a correctly-behaved
>> driver, but in case of a driver bug that does try to enable msi/msix
>> multiple times without disabling, __pci_enable_msix() only does
>> WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does
>> WARN_ON(!!dev->msi_enabled); they both will continue.  Maybe that
>> should be changed to warn and also return error, to prevent
>> re-configuring msi/msix if already configured?  Or, maybe the warning
>> is enough - the worst thing that reverting the patch does is use extra
>> pirqs, right?
>
> I think the warning is enough.  Can you confirm that with
> af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also
>
> ifconfig eth0 down; ifconfig eth0 up
>
> still work as expected, no warnings?

yes, with the patch that started this thread - which essentially
reverts af42b8d12f8adec6711cb824549a0edac6a4ae8f - there are no
warnings and ifconfig down ; ifconfig up work as expected.

>
>
> It looks like the patch that changed hypervisor (QEMU actually) behavior
> is:
>
> commit c976437c7dba9c7444fb41df45468968aaa326ad
> Author: Zhenzhong Duan <zhenzhong.duan@oracle.com>
> Date:   Wed May 7 13:41:48 2014 +0000
>
>     qemu-xen: free all the pirqs for msi/msix when driver unload
>
> From this commit onward, QEMU started unmapping pirqs when MSIs are
> disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8,
> 4.7, 4.6, 4.5. The newest release without the commit is Xen 4.4.
>
> If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on
> all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4
> and older. Given that Xen 4.4 is out of support, I think we should go
> ahead with it.  Opinions?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-11 23:25                         ` Dan Streetman
@ 2017-01-13 18:31                           ` Dan Streetman
  2017-01-13 18:44                             ` Stefano Stabellini
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Streetman @ 2017-01-13 18:31 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, Dan Streetman,
	Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On Wed, Jan 11, 2017 at 6:25 PM, Dan Streetman <ddstreet@ieee.org> wrote:
> On Wed, Jan 11, 2017 at 1:46 PM, Stefano Stabellini
> <sstabellini@kernel.org> wrote:
>> On Wed, 11 Jan 2017, Dan Streetman wrote:
>>> On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini
>>> <sstabellini@kernel.org> wrote:
>>> > On Tue, 10 Jan 2017, Dan Streetman wrote:
>>> >> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini
>>> >> <sstabellini@kernel.org> wrote:
>>> >> > On Tue, 10 Jan 2017, Dan Streetman wrote:
>>> >> >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote:
>>> >> >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
>>> >> >> > <sstabellini@kernel.org> wrote:
>>> >> >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
>>> >> >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
>>> >> >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
>>> >> >> >>> > <boris.ostrovsky@oracle.com> wrote:
>>> >> >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
>>> >> >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
>>> >> >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was
>>> >> >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was
>>> >> >> >>> > >>> unmapped and may now be in use by another pci device.  The previous
>>> >> >> >>> > >>> pirq should never be re-used; instead a new pirq should always be
>>> >> >> >>> > >>> allocated from the hypervisor.
>>> >> >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs
>>> >> >> >>> > >> as we are not reusing them?
>>> >> >> >>> > >
>>> >> >> >>> > > Don't we free the pirq when we unmap it?
>>> >> >> >>> >
>>> >> >> >>> > I think this is actually a bit worse than I initially thought.  After
>>> >> >> >>> > looking a bit closer, and I think there's an asymmetry with pirq
>>> >> >> >>> > allocation:
>>> >> >> >>>
>>> >> >> >>> Let's include Stefano,
>>> >> >> >>>
>>> >> >> >>> Thank you for digging in this! This has quite the deja-vu
>>> >> >> >>> feeling as I believe I hit this at some point in the past and
>>> >> >> >>> posted some possible ways of fixing this. But sadly I can't
>>> >> >> >>> find the thread.
>>> >> >> >>
>>> >> >> >> This issue seems to be caused by:
>>> >> >> >>
>>> >> >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
>>> >> >> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>>> >> >> >> Date:   Wed Dec 1 14:51:44 2010 +0000
>>> >> >> >>
>>> >> >> >>     xen: fix MSI setup and teardown for PV on HVM guests
>>> >> >> >>
>>> >> >> >> which was a fix to a bug:
>>> >> >> >>
>>> >> >> >>     This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
>>> >> >> >>     trying to enable the same MSI for the second time: the old MSI to pirq
>>> >> >> >>     mapping is still valid at this point but xen_hvm_setup_msi_irqs would
>>> >> >> >>     try to assign a new pirq anyway.
>>> >> >> >>     A simple way to reproduce this bug is to assign an MSI capable network
>>> >> >> >>     card to a PV on HVM guest, if the user brings down the corresponding
>>> >> >> >>     ethernet interface and up again, Linux would fail to enable MSIs on the
>>> >> >> >>     device.
>>> >> >> >>
>>> >> >> >> I don't remember any of the details. From the description of this bug,
>>> >> >> >> it seems that Xen changed behavior in the past few years: before it used
>>> >> >> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
>>> >> >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't
>>> >> >> >> have been completely freed, then reassigned to somebody else the way it
>>> >> >> >> is described in this email.
>>> >> >> >>
>>> >> >> >> I think we should identify the changeset or Xen version that introduced
>>> >> >> >> the new behavior. If it is old enough, we might be able to just revert
>>> >> >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
>>> >> >> >> behavior conditional to the Xen version.
>>> >> >> >
>>> >> >> > Are PT devices the only MSI-capable devices available in a Xen guest?
>>> >> >> > That's where I'm seeing this problem, with PT NVMe devices.
>>> >> >
>>> >> > They are the main ones. It is possible to have emulated virtio devices
>>> >> > with emulated MSIs, for example virtio-net. Although they are not in
>>> >> > the Xen Project CI-loop, so I wouldn't be surprised if they are broken
>>> >> > too.
>>> >> >
>>> >> >
>>> >> >> > I can say that on the Xen guest with NVMe PT devices I'm testing on,
>>> >> >> > with the patch from this thread (which essentially reverts your commit
>>> >> >> > above) as well as some added debug to see the pirq numbers, cycles of
>>> >> >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
>>> >> >> > hypervisor provides essentially the same pirqs for each modprobe,
>>> >> >> > since they were freed by the rmmod.
>>> >> >
>>> >> > I am fine with reverting the old patch, but we need to understand what
>>> >> > caused the change in behavior first. Maybe somebody else with a Xen PCI
>>> >> > passthrough setup at hand can help with that.
>>> >> >
>>> >> > In the Xen code I can still see:
>>> >> >
>>> >> >     case ECS_PIRQ: {
>>> >> >         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
>>> >> >
>>> >> >         if ( !pirq )
>>> >> >             break;
>>> >> >         if ( !is_hvm_domain(d1) )
>>> >> >             pirq_guest_unbind(d1, pirq);
>>> >> >
>>> >> > which means that pirq_guest_unbind should only be called on evtchn_close
>>> >> > if the guest is not an HVM guest.
>>> >>
>>> >> I tried an experiment to call get_free_pirq on both sides of an
>>> >> evtchn_close hcall, using two SRIOV nics.  When I rmmod/modprobe, I
>>> >> see something interesting; each nic uses 3 MSIs, and it looks like
>>> >> when they shut down, each nic's 3 pirqs are not available until after
>>> >> the nic is done shutting down, so it seems like closing the evtchn
>>> >> isn't what makes the pirq free.
>>> >>
>>> >> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290
>>> >> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291
>>> >> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292
>>> >> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps
>>> >> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293
>>> >> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294
>>> >> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295
>>> >> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps
>>> >>
>>> >> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98.
>>> >>
>>> >> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1,
>>> >> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() ==
>>> >> 0, xen_pvh_domain() == 0
>>> >>
>>> >> just to be sure, a bit of dbg so I know what domain this is :-)
>>> >>
>>> >> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93
>>> >> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295
>>> >> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92
>>> >> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91
>>> >> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294
>>> >> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90
>>> >> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89
>>> >> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293
>>> >> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88
>>> >>
>>> >> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but
>>> >> none of those become available for get_free_pirq.
>>> >>
>>> >> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98
>>> >>
>>> >> aha, now nic 4 has fully finished shutting down, and nic 3 has started
>>> >> shutdown; we see those pirqs from nic 4 are now available.  So it must
>>> >> not be evtchn closing that frees the pirqs.
>>> >>
>>> >> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292
>>> >> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97
>>> >> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96
>>> >> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291
>>> >> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87
>>> >> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86
>>> >> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290
>>> >> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85
>>> >>
>>> >>
>>> >> so, since pci_disable_msix eventually calls xen_teardown_msi_irq()
>>> >> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq()
>>> >> (and recompiled/rebooted) and found the pirqs have already been freed
>>> >> before that is called:
>>> >>
>>> >> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295
>>> >> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294
>>> >> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293
>>> >> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100
>>> >> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293
>>> >> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99
>>> >> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98
>>> >> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294
>>> >> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97
>>> >> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96
>>> >> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295
>>> >> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95
>>> >>
>>> >>
>>> >> I'm still following thru the kernel code, but it's not immediately
>>> >> obvious what exactly is telling the hypervisor to free the pirqs; any
>>> >> idea?
>>> >>
>>> >> From the hypervisor code, it seems that the pirq is "available" via
>>> >> is_free_pirq():
>>> >>     return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
>>> >>         pirq->arch.hvm.emuirq == IRQ_UNBOUND));
>>> >>
>>> >> when the evtchn is closed, it does:
>>> >>         if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
>>> >>             unmap_domain_pirq_emuirq(d1, pirq->pirq);
>>> >>
>>> >> and that call to unmap_domain_pirq_emuirq does:
>>> >>         info->arch.hvm.emuirq = IRQ_UNBOUND;
>>> >>
>>> >> so, the only thing left is to clear pirq->arch.irq, but the only place
>>> >> I can find that does that is clear_domain_irq_pirq(), which is only
>>> >> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not
>>> >> seeing where either of those would be called when all the kernel is
>>> >> doing is disabling a pci device.
>>> >
>>> > Thanks for the info. I think I know what causes the pirq to be unmapped:
>>> > when Linux disables msi or msix on the device, using the regular pci
>>> > config space based method, QEMU (which emulates the PCI config space)
>>> > tells Xen to unmap the pirq.
>>>
>>> aha, via a XEN_DOMCTL_unbind_pt_irq, maybe?  Well that makes more sense then.
>>>
>>> >
>>> > If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs
>>> > to be called a second time without msis being disabled first, then I
>>> > think we can revert the patch.
>>>
>>> It doesn't seem possible to call it twice from a correctly-behaved
>>> driver, but in case of a driver bug that does try to enable msi/msix
>>> multiple times without disabling, __pci_enable_msix() only does
>>> WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does
>>> WARN_ON(!!dev->msi_enabled); they both will continue.  Maybe that
>>> should be changed to warn and also return error, to prevent
>>> re-configuring msi/msix if already configured?  Or, maybe the warning
>>> is enough - the worst thing that reverting the patch does is use extra
>>> pirqs, right?
>>
>> I think the warning is enough.  Can you confirm that with
>> af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also
>>
>> ifconfig eth0 down; ifconfig eth0 up
>>
>> still work as expected, no warnings?
>
> yes, with the patch that started this thread - which essentially
> reverts af42b8d12f8adec6711cb824549a0edac6a4ae8f - there are no
> warnings and ifconfig down ; ifconfig up work as expected.
>
>>
>>
>> It looks like the patch that changed hypervisor (QEMU actually) behavior
>> is:
>>
>> commit c976437c7dba9c7444fb41df45468968aaa326ad
>> Author: Zhenzhong Duan <zhenzhong.duan@oracle.com>
>> Date:   Wed May 7 13:41:48 2014 +0000
>>
>>     qemu-xen: free all the pirqs for msi/msix when driver unload
>>
>> From this commit onward, QEMU started unmapping pirqs when MSIs are
>> disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8,
>> 4.7, 4.6, 4.5. The newest release without the commit is Xen 4.4.
>>
>> If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on
>> all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4
>> and older. Given that Xen 4.4 is out of support, I think we should go
>> ahead with it.  Opinions?

Looks like there are no complaints; is my patch from the start of this
thread ok to use, or would you rather craft a patch yourself?  My patch's
description could use updating to add some of the info/background from
this discussion...

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-13 18:31                           ` Dan Streetman
@ 2017-01-13 18:44                             ` Stefano Stabellini
  2017-01-13 20:00                               ` Boris Ostrovsky
  0 siblings, 1 reply; 34+ messages in thread
From: Stefano Stabellini @ 2017-01-13 18:44 UTC (permalink / raw)
  To: jgross, boris.ostrovsky
  Cc: Stefano Stabellini, Konrad Rzeszutek Wilk, Dan Streetman,
	Bjorn Helgaas, xen-devel, linux-kernel, linux-pci, ddstreet

On Fri, 13 Jan 2017, Dan Streetman wrote:
> On Wed, Jan 11, 2017 at 6:25 PM, Dan Streetman <ddstreet@ieee.org> wrote:
> > On Wed, Jan 11, 2017 at 1:46 PM, Stefano Stabellini
> > <sstabellini@kernel.org> wrote:
> >> On Wed, 11 Jan 2017, Dan Streetman wrote:
> >>> On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini
> >>> <sstabellini@kernel.org> wrote:
> >>> > On Tue, 10 Jan 2017, Dan Streetman wrote:
> >>> >> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini
> >>> >> <sstabellini@kernel.org> wrote:
> >>> >> > On Tue, 10 Jan 2017, Dan Streetman wrote:
> >>> >> >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote:
> >>> >> >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
> >>> >> >> > <sstabellini@kernel.org> wrote:
> >>> >> >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
> >>> >> >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
> >>> >> >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
> >>> >> >> >>> > <boris.ostrovsky@oracle.com> wrote:
> >>> >> >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
> >>> >> >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
> >>> >> >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was
> >>> >> >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was
> >>> >> >> >>> > >>> unmapped and may now be in use by another pci device.  The previous
> >>> >> >> >>> > >>> pirq should never be re-used; instead a new pirq should always be
> >>> >> >> >>> > >>> allocated from the hypervisor.
> >>> >> >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs
> >>> >> >> >>> > >> as we are not reusing them?
> >>> >> >> >>> > >
> >>> >> >> >>> > > Don't we free the pirq when we unmap it?
> >>> >> >> >>> >
> >>> >> >> >>> > I think this is actually a bit worse than I initially thought.  After
> >>> >> >> >>> > looking a bit closer, and I think there's an asymmetry with pirq
> >>> >> >> >>> > allocation:
> >>> >> >> >>>
> >>> >> >> >>> Lets include Stefano,
> >>> >> >> >>>
> >>> >> >> >>> Thank you for digging in this! This has quite the deja-vu
> >>> >> >> >>> feeling as I believe I hit this at some point in the past and
> >>> >> >> >>> posted some possible ways of fixing this. But sadly I can't
> >>> >> >> >>> find the thread.
> >>> >> >> >>
> >>> >> >> >> This issue seems to be caused by:
> >>> >> >> >>
> >>> >> >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
> >>> >> >> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> >>> >> >> >> Date:   Wed Dec 1 14:51:44 2010 +0000
> >>> >> >> >>
> >>> >> >> >>     xen: fix MSI setup and teardown for PV on HVM guests
> >>> >> >> >>
> >>> >> >> >> which was a fix to a bug:
> >>> >> >> >>
> >>> >> >> >>     This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
> >>> >> >> >>     trying to enable the same MSI for the second time: the old MSI to pirq
> >>> >> >> >>     mapping is still valid at this point but xen_hvm_setup_msi_irqs would
> >>> >> >> >>     try to assign a new pirq anyway.
> >>> >> >> >>     A simple way to reproduce this bug is to assign an MSI capable network
> >>> >> >> >>     card to a PV on HVM guest, if the user brings down the corresponding
> >>> >> >> >>     ethernet interface and up again, Linux would fail to enable MSIs on the
> >>> >> >> >>     device.
> >>> >> >> >>
> >>> >> >> >> I don't remember any of the details. From the description of this bug,
> >>> >> >> >> it seems that Xen changed behavior in the past few years: before it used
> >>> >> >> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
> >>> >> >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't
> >>> >> >> >> have been completely freed, then reassigned to somebody else the way it
> >>> >> >> >> is described in this email.
> >>> >> >> >>
> >>> >> >> >> I think we should indentify the changeset or Xen version that introduced
> >>> >> >> >> the new behavior. If it is old enough, we might be able to just revert
> >>> >> >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
> >>> >> >> >> behavior conditional to the Xen version.
> >>> >> >> >
> >>> >> >> > Are PT devices the only MSI-capable devices available in a Xen guest?
> >>> >> >> > That's where I'm seeing this problem, with PT NVMe devices.
> >>> >> >
> >>> >> > They are the main ones. It is possible to have emulated virtio devices
> >>> >> > with emulated MSIs, for example virtio-net. Althought they are not in
> >>> >> > the Xen Project CI-loop, so I wouldn't be surprised if they are broken
> >>> >> > too.
> >>> >> >
> >>> >> >
> >>> >> >> > I can say that on the Xen guest with NVMe PT devices I'm testing on,
> >>> >> >> > with the patch from this thread (which essentially reverts your commit
> >>> >> >> > above) as well as some added debug to see the pirq numbers, cycles of
> >>> >> >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
> >>> >> >> > hypervisor provides essentially the same pirqs for each modprobe,
> >>> >> >> > since they were freed by the rmmod.
> >>> >> >
> >>> >> > I am fine with reverting the old patch, but we need to understand what
> >>> >> > caused the change in behavior first. Maybe somebody else with a Xen PCI
> >>> >> > passthrough setup at hand can help with that.
> >>> >> >
> >>> >> > In the Xen code I can still see:
> >>> >> >
> >>> >> >     case ECS_PIRQ: {
> >>> >> >         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
> >>> >> >
> >>> >> >         if ( !pirq )
> >>> >> >             break;
> >>> >> >         if ( !is_hvm_domain(d1) )
> >>> >> >             pirq_guest_unbind(d1, pirq);
> >>> >> >
> >>> >> > which means that pirq_guest_unbind should only be called on evtchn_close
> >>> >> > if the guest is not an HVM guest.
> >>> >>
> >>> >> I tried an experiment to call get_free_pirq on both sides of a
> >>> >> evtchn_close hcall, using two SRIOV nics.  When I rmmod/modprobe, I
> >>> >> see something interesting; each nic uses 3 MSIs, and it looks like
> >>> >> when they shut down, each nic's 3 pirqs are not available until after
> >>> >> the nic is done shutting down, so it seems like closing the evtchn
> >>> >> isn't what makes the pirq free.
> >>> >>
> >>> >> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290
> >>> >> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291
> >>> >> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292
> >>> >> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps
> >>> >> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293
> >>> >> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294
> >>> >> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295
> >>> >> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps
> >>> >>
> >>> >> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98.
> >>> >>
> >>> >> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1,
> >>> >> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() ==
> >>> >> 0, xen_pvh_domain() == 0
> >>> >>
> >>> >> just to be sure, a bit of dbg so I know what domain this is :-)
> >>> >>
> >>> >> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93
> >>> >> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295
> >>> >> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92
> >>> >> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91
> >>> >> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294
> >>> >> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90
> >>> >> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89
> >>> >> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293
> >>> >> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88
> >>> >>
> >>> >> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but
> >>> >> none of those become available for get_free_pirq.
> >>> >>
> >>> >> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98
> >>> >>
> >>> >> aha, now nic 4 has fully finished shutting down, and nic 3 has started
> >>> >> shutdown; we see those pirqs from nic 4 are now available.  So it must
> >>> >> not be evtchn closing that frees the pirqs.
> >>> >>
> >>> >> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292
> >>> >> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97
> >>> >> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96
> >>> >> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291
> >>> >> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87
> >>> >> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86
> >>> >> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290
> >>> >> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85
> >>> >>
> >>> >>
> >>> >> so, since pci_disable_msix eventually calls xen_teardown_msi_irq()
> >>> >> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq()
> >>> >> (and recompiled/rebooted) and found the pirqs have already been freed
> >>> >> before that is called:
> >>> >>
> >>> >> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295
> >>> >> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294
> >>> >> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293
> >>> >> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100
> >>> >> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293
> >>> >> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99
> >>> >> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98
> >>> >> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294
> >>> >> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97
> >>> >> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96
> >>> >> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295
> >>> >> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95
> >>> >>
> >>> >>
> >>> >> I'm still following thru the kernel code, but it's not immediately
> >>> >> obvious what exactly is telling the hypervisor to free the pirqs; any
> >>> >> idea?
> >>> >>
> >>> >> From the hypervisor code, it seems that the pirq is "available" via
> >>> >> is_free_pirq():
> >>> >>     return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
> >>> >>         pirq->arch.hvm.emuirq == IRQ_UNBOUND));
> >>> >>
> >>> >> when the evtchn is closed, it does:
> >>> >>         if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
> >>> >>             unmap_domain_pirq_emuirq(d1, pirq->pirq);
> >>> >>
> >>> >> and that call to unmap_domain_pirq_emuirq does:
> >>> >>         info->arch.hvm.emuirq = IRQ_UNBOUND;
> >>> >>
> >>> >> so, the only thing left is to clear pirq->arch.irq,but the only place
> >>> >> I can find that does that is clear_domain_irq_pirq(), which is only
> >>> >> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not
> >>> >> seeing where either of those would be called when all the kernel is
> >>> >> doing is disabling a pci device.
> >>> >
> >>> > Thanks for the info. I think I know what causes the pirq to be unmapped:
> >>> > when Linux disables msi or msix on the device, using the regular pci
> >>> > config space based method, QEMU (which emulates the PCI config space)
> >>> > tells Xen to unmap the pirq.
> >>>
> >>> aha, via a XEN_DOMCTL_unbind_pt_irq, maybe?  Well that makes more sense then.
> >>>
> >>> >
> >>> > If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs
> >>> > to be called a second time without msis being disabled first, then I
> >>> > think we can revert the patch.
> >>>
> >>> It doesn't seem possible to call it twice from a correctly-behaved
> >>> driver, but in case of a driver bug that does try to enable msi/msix
> >>> multiple times without disabling, __pci_enable_msix() only does
> >>> WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does
> >>> WARN_ON(!!dev->msi_enabled); they both will continue.  Maybe that
> >>> should be changed to warn and also return error, to prevent
> >>> re-configuring msi/msix if already configured?  Or, maybe the warning
> >>> is enough - the worst thing that reverting the patch does is use extra
> >>> pirqs, right?
> >>
> >> I think the warning is enough.  Can you confirm that with
> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also
> >>
> >> ifconfig eth0 down; ifconfig eth0 up
> >>
> >> still work as expected, no warnings?
> >
> > yes, with the patch that started this thread - which essentially
> > reverts af42b8d12f8adec6711cb824549a0edac6a4ae8f - there are no
> > warnings and ifconfig down ; ifconfig up work as expected.
> >
> >>
> >>
> >> It looks like the patch that changed hypervisor (QEMU actually) behavior
> >> is:
> >>
> >> commit c976437c7dba9c7444fb41df45468968aaa326ad
> >> Author: Zhenzhong Duan <zhenzhong.duan@oracle.com>
> >> Date:   Wed May 7 13:41:48 2014 +0000
> >>
> >>     qemu-xen: free all the pirqs for msi/msix when driver unload
> >>
> >> From this commit onward, QEMU started unmapping pirqs when MSIs are
> >> disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8,
> >> 4.7, 4.6, 4.5. The newest release without the commit is Xen 4.4.
> >>
> >> If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on
> >> all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4
> >> and older. Given that Xen 4.4 is out of support, I think we should go
> >> ahead with it.  Opinions?
> 
> Looks like there's no complaints; is my patch from the start of this
> thread ok to use, or can you craft a patch to use?  My patch's
> description could use updating to add some of the info/background from
> this discussion...

Hi Dan, I would like an explicit Ack from the other maintainers, Boris
and Juergen. Let me place them in To: to make it more obvious. 


* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-13 18:44                             ` Stefano Stabellini
@ 2017-01-13 20:00                               ` Boris Ostrovsky
  2017-01-13 20:07                                 ` Dan Streetman
  2017-01-13 20:13                                 ` Dan Streetman
  0 siblings, 2 replies; 34+ messages in thread
From: Boris Ostrovsky @ 2017-01-13 20:00 UTC (permalink / raw)
  To: Stefano Stabellini, jgross, Konrad Rzeszutek Wilk
  Cc: Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci,
	ddstreet

On 01/13/2017 01:44 PM, Stefano Stabellini wrote:
> On Fri, 13 Jan 2017, Dan Streetman wrote:
>> On Wed, Jan 11, 2017 at 6:25 PM, Dan Streetman <ddstreet@ieee.org> wrote:
>>> On Wed, Jan 11, 2017 at 1:46 PM, Stefano Stabellini
>>> <sstabellini@kernel.org> wrote:
>>>> On Wed, 11 Jan 2017, Dan Streetman wrote:
>>>>> On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini
>>>>> <sstabellini@kernel.org> wrote:
>>>>>> On Tue, 10 Jan 2017, Dan Streetman wrote:
>>>>>>> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini
>>>>>>> <sstabellini@kernel.org> wrote:
>>>>>>>> On Tue, 10 Jan 2017, Dan Streetman wrote:
>>>>>>>>> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote:
>>>>>>>>>> On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
>>>>>>>>>> <sstabellini@kernel.org> wrote:
>>>>>>>>>>> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
>>>>>>>>>>>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
>>>>>>>>>>>>> On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
>>>>>>>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>>>>>>>> On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
>>>>>>>>>>>>>>> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
>>>>>>>>>>>>>>>> Do not read a pci device's msi message data to see if a pirq was
>>>>>>>>>>>>>>>> previously configured for the device's msi/msix, as the old pirq was
>>>>>>>>>>>>>>>> unmapped and may now be in use by another pci device.  The previous
>>>>>>>>>>>>>>>> pirq should never be re-used; instead a new pirq should always be
>>>>>>>>>>>>>>>> allocated from the hypervisor.
>>>>>>>>>>>>>>> Won't this cause a starvation problem? That is we will run out of PIRQs
>>>>>>>>>>>>>>> as we are not reusing them?
>>>>>>>>>>>>>> Don't we free the pirq when we unmap it?
>>>>>>>>>>>>> I think this is actually a bit worse than I initially thought.  After
>>>>>>>>>>>>> looking a bit closer, and I think there's an asymmetry with pirq
>>>>>>>>>>>>> allocation:
>>>>>>>>>>>> Lets include Stefano,
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for digging in this! This has quite the deja-vu
>>>>>>>>>>>> feeling as I believe I hit this at some point in the past and
>>>>>>>>>>>> posted some possible ways of fixing this. But sadly I can't
>>>>>>>>>>>> find the thread.
>>>>>>>>>>> This issue seems to be caused by:
>>>>>>>>>>>
>>>>>>>>>>> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
>>>>>>>>>>> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>>>>>>>>>>> Date:   Wed Dec 1 14:51:44 2010 +0000
>>>>>>>>>>>
>>>>>>>>>>>     xen: fix MSI setup and teardown for PV on HVM guests
>>>>>>>>>>>
>>>>>>>>>>> which was a fix to a bug:
>>>>>>>>>>>
>>>>>>>>>>>     This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
>>>>>>>>>>>     trying to enable the same MSI for the second time: the old MSI to pirq
>>>>>>>>>>>     mapping is still valid at this point but xen_hvm_setup_msi_irqs would
>>>>>>>>>>>     try to assign a new pirq anyway.
>>>>>>>>>>>     A simple way to reproduce this bug is to assign an MSI capable network
>>>>>>>>>>>     card to a PV on HVM guest, if the user brings down the corresponding
>>>>>>>>>>>     ethernet interface and up again, Linux would fail to enable MSIs on the
>>>>>>>>>>>     device.
>>>>>>>>>>>
>>>>>>>>>>> I don't remember any of the details. From the description of this bug,
>>>>>>>>>>> it seems that Xen changed behavior in the past few years: before it used
>>>>>>>>>>> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
>>>>>>>>>>> old MSI to pirq mapping is still valid at this point", the pirq couldn't
>>>>>>>>>>> have been completely freed, then reassigned to somebody else the way it
>>>>>>>>>>> is described in this email.
>>>>>>>>>>>
>>>>>>>>>>> I think we should indentify the changeset or Xen version that introduced
>>>>>>>>>>> the new behavior. If it is old enough, we might be able to just revert
>>>>>>>>>>> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
>>>>>>>>>>> behavior conditional to the Xen version.
>>>>>>>>>> Are PT devices the only MSI-capable devices available in a Xen guest?
>>>>>>>>>> That's where I'm seeing this problem, with PT NVMe devices.
>>>>>>>> They are the main ones. It is possible to have emulated virtio devices
>>>>>>>> with emulated MSIs, for example virtio-net. Althought they are not in
>>>>>>>> the Xen Project CI-loop, so I wouldn't be surprised if they are broken
>>>>>>>> too.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> I can say that on the Xen guest with NVMe PT devices I'm testing on,
>>>>>>>>>> with the patch from this thread (which essentially reverts your commit
>>>>>>>>>> above) as well as some added debug to see the pirq numbers, cycles of
>>>>>>>>>> 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
>>>>>>>>>> hypervisor provides essentially the same pirqs for each modprobe,
>>>>>>>>>> since they were freed by the rmmod.
>>>>>>>> I am fine with reverting the old patch, but we need to understand what
>>>>>>>> caused the change in behavior first. Maybe somebody else with a Xen PCI
>>>>>>>> passthrough setup at hand can help with that.
>>>>>>>>
>>>>>>>> In the Xen code I can still see:
>>>>>>>>
>>>>>>>>     case ECS_PIRQ: {
>>>>>>>>         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
>>>>>>>>
>>>>>>>>         if ( !pirq )
>>>>>>>>             break;
>>>>>>>>         if ( !is_hvm_domain(d1) )
>>>>>>>>             pirq_guest_unbind(d1, pirq);
>>>>>>>>
>>>>>>>> which means that pirq_guest_unbind should only be called on evtchn_close
>>>>>>>> if the guest is not an HVM guest.
>>>>>>> I tried an experiment to call get_free_pirq on both sides of a
>>>>>>> evtchn_close hcall, using two SRIOV nics.  When I rmmod/modprobe, I
>>>>>>> see something interesting; each nic uses 3 MSIs, and it looks like
>>>>>>> when they shut down, each nic's 3 pirqs are not available until after
>>>>>>> the nic is done shutting down, so it seems like closing the evtchn
>>>>>>> isn't what makes the pirq free.
>>>>>>>
>>>>>>> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290
>>>>>>> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291
>>>>>>> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292
>>>>>>> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps
>>>>>>> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293
>>>>>>> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294
>>>>>>> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295
>>>>>>> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps
>>>>>>>
>>>>>>> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98.
>>>>>>>
>>>>>>> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1,
>>>>>>> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() ==
>>>>>>> 0, xen_pvh_domain() == 0
>>>>>>>
>>>>>>> just to be sure, a bit of dbg so I know what domain this is :-)
>>>>>>>
>>>>>>> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93
>>>>>>> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295
>>>>>>> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92
>>>>>>> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91
>>>>>>> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294
>>>>>>> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90
>>>>>>> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89
>>>>>>> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293
>>>>>>> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88
>>>>>>>
>>>>>>> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but
>>>>>>> none of those become available for get_free_pirq.
>>>>>>>
>>>>>>> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98
>>>>>>>
>>>>>>> aha, now nic 4 has fully finished shutting down, and nic 3 has started
>>>>>>> shutdown; we see those pirqs from nic 4 are now available.  So it must
>>>>>>> not be evtchn closing that frees the pirqs.
>>>>>>>
>>>>>>> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292
>>>>>>> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97
>>>>>>> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96
>>>>>>> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291
>>>>>>> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87
>>>>>>> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86
>>>>>>> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290
>>>>>>> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85
>>>>>>>
>>>>>>>
>>>>>>> so, since pci_disable_msix eventually calls xen_teardown_msi_irq()
>>>>>>> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq()
>>>>>>> (and recompiled/rebooted) and found the pirqs have already been freed
>>>>>>> before that is called:
>>>>>>>
>>>>>>> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295
>>>>>>> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294
>>>>>>> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293
>>>>>>> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100
>>>>>>> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293
>>>>>>> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99
>>>>>>> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98
>>>>>>> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294
>>>>>>> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97
>>>>>>> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96
>>>>>>> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295
>>>>>>> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95
>>>>>>>
>>>>>>>
>>>>>>> I'm still following thru the kernel code, but it's not immediately
>>>>>>> obvious what exactly is telling the hypervisor to free the pirqs; any
>>>>>>> idea?
>>>>>>>
>>>>>>> From the hypervisor code, it seems that the pirq is "available" via
>>>>>>> is_free_pirq():
>>>>>>>     return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
>>>>>>>         pirq->arch.hvm.emuirq == IRQ_UNBOUND));
>>>>>>>
>>>>>>> when the evtchn is closed, it does:
>>>>>>>         if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
>>>>>>>             unmap_domain_pirq_emuirq(d1, pirq->pirq);
>>>>>>>
>>>>>>> and that call to unmap_domain_pirq_emuirq does:
>>>>>>>         info->arch.hvm.emuirq = IRQ_UNBOUND;
>>>>>>>
>>>>>>> so, the only thing left is to clear pirq->arch.irq,but the only place
>>>>>>> I can find that does that is clear_domain_irq_pirq(), which is only
>>>>>>> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not
>>>>>>> seeing where either of those would be called when all the kernel is
>>>>>>> doing is disabling a pci device.
>>>>>> Thanks for the info. I think I know what causes the pirq to be unmapped:
>>>>>> when Linux disables msi or msix on the device, using the regular pci
>>>>>> config space based method, QEMU (which emulates the PCI config space)
>>>>>> tells Xen to unmap the pirq.
>>>>> aha, via a XEN_DOMCTL_unbind_pt_irq, maybe?  Well that makes more sense then.
>>>>>
>>>>>> If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs
>>>>>> to be called a second time without msis being disabled first, then I
>>>>>> think we can revert the patch.
>>>>> It doesn't seem possible to call it twice from a correctly-behaved
>>>>> driver, but in case of a driver bug that does try to enable msi/msix
>>>>> multiple times without disabling, __pci_enable_msix() only does
>>>>> WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does
>>>>> WARN_ON(!!dev->msi_enabled); they both will continue.  Maybe that
>>>>> should be changed to warn and also return error, to prevent
>>>>> re-configuring msi/msix if already configured?  Or, maybe the warning
>>>>> is enough - the worst thing that reverting the patch does is use extra
>>>>> pirqs, right?
>>>> I think the warning is enough.  Can you confirm that with
>>>> af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also
>>>>
>>>> ifconfig eth0 down; ifconfig eth0 up
>>>>
>>>> still work as expected, no warnings?
>>> yes, with the patch that started this thread - which essentially
>>> reverts af42b8d12f8adec6711cb824549a0edac6a4ae8f - there are no
>>> warnings and ifconfig down ; ifconfig up work as expected.
>>>
>>>>
>>>> It looks like the patch that changed hypervisor (QEMU actually) behavior
>>>> is:
>>>>
>>>> commit c976437c7dba9c7444fb41df45468968aaa326ad
>>>> Author: Zhenzhong Duan <zhenzhong.duan@oracle.com>
>>>> Date:   Wed May 7 13:41:48 2014 +0000
>>>>
>>>>     qemu-xen: free all the pirqs for msi/msix when driver unload
>>>>
>>>> From this commit onward, QEMU started unmapping pirqs when MSIs are
>>>> disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8,
>>>> 4.7, 4.6, 4.5. The newest release without the commit is Xen 4.4.
>>>>
>>>> If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on
>>>> all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4
>>>> and older. Given that Xen 4.4 is out of support, I think we should go
>>>> ahead with it.  Opinions?
>> Looks like there's no complaints; is my patch from the start of this
>> thread ok to use, or can you craft a patch to use?  My patch's
>> description could use updating to add some of the info/background from
>> this discussion...
> Hi Dan, I would like an explicit Ack from the other maintainers, Boris
> and Juergen. Let me place them in To: to make it more obvious. 


Where is the patch? I don't think 'git revert' will work.

And Konrad will need to ack it too, as he is the Xen-PCI maintainer.

-boris


* [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-13 20:00                               ` Boris Ostrovsky
@ 2017-01-13 20:07                                 ` Dan Streetman
  2017-01-13 20:54                                   ` Stefano Stabellini
                                                     ` (2 more replies)
  2017-01-13 20:13                                 ` Dan Streetman
  1 sibling, 3 replies; 34+ messages in thread
From: Dan Streetman @ 2017-01-13 20:07 UTC (permalink / raw)
  To: Stefano Stabellini, jgross, Konrad Rzeszutek Wilk, Boris Ostrovsky
  Cc: Dan Streetman, Dan Streetman, Bjorn Helgaas, xen-devel,
	linux-kernel, linux-pci

Revert the main part of commit:
af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")

That commit introduced reading the pci device's msi message data to see
if a pirq was previously configured for the device's msi/msix, and re-use
that pirq.  At the time, that was the correct behavior.  However, a
later change to Qemu caused it to call into the Xen hypervisor to unmap
all pirqs for a pci device, when the pci device disables its MSI/MSIX
vectors; specifically the Qemu commit:
c976437c7dba9c7444fb41df45468968aaa326ad
("qemu-xen: free all the pirqs for msi/msix when driver unload")

Once Qemu added this pirq unmapping, it was no longer correct for the
kernel to re-use the pirq number cached in the pci device msi message
data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
pirqs when the pci device disables its MSI/MSIX vectors.

This bug is causing failures to initialize multiple NVMe controllers
under Xen, because the NVMe driver sets up a single MSIX vector for
each controller (concurrently), and then after using that to talk to
the controller for some configuration data, it disables the single MSIX
vector and re-configures all the MSIX vectors it needs.  So the MSIX
setup code tries to re-use the cached pirq from the first vector
for each controller, but the hypervisor has already given away that
pirq to another controller, and its initialization fails.

This is discussed in more detail at:
https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html

Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
---
 arch/x86/pci/xen.c | 23 +++++++----------------
 1 file changed, 7 insertions(+), 16 deletions(-)

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index bedfab9..a00a6c0 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 		return 1;
 
 	for_each_pci_msi_entry(msidesc, dev) {
-		__pci_read_msi_msg(msidesc, &msg);
-		pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
-			((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
-		if (msg.data != XEN_PIRQ_MSI_DATA ||
-		    xen_irq_from_pirq(pirq) < 0) {
-			pirq = xen_allocate_pirq_msi(dev, msidesc);
-			if (pirq < 0) {
-				irq = -ENODEV;
-				goto error;
-			}
-			xen_msi_compose_msg(dev, pirq, &msg);
-			__pci_write_msi_msg(msidesc, &msg);
-			dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
-		} else {
-			dev_dbg(&dev->dev,
-				"xen: msi already bound to pirq=%d\n", pirq);
+		pirq = xen_allocate_pirq_msi(dev, msidesc);
+		if (pirq < 0) {
+			irq = -ENODEV;
+			goto error;
 		}
+		xen_msi_compose_msg(dev, pirq, &msg);
+		__pci_write_msi_msg(msidesc, &msg);
+		dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
 		irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
 					       (type == PCI_CAP_ID_MSI) ? nvec : 1,
 					       (type == PCI_CAP_ID_MSIX) ?
-- 
2.9.3
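
As an illustration of the failure mode described in the commit message
above, here is a toy userspace simulation of a hypervisor-style pirq
pool.  This is not kernel or Xen code; the pool size, device names and
helper functions are made up purely for illustration (the downward scan
only mirrors the descending pirq numbers seen in the logs quoted earlier
in the thread), but it shows why a pirq number cached across an unmap
cannot be trusted:

/* pirq_reuse_demo.c - toy simulation, NOT kernel or Xen code. */
#include <stdio.h>

#define NR_PIRQS 8

static const char *owner[NR_PIRQS];     /* NULL means the pirq is free */

/* hand out the highest free pirq, scanning downward */
static int alloc_pirq(const char *dev)
{
    for (int p = NR_PIRQS - 1; p >= 0; p--) {
        if (!owner[p]) {
            owner[p] = dev;
            return p;
        }
    }
    return -1;
}

static void unmap_pirq(int p)
{
    owner[p] = NULL;                    /* pirq returns to the free pool */
}

int main(void)
{
    /* Controller A sets up one MSI-X vector and caches the pirq number. */
    int cached_a = alloc_pirq("nvmeA");
    printf("A gets pirq %d and caches it\n", cached_a);

    /* A disables MSI-X; QEMU/Xen unmap (free) that pirq. */
    unmap_pirq(cached_a);

    /* Meanwhile controller B enables MSI-X and is handed the same pirq. */
    int pirq_b = alloc_pirq("nvmeB");
    printf("B gets pirq %d (owner is now %s)\n", pirq_b, owner[pirq_b]);

    /* A re-enables MSI-X and trusts its cached number: collision. */
    if (cached_a == pirq_b)
        printf("A's cached pirq %d now belongs to B, so the bind fails\n",
               cached_a);

    /* The fixed behaviour: always ask for a fresh pirq instead. */
    printf("A allocates fresh pirq %d instead\n", alloc_pirq("nvmeA"));
    return 0;
}

Run standalone, A's cached number is handed to B after the unmap, which
is the same collision the concurrent NVMe controller initialization hits.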


* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-13 20:00                               ` Boris Ostrovsky
  2017-01-13 20:07                                 ` Dan Streetman
@ 2017-01-13 20:13                                 ` Dan Streetman
  1 sibling, 0 replies; 34+ messages in thread
From: Dan Streetman @ 2017-01-13 20:13 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Stefano Stabellini, jgross, Konrad Rzeszutek Wilk, Dan Streetman,
	Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On Fri, Jan 13, 2017 at 3:00 PM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 01/13/2017 01:44 PM, Stefano Stabellini wrote:
>> On Fri, 13 Jan 2017, Dan Streetman wrote:
>>> On Wed, Jan 11, 2017 at 6:25 PM, Dan Streetman <ddstreet@ieee.org> wrote:
>>>> On Wed, Jan 11, 2017 at 1:46 PM, Stefano Stabellini
>>>> <sstabellini@kernel.org> wrote:
>>>>> On Wed, 11 Jan 2017, Dan Streetman wrote:
>>>>>> On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini
>>>>>> <sstabellini@kernel.org> wrote:
>>>>>>> On Tue, 10 Jan 2017, Dan Streetman wrote:
>>>>>>>> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini
>>>>>>>> <sstabellini@kernel.org> wrote:
>>>>>>>>> On Tue, 10 Jan 2017, Dan Streetman wrote:
>>>>>>>>>> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote:
>>>>>>>>>>> On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
>>>>>>>>>>> <sstabellini@kernel.org> wrote:
>>>>>>>>>>>> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
>>>>>>>>>>>>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
>>>>>>>>>>>>>> On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
>>>>>>>>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>>>>>>>>> On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
>>>>>>>>>>>>>>>> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
>>>>>>>>>>>>>>>>> Do not read a pci device's msi message data to see if a pirq was
>>>>>>>>>>>>>>>>> previously configured for the device's msi/msix, as the old pirq was
>>>>>>>>>>>>>>>>> unmapped and may now be in use by another pci device.  The previous
>>>>>>>>>>>>>>>>> pirq should never be re-used; instead a new pirq should always be
>>>>>>>>>>>>>>>>> allocated from the hypervisor.
>>>>>>>>>>>>>>>> Won't this cause a starvation problem? That is we will run out of PIRQs
>>>>>>>>>>>>>>>> as we are not reusing them?
>>>>>>>>>>>>>>> Don't we free the pirq when we unmap it?
>>>>>>>>>>>>>> I think this is actually a bit worse than I initially thought.  After
>>>>>>>>>>>>>> looking a bit closer, and I think there's an asymmetry with pirq
>>>>>>>>>>>>>> allocation:
>>>>>>>>>>>>> Lets include Stefano,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for digging in this! This has quite the deja-vu
>>>>>>>>>>>>> feeling as I believe I hit this at some point in the past and
>>>>>>>>>>>>> posted some possible ways of fixing this. But sadly I can't
>>>>>>>>>>>>> find the thread.
>>>>>>>>>>>> This issue seems to be caused by:
>>>>>>>>>>>>
>>>>>>>>>>>> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
>>>>>>>>>>>> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>>>>>>>>>>>> Date:   Wed Dec 1 14:51:44 2010 +0000
>>>>>>>>>>>>
>>>>>>>>>>>>     xen: fix MSI setup and teardown for PV on HVM guests
>>>>>>>>>>>>
>>>>>>>>>>>> which was a fix to a bug:
>>>>>>>>>>>>
>>>>>>>>>>>>     This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
>>>>>>>>>>>>     trying to enable the same MSI for the second time: the old MSI to pirq
>>>>>>>>>>>>     mapping is still valid at this point but xen_hvm_setup_msi_irqs would
>>>>>>>>>>>>     try to assign a new pirq anyway.
>>>>>>>>>>>>     A simple way to reproduce this bug is to assign an MSI capable network
>>>>>>>>>>>>     card to a PV on HVM guest, if the user brings down the corresponding
>>>>>>>>>>>>     ethernet interface and up again, Linux would fail to enable MSIs on the
>>>>>>>>>>>>     device.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't remember any of the details. From the description of this bug,
>>>>>>>>>>>> it seems that Xen changed behavior in the past few years: before it used
>>>>>>>>>>>> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
>>>>>>>>>>>> old MSI to pirq mapping is still valid at this point", the pirq couldn't
>>>>>>>>>>>> have been completely freed, then reassigned to somebody else the way it
>>>>>>>>>>>> is described in this email.
>>>>>>>>>>>>
>>>>>>>>>>>> I think we should indentify the changeset or Xen version that introduced
>>>>>>>>>>>> the new behavior. If it is old enough, we might be able to just revert
>>>>>>>>>>>> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
>>>>>>>>>>>> behavior conditional to the Xen version.
>>>>>>>>>>> Are PT devices the only MSI-capable devices available in a Xen guest?
>>>>>>>>>>> That's where I'm seeing this problem, with PT NVMe devices.
>>>>>>>>> They are the main ones. It is possible to have emulated virtio devices
>>>>>>>>> with emulated MSIs, for example virtio-net. Althought they are not in
>>>>>>>>> the Xen Project CI-loop, so I wouldn't be surprised if they are broken
>>>>>>>>> too.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> I can say that on the Xen guest with NVMe PT devices I'm testing on,
>>>>>>>>>>> with the patch from this thread (which essentially reverts your commit
>>>>>>>>>>> above) as well as some added debug to see the pirq numbers, cycles of
>>>>>>>>>>> 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
>>>>>>>>>>> hypervisor provides essentially the same pirqs for each modprobe,
>>>>>>>>>>> since they were freed by the rmmod.
>>>>>>>>> I am fine with reverting the old patch, but we need to understand what
>>>>>>>>> caused the change in behavior first. Maybe somebody else with a Xen PCI
>>>>>>>>> passthrough setup at hand can help with that.
>>>>>>>>>
>>>>>>>>> In the Xen code I can still see:
>>>>>>>>>
>>>>>>>>>     case ECS_PIRQ: {
>>>>>>>>>         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
>>>>>>>>>
>>>>>>>>>         if ( !pirq )
>>>>>>>>>             break;
>>>>>>>>>         if ( !is_hvm_domain(d1) )
>>>>>>>>>             pirq_guest_unbind(d1, pirq);
>>>>>>>>>
>>>>>>>>> which means that pirq_guest_unbind should only be called on evtchn_close
>>>>>>>>> if the guest is not an HVM guest.
>>>>>>>> I tried an experiment to call get_free_pirq on both sides of a
>>>>>>>> evtchn_close hcall, using two SRIOV nics.  When I rmmod/modprobe, I
>>>>>>>> see something interesting; each nic uses 3 MSIs, and it looks like
>>>>>>>> when they shut down, each nic's 3 pirqs are not available until after
>>>>>>>> the nic is done shutting down, so it seems like closing the evtchn
>>>>>>>> isn't what makes the pirq free.
>>>>>>>>
>>>>>>>> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290
>>>>>>>> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291
>>>>>>>> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292
>>>>>>>> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps
>>>>>>>> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293
>>>>>>>> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294
>>>>>>>> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295
>>>>>>>> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps
>>>>>>>>
>>>>>>>> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98.
>>>>>>>>
>>>>>>>> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1,
>>>>>>>> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() ==
>>>>>>>> 0, xen_pvh_domain() == 0
>>>>>>>>
>>>>>>>> just to be sure, a bit of dbg so I know what domain this is :-)
>>>>>>>>
>>>>>>>> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93
>>>>>>>> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295
>>>>>>>> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92
>>>>>>>> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91
>>>>>>>> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294
>>>>>>>> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90
>>>>>>>> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89
>>>>>>>> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293
>>>>>>>> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88
>>>>>>>>
>>>>>>>> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but
>>>>>>>> none of those become available for get_free_pirq.
>>>>>>>>
>>>>>>>> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98
>>>>>>>>
>>>>>>>> aha, now nic 4 has fully finished shutting down, and nic 3 has started
>>>>>>>> shutdown; we see those pirqs from nic 4 are now available.  So it must
>>>>>>>> not be evtchn closing that frees the pirqs.
>>>>>>>>
>>>>>>>> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292
>>>>>>>> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97
>>>>>>>> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96
>>>>>>>> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291
>>>>>>>> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87
>>>>>>>> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86
>>>>>>>> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290
>>>>>>>> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85
>>>>>>>>
>>>>>>>>
>>>>>>>> so, since pci_disable_msix eventually calls xen_teardown_msi_irq()
>>>>>>>> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq()
>>>>>>>> (and recompiled/rebooted) and found the pirqs have already been freed
>>>>>>>> before that is called:
>>>>>>>>
>>>>>>>> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295
>>>>>>>> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294
>>>>>>>> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293
>>>>>>>> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100
>>>>>>>> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293
>>>>>>>> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99
>>>>>>>> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98
>>>>>>>> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294
>>>>>>>> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97
>>>>>>>> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96
>>>>>>>> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295
>>>>>>>> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm still following thru the kernel code, but it's not immediately
>>>>>>>> obvious what exactly is telling the hypervisor to free the pirqs; any
>>>>>>>> idea?
>>>>>>>>
>>>>>>>> From the hypervisor code, it seems that the pirq is "available" via
>>>>>>>> is_free_pirq():
>>>>>>>>     return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
>>>>>>>>         pirq->arch.hvm.emuirq == IRQ_UNBOUND));
>>>>>>>>
>>>>>>>> when the evtchn is closed, it does:
>>>>>>>>         if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
>>>>>>>>             unmap_domain_pirq_emuirq(d1, pirq->pirq);
>>>>>>>>
>>>>>>>> and that call to unmap_domain_pirq_emuirq does:
>>>>>>>>         info->arch.hvm.emuirq = IRQ_UNBOUND;
>>>>>>>>
>>>>>>>> so, the only thing left is to clear pirq->arch.irq,but the only place
>>>>>>>> I can find that does that is clear_domain_irq_pirq(), which is only
>>>>>>>> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not
>>>>>>>> seeing where either of those would be called when all the kernel is
>>>>>>>> doing is disabling a pci device.
>>>>>>> Thanks for the info. I think I know what causes the pirq to be unmapped:
>>>>>>> when Linux disables msi or msix on the device, using the regular pci
>>>>>>> config space based method, QEMU (which emulates the PCI config space)
>>>>>>> tells Xen to unmap the pirq.
>>>>>> aha, via a XEN_DOMCTL_unbind_pt_irq, maybe?  Well that makes more sense then.
>>>>>>
>>>>>>> If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs
>>>>>>> to be called a second time without msis being disabled first, then I
>>>>>>> think we can revert the patch.
>>>>>> It doesn't seem possible to call it twice from a correctly-behaved
>>>>>> driver, but in case of a driver bug that does try to enable msi/msix
>>>>>> multiple times without disabling, __pci_enable_msix() only does
>>>>>> WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does
>>>>>> WARN_ON(!!dev->msi_enabled); they both will continue.  Maybe that
>>>>>> should be changed to warn and also return error, to prevent
>>>>>> re-configuring msi/msix if already configured?  Or, maybe the warning
>>>>>> is enough - the worst thing that reverting the patch does is use extra
>>>>>> pirqs, right?
>>>>> I think the warning is enough.  Can you confirm that with
>>>>> af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also
>>>>>
>>>>> ifconfig eth0 down; ifconfig eth0 up
>>>>>
>>>>> still work as expected, no warnings?
>>>> yes, with the patch that started this thread - which essentially
>>>> reverts af42b8d12f8adec6711cb824549a0edac6a4ae8f - there are no
>>>> warnings and ifconfig down ; ifconfig up work as expected.
>>>>
>>>>>
>>>>> It looks like the patch that changed hypervisor (QEMU actually) behavior
>>>>> is:
>>>>>
>>>>> commit c976437c7dba9c7444fb41df45468968aaa326ad
>>>>> Author: Zhenzhong Duan <zhenzhong.duan@oracle.com>
>>>>> Date:   Wed May 7 13:41:48 2014 +0000
>>>>>
>>>>>     qemu-xen: free all the pirqs for msi/msix when driver unload
>>>>>
>>>>> From this commit onward, QEMU started unmapping pirqs when MSIs are
>>>>> disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8,
>>>>> 4.7, 4.6, 4.5. The newest release without the commit is Xen 4.4.
>>>>>
>>>>> If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on
>>>>> all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4
>>>>> and older. Given that Xen 4.4 is out of support, I think we should go
>>>>> ahead with it.  Opinions?
>>> Looks like there's no complaints; is my patch from the start of this
>>> thread ok to use, or can you craft a patch to use?  My patch's
>>> description could use updating to add some of the info/background from
>>> this discussion...
>> Hi Dan, I would like an explicit Ack from the other maintainers, Boris
>> and Juergen. Let me place them in To: to make it more obvious.
>
>
> Where is the patch? I don't think 'git revert' will work.

I just re-sent the same patch that originally started this long
thread, with the description updated to include details on the
previous commits and a Fixes: line.  It reverts the main part of
commit af42b8d12f8adec6711cb824549a0edac6a4ae8f; the other changes in
that commit either no longer apply, were already removed or changed
(like the changes to the xen_allocate_pirq_msi() function), or don't
need reverting (like the addition of the XEN_PIRQ_MSI_DATA #define,
which should stay).

The only part that really needed reverting was the code in
xen_hvm_setup_msi_irqs() that checked the pirq number that was cached
in the pci msi message data.
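
As a side note on what "the pirq number cached in the pci msi message
data" looks like concretely: the old code packed the pirq into the MSI
address fields and recovered it on the next enable.  Below is a small
standalone sketch of that round trip; the shift and mask are taken from
the hunk being removed, while the compose side and all names are only
illustrative stand-ins, not the kernel's real msidef.h macros:

/* msi_pirq_cache_demo.c - standalone sketch, not kernel code. */
#include <stdio.h>
#include <stdint.h>

#define DEST_ID_SHIFT   12                      /* assumed MSI_ADDR_DEST_ID_SHIFT */
#define EXT_DEST_ID(x)  ((x) & 0xffffff00u)     /* assumed MSI_ADDR_EXT_DEST_ID() */

struct demo_msg {
    uint32_t address_hi;
    uint32_t address_lo;
    uint32_t data;
};

/* stash a pirq in the address fields, as the compose step conceptually does */
static void compose(struct demo_msg *msg, uint32_t pirq)
{
    msg->address_hi = EXT_DEST_ID(pirq);
    msg->address_lo = (pirq & 0xff) << DEST_ID_SHIFT;
    msg->data = 0;                              /* stands in for XEN_PIRQ_MSI_DATA */
}

/* the decode the old xen_hvm_setup_msi_irqs() hunk performed on re-enable */
static uint32_t decode(const struct demo_msg *msg)
{
    return EXT_DEST_ID(msg->address_hi) |
           ((msg->address_lo >> DEST_ID_SHIFT) & 0xff);
}

int main(void)
{
    struct demo_msg msg;

    compose(&msg, 101);
    printf("cached pirq decodes back to %u\n", decode(&msg));
    /* The round trip works, but once the pirq has been unmapped the
     * recovered number may already belong to another device, which is
     * exactly why this check is being removed. */
    return 0;
}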

Thanks!

>
> And Konrad will need to ack it too as he is Xen-PCI maintainer.
>
> -boris
>


* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-13 20:07                                 ` Dan Streetman
@ 2017-01-13 20:54                                   ` Stefano Stabellini
  2017-01-13 21:49                                   ` Boris Ostrovsky
  2017-01-13 22:30                                   ` Konrad Rzeszutek Wilk
  2 siblings, 0 replies; 34+ messages in thread
From: Stefano Stabellini @ 2017-01-13 20:54 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Stefano Stabellini, jgross, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel,
	linux-kernel, linux-pci

On Fri, 13 Jan 2017, Dan Streetman wrote:
> Revert the main part of commit:
> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> 
> That commit introduced reading the pci device's msi message data to see
> if a pirq was previously configured for the device's msi/msix, and re-use
> that pirq.  At the time, that was the correct behavior.  However, a
> later change to Qemu caused it to call into the Xen hypervisor to unmap
> all pirqs for a pci device, when the pci device disables its MSI/MSIX
> vectors; specifically the Qemu commit:
> c976437c7dba9c7444fb41df45468968aaa326ad
> ("qemu-xen: free all the pirqs for msi/msix when driver unload")
> 
> Once Qemu added this pirq unmapping, it was no longer correct for the
> kernel to re-use the pirq number cached in the pci device msi message
> data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
> pirqs when the pci device disables its MSI/MSIX vectors.
> 
> This bug is causing failures to initialize multiple NVMe controllers
> under Xen, because the NVMe driver sets up a single MSIX vector for
> each controller (concurrently), and then after using that to talk to
> the controller for some configuration data, it disables the single MSIX
> vector and re-configures all the MSIX vectors it needs.  So the MSIX
> setup code tries to re-use the cached pirq from the first vector
> for each controller, but the hypervisor has already given away that
> pirq to another controller, and its initialization fails.
> 
> This is discussed in more detail at:
> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
> 
> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>

Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>


> ---
>  arch/x86/pci/xen.c | 23 +++++++----------------
>  1 file changed, 7 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
> index bedfab9..a00a6c0 100644
> --- a/arch/x86/pci/xen.c
> +++ b/arch/x86/pci/xen.c
> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>  		return 1;
>  
>  	for_each_pci_msi_entry(msidesc, dev) {
> -		__pci_read_msi_msg(msidesc, &msg);
> -		pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
> -			((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> -		if (msg.data != XEN_PIRQ_MSI_DATA ||
> -		    xen_irq_from_pirq(pirq) < 0) {
> -			pirq = xen_allocate_pirq_msi(dev, msidesc);
> -			if (pirq < 0) {
> -				irq = -ENODEV;
> -				goto error;
> -			}
> -			xen_msi_compose_msg(dev, pirq, &msg);
> -			__pci_write_msi_msg(msidesc, &msg);
> -			dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
> -		} else {
> -			dev_dbg(&dev->dev,
> -				"xen: msi already bound to pirq=%d\n", pirq);
> +		pirq = xen_allocate_pirq_msi(dev, msidesc);
> +		if (pirq < 0) {
> +			irq = -ENODEV;
> +			goto error;
>  		}
> +		xen_msi_compose_msg(dev, pirq, &msg);
> +		__pci_write_msi_msg(msidesc, &msg);
> +		dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>  		irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
>  					       (type == PCI_CAP_ID_MSI) ? nvec : 1,
>  					       (type == PCI_CAP_ID_MSIX) ?
> -- 
> 2.9.3
> 


* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-13 20:07                                 ` Dan Streetman
  2017-01-13 20:54                                   ` Stefano Stabellini
@ 2017-01-13 21:49                                   ` Boris Ostrovsky
  2017-01-13 22:30                                   ` Konrad Rzeszutek Wilk
  2 siblings, 0 replies; 34+ messages in thread
From: Boris Ostrovsky @ 2017-01-13 21:49 UTC (permalink / raw)
  To: Dan Streetman, Stefano Stabellini, jgross, Konrad Rzeszutek Wilk
  Cc: Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On 01/13/2017 03:07 PM, Dan Streetman wrote:
> Revert the main part of commit:
> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
>
> That commit introduced reading the pci device's msi message data to see
> if a pirq was previously configured for the device's msi/msix, and re-use
> that pirq.  At the time, that was the correct behavior.  However, a
> later change to Qemu caused it to call into the Xen hypervisor to unmap
> all pirqs for a pci device, when the pci device disables its MSI/MSIX
> vectors; specifically the Qemu commit:
> c976437c7dba9c7444fb41df45468968aaa326ad
> ("qemu-xen: free all the pirqs for msi/msix when driver unload")
>
> Once Qemu added this pirq unmapping, it was no longer correct for the
> kernel to re-use the pirq number cached in the pci device msi message
> data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
> pirqs when the pci device disables its MSI/MSIX vectors.
>
> This bug is causing failures to initialize multiple NVMe controllers
> under Xen, because the NVMe driver sets up a single MSIX vector for
> each controller (concurrently), and then after using that to talk to
> the controller for some configuration data, it disables the single MSIX
> vector and re-configures all the MSIX vectors it needs.  So the MSIX
> setup code tries to re-use the cached pirq from the first vector
> for each controller, but the hypervisor has already given away that
> pirq to another controller, and its initialization fails.
>
> This is discussed in more detail at:
> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
>
> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

> ---
>  arch/x86/pci/xen.c | 23 +++++++----------------
>  1 file changed, 7 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
> index bedfab9..a00a6c0 100644
> --- a/arch/x86/pci/xen.c
> +++ b/arch/x86/pci/xen.c
> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>  		return 1;
>  
>  	for_each_pci_msi_entry(msidesc, dev) {
> -		__pci_read_msi_msg(msidesc, &msg);
> -		pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
> -			((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> -		if (msg.data != XEN_PIRQ_MSI_DATA ||
> -		    xen_irq_from_pirq(pirq) < 0) {
> -			pirq = xen_allocate_pirq_msi(dev, msidesc);
> -			if (pirq < 0) {
> -				irq = -ENODEV;
> -				goto error;
> -			}
> -			xen_msi_compose_msg(dev, pirq, &msg);
> -			__pci_write_msi_msg(msidesc, &msg);
> -			dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
> -		} else {
> -			dev_dbg(&dev->dev,
> -				"xen: msi already bound to pirq=%d\n", pirq);
> +		pirq = xen_allocate_pirq_msi(dev, msidesc);
> +		if (pirq < 0) {
> +			irq = -ENODEV;
> +			goto error;
>  		}
> +		xen_msi_compose_msg(dev, pirq, &msg);
> +		__pci_write_msi_msg(msidesc, &msg);
> +		dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>  		irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
>  					       (type == PCI_CAP_ID_MSI) ? nvec : 1,
>  					       (type == PCI_CAP_ID_MSIX) ?


* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-13 20:07                                 ` Dan Streetman
  2017-01-13 20:54                                   ` Stefano Stabellini
  2017-01-13 21:49                                   ` Boris Ostrovsky
@ 2017-01-13 22:30                                   ` Konrad Rzeszutek Wilk
  2017-02-21 15:31                                     ` Dan Streetman
  2 siblings, 1 reply; 34+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-01-13 22:30 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Stefano Stabellini, jgross, Boris Ostrovsky, Dan Streetman,
	Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote:
> Revert the main part of commit:
> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> 
> That commit introduced reading the pci device's msi message data to see
> if a pirq was previously configured for the device's msi/msix, and re-use
> that pirq.  At the time, that was the correct behavior.  However, a
> later change to Qemu caused it to call into the Xen hypervisor to unmap
> all pirqs for a pci device, when the pci device disables its MSI/MSIX
> vectors; specifically the Qemu commit:
> c976437c7dba9c7444fb41df45468968aaa326ad
> ("qemu-xen: free all the pirqs for msi/msix when driver unload")
> 
> Once Qemu added this pirq unmapping, it was no longer correct for the
> kernel to re-use the pirq number cached in the pci device msi message
> data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
> pirqs when the pci device disables its MSI/MSIX vectors.
> 
> This bug is causing failures to initialize multiple NVMe controllers
> under Xen, because the NVMe driver sets up a single MSIX vector for
> each controller (concurrently), and then after using that to talk to
> the controller for some configuration data, it disables the single MSIX
> vector and re-configures all the MSIX vectors it needs.  So the MSIX
> setup code tries to re-use the cached pirq from the first vector
> for each controller, but the hypervisor has already given away that
> pirq to another controller, and its initialization fails.
> 
> This is discussed in more detail at:
> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
> 
> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>

Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

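To make the failure mode described in the commit message above concrete, the call sequence below is a rough illustration, not the actual NVMe driver code, of two controllers probing concurrently. The names pdev_a, pdev_b, entries_a and nvec_a are placeholders, and pci_enable_msix_range()/pci_disable_msix() stand in for whatever wrappers the driver actually uses.

#include <linux/pci.h>

static void msix_reenable_race_sketch(struct pci_dev *pdev_a,
				      struct pci_dev *pdev_b,
				      struct msix_entry *entries_a,
				      int nvec_a)
{
	struct msix_entry one = { .entry = 0 };

	/* Controller A: a single vector, just to read configuration data. */
	pci_enable_msix_range(pdev_a, &one, 1, 1);
	/* ... talk to controller A over that vector ... */
	pci_disable_msix(pdev_a);	/* Qemu >= 2.1.0 unmaps A's pirq here */

	/* Controller B, probing in parallel, can now be handed A's old pirq. */
	pci_enable_msix_range(pdev_b, &one, 1, 1);

	/*
	 * Controller A re-enables its full vector set.  Before this patch,
	 * the first vector tried to re-use the pirq cached in its MSI
	 * message data, which the hypervisor had already given to B, so
	 * the event-channel bind failed and A's initialization aborted.
	 */
	pci_enable_msix_range(pdev_a, entries_a, 1, nvec_a);
}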

* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-13 22:30                                   ` Konrad Rzeszutek Wilk
@ 2017-02-21 15:31                                     ` Dan Streetman
  2017-02-21 15:45                                       ` Juergen Gross
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Streetman @ 2017-02-21 15:31 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, jgross, Boris Ostrovsky, Dan Streetman,
	Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote:
>> Revert the main part of commit:
>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
>>
>> That commit introduced reading the pci device's msi message data to see
>> if a pirq was previously configured for the device's msi/msix, and re-use
>> that pirq.  At the time, that was the correct behavior.  However, a
>> later change to Qemu caused it to call into the Xen hypervisor to unmap
>> all pirqs for a pci device, when the pci device disables its MSI/MSIX
>> vectors; specifically the Qemu commit:
>> c976437c7dba9c7444fb41df45468968aaa326ad
>> ("qemu-xen: free all the pirqs for msi/msix when driver unload")
>>
>> Once Qemu added this pirq unmapping, it was no longer correct for the
>> kernel to re-use the pirq number cached in the pci device msi message
>> data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
>> pirqs when the pci device disables its MSI/MSIX vectors.
>>
>> This bug is causing failures to initialize multiple NVMe controllers
>> under Xen, because the NVMe driver sets up a single MSIX vector for
>> each controller (concurrently), and then after using that to talk to
>> the controller for some configuration data, it disables the single MSIX
>> vector and re-configures all the MSIX vectors it needs.  So the MSIX
>> setup code tries to re-use the cached pirq from the first vector
>> for each controller, but the hypervisor has already given away that
>> pirq to another controller, and its initialization fails.
>>
>> This is discussed in more detail at:
>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
>>
>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
>
> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

This doesn't seem to be applied yet, is it still waiting on another
ack?  Or maybe I'm looking at the wrong git tree...


* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-02-21 15:31                                     ` Dan Streetman
@ 2017-02-21 15:45                                       ` Juergen Gross
  2017-02-21 15:58                                         ` Boris Ostrovsky
  0 siblings, 1 reply; 34+ messages in thread
From: Juergen Gross @ 2017-02-21 15:45 UTC (permalink / raw)
  To: Dan Streetman, Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, Boris Ostrovsky, Dan Streetman,
	Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On 21/02/17 16:31, Dan Streetman wrote:
> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote:
>>> Revert the main part of commit:
>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
>>>
>>> That commit introduced reading the pci device's msi message data to see
>>> if a pirq was previously configured for the device's msi/msix, and re-use
>>> that pirq.  At the time, that was the correct behavior.  However, a
>>> later change to Qemu caused it to call into the Xen hypervisor to unmap
>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX
>>> vectors; specifically the Qemu commit:
>>> c976437c7dba9c7444fb41df45468968aaa326ad
>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload")
>>>
>>> Once Qemu added this pirq unmapping, it was no longer correct for the
>>> kernel to re-use the pirq number cached in the pci device msi message
>>> data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
>>> pirqs when the pci device disables its MSI/MSIX vectors.
>>>
>>> This bug is causing failures to initialize multiple NVMe controllers
>>> under Xen, because the NVMe driver sets up a single MSIX vector for
>>> each controller (concurrently), and then after using that to talk to
>>> the controller for some configuration data, it disables the single MSIX
>>> vector and re-configures all the MSIX vectors it needs.  So the MSIX
>>> setup code tries to re-use the cached pirq from the first vector
>>> for each controller, but the hypervisor has already given away that
>>> pirq to another controller, and its initialization fails.
>>>
>>> This is discussed in more detail at:
>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
>>>
>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
>>
>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> 
> This doesn't seem to be applied yet, is it still waiting on another
> ack?  Or maybe I'm looking at the wrong git tree...

Am I wrong or shouldn't this go through the PCI tree? Konrad?


Juergen


* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-02-21 15:45                                       ` Juergen Gross
@ 2017-02-21 15:58                                         ` Boris Ostrovsky
  2017-02-22 14:28                                           ` Bjorn Helgaas
  0 siblings, 1 reply; 34+ messages in thread
From: Boris Ostrovsky @ 2017-02-21 15:58 UTC (permalink / raw)
  To: Juergen Gross, Dan Streetman, Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, Dan Streetman, Bjorn Helgaas, xen-devel,
	linux-kernel, linux-pci

On 02/21/2017 10:45 AM, Juergen Gross wrote:
> On 21/02/17 16:31, Dan Streetman wrote:
>> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote:
>>>> Revert the main part of commit:
>>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
>>>>
>>>> That commit introduced reading the pci device's msi message data to see
>>>> if a pirq was previously configured for the device's msi/msix, and re-use
>>>> that pirq.  At the time, that was the correct behavior.  However, a
>>>> later change to Qemu caused it to call into the Xen hypervisor to unmap
>>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX
>>>> vectors; specifically the Qemu commit:
>>>> c976437c7dba9c7444fb41df45468968aaa326ad
>>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload")
>>>>
>>>> Once Qemu added this pirq unmapping, it was no longer correct for the
>>>> kernel to re-use the pirq number cached in the pci device msi message
>>>> data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
>>>> pirqs when the pci device disables its MSI/MSIX vectors.
>>>>
>>>> This bug is causing failures to initialize multiple NVMe controllers
>>>> under Xen, because the NVMe driver sets up a single MSIX vector for
>>>> each controller (concurrently), and then after using that to talk to
>>>> the controller for some configuration data, it disables the single MSIX
>>>> vector and re-configures all the MSIX vectors it needs.  So the MSIX
>>>> setup code tries to re-use the cached pirq from the first vector
>>>> for each controller, but the hypervisor has already given away that
>>>> pirq to another controller, and its initialization fails.
>>>>
>>>> This is discussed in more detail at:
>>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
>>>>
>>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
>>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
>>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> This doesn't seem to be applied yet, is it still waiting on another
>> ack?  Or maybe I'm looking at the wrong git tree...
> Am I wrong or shouldn't this go through the PCI tree? Konrad?

Konrad is away this week but since pull request for Xen tree just went
out we should probably wait until rc1 anyway (unless something big comes
up before that).

-boris


* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-02-21 15:58                                         ` Boris Ostrovsky
@ 2017-02-22 14:28                                           ` Bjorn Helgaas
  2017-02-22 15:14                                             ` Boris Ostrovsky
  0 siblings, 1 reply; 34+ messages in thread
From: Bjorn Helgaas @ 2017-02-22 14:28 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Juergen Gross, Dan Streetman, Konrad Rzeszutek Wilk,
	Stefano Stabellini, Dan Streetman, Bjorn Helgaas, xen-devel,
	linux-kernel, linux-pci

On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote:
> On 02/21/2017 10:45 AM, Juergen Gross wrote:
> > On 21/02/17 16:31, Dan Streetman wrote:
> >> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk
> >> <konrad.wilk@oracle.com> wrote:
> >>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote:
> >>>> Revert the main part of commit:
> >>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> >>>>
> >>>> That commit introduced reading the pci device's msi message data to see
> >>>> if a pirq was previously configured for the device's msi/msix, and re-use
> >>>> that pirq.  At the time, that was the correct behavior.  However, a
> >>>> later change to Qemu caused it to call into the Xen hypervisor to unmap
> >>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX
> >>>> vectors; specifically the Qemu commit:
> >>>> c976437c7dba9c7444fb41df45468968aaa326ad
> >>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload")
> >>>>
> >>>> Once Qemu added this pirq unmapping, it was no longer correct for the
> >>>> kernel to re-use the pirq number cached in the pci device msi message
> >>>> data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
> >>>> pirqs when the pci device disables its MSI/MSIX vectors.
> >>>>
> >>>> This bug is causing failures to initialize multiple NVMe controllers
> >>>> under Xen, because the NVMe driver sets up a single MSIX vector for
> >>>> each controller (concurrently), and then after using that to talk to
> >>>> the controller for some configuration data, it disables the single MSIX
> >>>> vector and re-configures all the MSIX vectors it needs.  So the MSIX
> >>>> setup code tries to re-use the cached pirq from the first vector
> >>>> for each controller, but the hypervisor has already given away that
> >>>> pirq to another controller, and its initialization fails.
> >>>>
> >>>> This is discussed in more detail at:
> >>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
> >>>>
> >>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> >>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
> >>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >> This doesn't seem to be applied yet, is it still waiting on another
> >> ack?  Or maybe I'm looking at the wrong git tree...
> > Am I wrong or shouldn't this go through the PCI tree? Konrad?
> 
> Konrad is away this week but since pull request for Xen tree just went
> out we should probably wait until rc1 anyway (unless something big comes
> up before that).

I assume this should go via the Xen or x86 tree, since that's how most
arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a.
If you think otherwise, let me know.


* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-02-22 14:28                                           ` Bjorn Helgaas
@ 2017-02-22 15:14                                             ` Boris Ostrovsky
  2017-05-03 18:19                                               ` [Xen-devel] " David Woodhouse
  0 siblings, 1 reply; 34+ messages in thread
From: Boris Ostrovsky @ 2017-02-22 15:14 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Juergen Gross, Dan Streetman, Konrad Rzeszutek Wilk,
	Stefano Stabellini, Dan Streetman, Bjorn Helgaas, xen-devel,
	linux-kernel, linux-pci

On 02/22/2017 09:28 AM, Bjorn Helgaas wrote:
> On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote:
>> On 02/21/2017 10:45 AM, Juergen Gross wrote:
>>> On 21/02/17 16:31, Dan Streetman wrote:
>>>> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk
>>>> <konrad.wilk@oracle.com> wrote:
>>>>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote:
>>>>>> Revert the main part of commit:
>>>>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
>>>>>>
>>>>>> That commit introduced reading the pci device's msi message data to see
>>>>>> if a pirq was previously configured for the device's msi/msix, and re-use
>>>>>> that pirq.  At the time, that was the correct behavior.  However, a
>>>>>> later change to Qemu caused it to call into the Xen hypervisor to unmap
>>>>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX
>>>>>> vectors; specifically the Qemu commit:
>>>>>> c976437c7dba9c7444fb41df45468968aaa326ad
>>>>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload")
>>>>>>
>>>>>> Once Qemu added this pirq unmapping, it was no longer correct for the
>>>>>> kernel to re-use the pirq number cached in the pci device msi message
>>>>>> data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
>>>>>> pirqs when the pci device disables its MSI/MSIX vectors.
>>>>>>
>>>>>> This bug is causing failures to initialize multiple NVMe controllers
>>>>>> under Xen, because the NVMe driver sets up a single MSIX vector for
>>>>>> each controller (concurrently), and then after using that to talk to
>>>>>> the controller for some configuration data, it disables the single MSIX
>>>>>> vector and re-configures all the MSIX vectors it needs.  So the MSIX
>>>>>> setup code tries to re-use the cached pirq from the first vector
>>>>>> for each controller, but the hypervisor has already given away that
>>>>>> pirq to another controller, and its initialization fails.
>>>>>>
>>>>>> This is discussed in more detail at:
>>>>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
>>>>>>
>>>>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
>>>>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
>>>>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>>>> This doesn't seem to be applied yet, is it still waiting on another
>>>> ack?  Or maybe I'm looking at the wrong git tree...
>>> Am I wrong or shouldn't this go through the PCI tree? Konrad?
>> Konrad is away this week but since pull request for Xen tree just went
>> out we should probably wait until rc1 anyway (unless something big comes
>> up before that).
> I assume this should go via the Xen or x86 tree, since that's how most
> arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a.
> If you think otherwise, let me know.

OK, I applied it to Xen tree's for-linus-4.11.

-boris


* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-02-22 15:14                                             ` Boris Ostrovsky
@ 2017-05-03 18:19                                               ` David Woodhouse
  2017-05-03 18:43                                                 ` Boris Ostrovsky
  0 siblings, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2017-05-03 18:19 UTC (permalink / raw)
  To: Boris Ostrovsky, Bjorn Helgaas, stable
  Cc: Juergen Gross, Stefano Stabellini, linux-pci, linux-kernel,
	Bjorn Helgaas, xen-devel, Dan Streetman, Dan Streetman


On Wed, 2017-02-22 at 10:14 -0500, Boris Ostrovsky wrote:
> On 02/22/2017 09:28 AM, Bjorn Helgaas wrote:
> > 
> > On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote:
> > > 
> > > On 02/21/2017 10:45 AM, Juergen Gross wrote:
> > > > 
> > > > On 21/02/17 16:31, Dan Streetman wrote:
> > > > > 
> > > > > On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk
> > > > > <konrad.wilk@oracle.com> wrote:
> > > > > > 
> > > > > > On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote:
> > > > > > > 
> > > > > > > Revert the main part of commit:
> > > > > > > af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> > > > > > > 
> > > > > > > That commit introduced reading the pci device's msi message data to see
> > > > > > > if a pirq was previously configured for the device's msi/msix, and re-use
> > > > > > > that pirq.  At the time, that was the correct behavior.  However, a
> > > > > > > later change to Qemu caused it to call into the Xen hypervisor to unmap
> > > > > > > all pirqs for a pci device, when the pci device disables its MSI/MSIX
> > > > > > > vectors; specifically the Qemu commit:
> > > > > > > c976437c7dba9c7444fb41df45468968aaa326ad
> > > > > > > ("qemu-xen: free all the pirqs for msi/msix when driver unload")
> > > > > > > 
> > > > > > > Once Qemu added this pirq unmapping, it was no longer correct for the
> > > > > > > kernel to re-use the pirq number cached in the pci device msi message
> > > > > > > data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
> > > > > > > pirqs when the pci device disables its MSI/MSIX vectors.
> > > > > > > 
> > > > > > > This bug is causing failures to initialize multiple NVMe controllers
> > > > > > > under Xen, because the NVMe driver sets up a single MSIX vector for
> > > > > > > each controller (concurrently), and then after using that to talk to
> > > > > > > the controller for some configuration data, it disables the single MSIX
> > > > > > > vector and re-configures all the MSIX vectors it needs.  So the MSIX
> > > > > > > setup code tries to re-use the cached pirq from the first vector
> > > > > > > for each controller, but the hypervisor has already given away that
> > > > > > > pirq to another controller, and its initialization fails.
> > > > > > > 
> > > > > > > This is discussed in more detail at:
> > > > > > > https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
> > > > > > > 
> > > > > > > Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> > > > > > > Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
> > > > > > Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > > > > This doesn't seem to be applied yet, is it still waiting on another
> > > > > ack?  Or maybe I'm looking at the wrong git tree...
> > > > Am I wrong or shouldn't this go through the PCI tree? Konrad?
> > > Konrad is away this week but since pull request for Xen tree just went
> > > out we should probably wait until rc1 anyway (unless something big comes
> > > up before that).
> > I assume this should go via the Xen or x86 tree, since that's how most
> > arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a.
> > If you think otherwise, let me know.
>
> OK, I applied it to Xen tree's for-linus-4.11.

Hm, we want this (c74fd80f2f4) in stable too, don't we?



* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-05-03 18:19                                               ` [Xen-devel] " David Woodhouse
@ 2017-05-03 18:43                                                 ` Boris Ostrovsky
  2017-05-03 22:59                                                   ` Stefano Stabellini
  0 siblings, 1 reply; 34+ messages in thread
From: Boris Ostrovsky @ 2017-05-03 18:43 UTC (permalink / raw)
  To: David Woodhouse, Bjorn Helgaas, stable, Stefano Stabellini
  Cc: Juergen Gross, linux-pci, linux-kernel, Bjorn Helgaas, xen-devel,
	Dan Streetman, Dan Streetman

On 05/03/2017 02:19 PM, David Woodhouse wrote:
> On Wed, 2017-02-22 at 10:14 -0500, Boris Ostrovsky wrote:
>> On 02/22/2017 09:28 AM, Bjorn Helgaas wrote:
>>> On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote:
>>>> On 02/21/2017 10:45 AM, Juergen Gross wrote:
>>>>> On 21/02/17 16:31, Dan Streetman wrote:
>>>>>> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk
>>>>>> <konrad.wilk@oracle.com> wrote:
>>>>>>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote:
>>>>>>>> Revert the main part of commit:
>>>>>>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
>>>>>>>>
>>>>>>>> That commit introduced reading the pci device's msi message data to see
>>>>>>>> if a pirq was previously configured for the device's msi/msix, and re-use
>>>>>>>> that pirq.  At the time, that was the correct behavior.  However, a
>>>>>>>> later change to Qemu caused it to call into the Xen hypervisor to unmap
>>>>>>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX
>>>>>>>> vectors; specifically the Qemu commit:
>>>>>>>> c976437c7dba9c7444fb41df45468968aaa326ad
>>>>>>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload")
>>>>>>>>
>>>>>>>> Once Qemu added this pirq unmapping, it was no longer correct for the
>>>>>>>> kernel to re-use the pirq number cached in the pci device msi message
>>>>>>>> data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
>>>>>>>> pirqs when the pci device disables its MSI/MSIX vectors.
>>>>>>>>
>>>>>>>> This bug is causing failures to initialize multiple NVMe controllers
>>>>>>>> under Xen, because the NVMe driver sets up a single MSIX vector for
>>>>>>>> each controller (concurrently), and then after using that to talk to
>>>>>>>> the controller for some configuration data, it disables the single MSIX
>>>>>>>> vector and re-configures all the MSIX vectors it needs.  So the MSIX
>>>>>>>> setup code tries to re-use the cached pirq from the first vector
>>>>>>>> for each controller, but the hypervisor has already given away that
>>>>>>>> pirq to another controller, and its initialization fails.
>>>>>>>>
>>>>>>>> This is discussed in more detail at:
>>>>>>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
>>>>>>>>
>>>>>>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
>>>>>>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
>>>>>>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>>>>>> This doesn't seem to be applied yet, is it still waiting on another
>>>>>> ack?  Or maybe I'm looking at the wrong git tree...
>>>>> Am I wrong or shouldn't this go through the PCI tree? Konrad?
>>>> Konrad is away this week but since pull request for Xen tree just went
>>>> out we should probably wait until rc1 anyway (unless something big comes
>>>> up before that).
>>> I assume this should go via the Xen or x86 tree, since that's how most
>>> arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a.
>>> If you think otherwise, let me know.
>> OK, I applied it to Xen tree's for-linus-4.11.
> Hm, we want this (c74fd80f2f4) in stable too, don't we?


Maybe.

Per https://lists.xen.org/archives/html/xen-devel/2017-01/msg00987.html
it may break things on older (4.4-) hypervisors. They are out of
support, which is why this patch went in now but I am not sure this
automatically applies to stable kernels.

Stefano?

-boris


* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-05-03 18:43                                                 ` Boris Ostrovsky
@ 2017-05-03 22:59                                                   ` Stefano Stabellini
  2017-05-03 23:06                                                     ` Greg KH
  0 siblings, 1 reply; 34+ messages in thread
From: Stefano Stabellini @ 2017-05-03 22:59 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: David Woodhouse, Bjorn Helgaas, stable, Stefano Stabellini,
	Juergen Gross, linux-pci, linux-kernel, Bjorn Helgaas, xen-devel,
	Dan Streetman, Dan Streetman

On Wed, 3 May 2017, Boris Ostrovsky wrote:
> On 05/03/2017 02:19 PM, David Woodhouse wrote:
> > On Wed, 2017-02-22 at 10:14 -0500, Boris Ostrovsky wrote:
> >> On 02/22/2017 09:28 AM, Bjorn Helgaas wrote:
> >>> On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote:
> >>>> On 02/21/2017 10:45 AM, Juergen Gross wrote:
> >>>>> On 21/02/17 16:31, Dan Streetman wrote:
> >>>>>> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk
> >>>>>> <konrad.wilk@oracle.com> wrote:
> >>>>>>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote:
> >>>>>>>> Revert the main part of commit:
> >>>>>>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> >>>>>>>>
> >>>>>>>> That commit introduced reading the pci device's msi message data to see
> >>>>>>>> if a pirq was previously configured for the device's msi/msix, and re-use
> >>>>>>>> that pirq.  At the time, that was the correct behavior.  However, a
> >>>>>>>> later change to Qemu caused it to call into the Xen hypervisor to unmap
> >>>>>>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX
> >>>>>>>> vectors; specifically the Qemu commit:
> >>>>>>>> c976437c7dba9c7444fb41df45468968aaa326ad
> >>>>>>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload")
> >>>>>>>>
> >>>>>>>> Once Qemu added this pirq unmapping, it was no longer correct for the
> >>>>>>>> kernel to re-use the pirq number cached in the pci device msi message
> >>>>>>>> data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
> >>>>>>>> pirqs when the pci device disables its MSI/MSIX vectors.
> >>>>>>>>
> >>>>>>>> This bug is causing failures to initialize multiple NVMe controllers
> >>>>>>>> under Xen, because the NVMe driver sets up a single MSIX vector for
> >>>>>>>> each controller (concurrently), and then after using that to talk to
> >>>>>>>> the controller for some configuration data, it disables the single MSIX
> >>>>>>>> vector and re-configures all the MSIX vectors it needs.  So the MSIX
> >>>>>>>> setup code tries to re-use the cached pirq from the first vector
> >>>>>>>> for each controller, but the hypervisor has already given away that
> >>>>>>>> pirq to another controller, and its initialization fails.
> >>>>>>>>
> >>>>>>>> This is discussed in more detail at:
> >>>>>>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
> >>>>>>>>
> >>>>>>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> >>>>>>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
> >>>>>>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >>>>>> This doesn't seem to be applied yet, is it still waiting on another
> >>>>>> ack?  Or maybe I'm looking at the wrong git tree...
> >>>>> Am I wrong or shouldn't this go through the PCI tree? Konrad?
> >>>> Konrad is away this week but since pull request for Xen tree just went
> >>>> out we should probably wait until rc1 anyway (unless something big comes
> >>>> up before that).
> >>> I assume this should go via the Xen or x86 tree, since that's how most
> >>> arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a.
> >>> If you think otherwise, let me know.
> >> OK, I applied it to Xen tree's for-linus-4.11.
> > Hm, we want this (c74fd80f2f4) in stable too, don't we?
> 
> 
> Maybe.
> 
> Per https://lists.xen.org/archives/html/xen-devel/2017-01/msg00987.html
> it may break things on older (4.4-) hypervisors. They are out of
> support, which is why this patch went in now but I am not sure this
> automatically applies to stable kernels.
> 
> Stefano?

This is a difficult call. We could just say that all the broken Xen
versions are out of support, so let's fix all the Linux kernel stable
trees that we can.

Or we could give a look at the release dates. Linux 3.18 is still
maintained and was tagged on Dec 7 2014. Xen 4.4 was tagged on Mar 10
2014. We could ask for a backport to [4.5+], which was released two years
after Xen 4.4. Of course, that cutoff is still arbitrary, but it aims to
minimize breakage.

What do you think?


* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-05-03 22:59                                                   ` Stefano Stabellini
@ 2017-05-03 23:06                                                     ` Greg KH
  2017-05-03 23:12                                                       ` Stefano Stabellini
  0 siblings, 1 reply; 34+ messages in thread
From: Greg KH @ 2017-05-03 23:06 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Boris Ostrovsky, David Woodhouse, Bjorn Helgaas, stable,
	Juergen Gross, linux-pci, linux-kernel, Bjorn Helgaas, xen-devel,
	Dan Streetman, Dan Streetman

On Wed, May 03, 2017 at 03:59:15PM -0700, Stefano Stabellini wrote:
> On Wed, 3 May 2017, Boris Ostrovsky wrote:
> > On 05/03/2017 02:19 PM, David Woodhouse wrote:
> > > On Wed, 2017-02-22 at 10:14 -0500, Boris Ostrovsky wrote:
> > >> On 02/22/2017 09:28 AM, Bjorn Helgaas wrote:
> > >>> On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote:
> > >>>> On 02/21/2017 10:45 AM, Juergen Gross wrote:
> > >>>>> On 21/02/17 16:31, Dan Streetman wrote:
> > >>>>>> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk
> > >>>>>> <konrad.wilk@oracle.com> wrote:
> > >>>>>>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote:
> > >>>>>>>> Revert the main part of commit:
> > >>>>>>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> > >>>>>>>>
> > >>>>>>>> That commit introduced reading the pci device's msi message data to see
> > >>>>>>>> if a pirq was previously configured for the device's msi/msix, and re-use
> > >>>>>>>> that pirq.  At the time, that was the correct behavior.  However, a
> > >>>>>>>> later change to Qemu caused it to call into the Xen hypervisor to unmap
> > >>>>>>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX
> > >>>>>>>> vectors; specifically the Qemu commit:
> > >>>>>>>> c976437c7dba9c7444fb41df45468968aaa326ad
> > >>>>>>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload")
> > >>>>>>>>
> > >>>>>>>> Once Qemu added this pirq unmapping, it was no longer correct for the
> > >>>>>>>> kernel to re-use the pirq number cached in the pci device msi message
> > >>>>>>>> data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
> > >>>>>>>> pirqs when the pci device disables its MSI/MSIX vectors.
> > >>>>>>>>
> > >>>>>>>> This bug is causing failures to initialize multiple NVMe controllers
> > >>>>>>>> under Xen, because the NVMe driver sets up a single MSIX vector for
> > >>>>>>>> each controller (concurrently), and then after using that to talk to
> > >>>>>>>> the controller for some configuration data, it disables the single MSIX
> > >>>>>>>> vector and re-configures all the MSIX vectors it needs.  So the MSIX
> > >>>>>>>> setup code tries to re-use the cached pirq from the first vector
> > >>>>>>>> for each controller, but the hypervisor has already given away that
> > >>>>>>>> pirq to another controller, and its initialization fails.
> > >>>>>>>>
> > >>>>>>>> This is discussed in more detail at:
> > >>>>>>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
> > >>>>>>>>
> > >>>>>>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> > >>>>>>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
> > >>>>>>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > >>>>>> This doesn't seem to be applied yet, is it still waiting on another
> > >>>>>> ack?  Or maybe I'm looking at the wrong git tree...
> > >>>>> Am I wrong or shouldn't this go through the PCI tree? Konrad?
> > >>>> Konrad is away this week but since pull request for Xen tree just went
> > >>>> out we should probably wait until rc1 anyway (unless something big comes
> > >>>> up before that).
> > >>> I assume this should go via the Xen or x86 tree, since that's how most
> > >>> arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a.
> > >>> If you think otherwise, let me know.
> > >> OK, I applied it to Xen tree's for-linus-4.11.
> > > Hm, we want this (c74fd80f2f4) in stable too, don't we?
> > 
> > 
> > Maybe.
> > 
> > Per https://lists.xen.org/archives/html/xen-devel/2017-01/msg00987.html
> > it may break things on older (4.4-) hypervisors. They are out of
> > support, which is why this patch went in now but I am not sure this
> > automatically applies to stable kernels.
> > 
> > Stefano?
> 
> This is a difficult call. We could just say that all the broken Xen
> versions are out of support, so let's fix all the Linux kernel stable
> trees that we can.
> 
> Or we could give a look at the release dates. Linux 3.18 is still
> maintained and was tagged on Dec 7 2014.

Don't do anything "special" for 3.18 if you have to.  I'm only
semi-maintaining it because some SoC vendors never upstreamed their
trees and lots of devices rely on it.  None of them use Xen on their
platforms, so no need for me to backport any change there.

thanks,

greg k-h


* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-05-03 23:06                                                     ` Greg KH
@ 2017-05-03 23:12                                                       ` Stefano Stabellini
  2017-05-03 23:19                                                         ` Greg KH
  0 siblings, 1 reply; 34+ messages in thread
From: Stefano Stabellini @ 2017-05-03 23:12 UTC (permalink / raw)
  To: Greg KH
  Cc: Stefano Stabellini, Boris Ostrovsky, David Woodhouse,
	Bjorn Helgaas, stable, Juergen Gross, linux-pci, linux-kernel,
	Bjorn Helgaas, xen-devel, Dan Streetman, Dan Streetman

On Wed, 3 May 2017, Greg KH wrote:
> On Wed, May 03, 2017 at 03:59:15PM -0700, Stefano Stabellini wrote:
> > On Wed, 3 May 2017, Boris Ostrovsky wrote:
> > > On 05/03/2017 02:19 PM, David Woodhouse wrote:
> > > > On Wed, 2017-02-22 at 10:14 -0500, Boris Ostrovsky wrote:
> > > >> On 02/22/2017 09:28 AM, Bjorn Helgaas wrote:
> > > >>> On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote:
> > > >>>> On 02/21/2017 10:45 AM, Juergen Gross wrote:
> > > >>>>> On 21/02/17 16:31, Dan Streetman wrote:
> > > >>>>>> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk
> > > >>>>>> <konrad.wilk@oracle.com> wrote:
> > > >>>>>>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote:
> > > >>>>>>>> Revert the main part of commit:
> > > >>>>>>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> > > >>>>>>>>
> > > >>>>>>>> That commit introduced reading the pci device's msi message data to see
> > > >>>>>>>> if a pirq was previously configured for the device's msi/msix, and re-use
> > > >>>>>>>> that pirq.  At the time, that was the correct behavior.  However, a
> > > >>>>>>>> later change to Qemu caused it to call into the Xen hypervisor to unmap
> > > >>>>>>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX
> > > >>>>>>>> vectors; specifically the Qemu commit:
> > > >>>>>>>> c976437c7dba9c7444fb41df45468968aaa326ad
> > > >>>>>>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload")
> > > >>>>>>>>
> > > >>>>>>>> Once Qemu added this pirq unmapping, it was no longer correct for the
> > > >>>>>>>> kernel to re-use the pirq number cached in the pci device msi message
> > > >>>>>>>> data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
> > > >>>>>>>> pirqs when the pci device disables its MSI/MSIX vectors.
> > > >>>>>>>>
> > > >>>>>>>> This bug is causing failures to initialize multiple NVMe controllers
> > > >>>>>>>> under Xen, because the NVMe driver sets up a single MSIX vector for
> > > >>>>>>>> each controller (concurrently), and then after using that to talk to
> > > >>>>>>>> the controller for some configuration data, it disables the single MSIX
> > > >>>>>>>> vector and re-configures all the MSIX vectors it needs.  So the MSIX
> > > >>>>>>>> setup code tries to re-use the cached pirq from the first vector
> > > >>>>>>>> for each controller, but the hypervisor has already given away that
> > > >>>>>>>> pirq to another controller, and its initialization fails.
> > > >>>>>>>>
> > > >>>>>>>> This is discussed in more detail at:
> > > >>>>>>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
> > > >>>>>>>>
> > > >>>>>>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> > > >>>>>>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
> > > >>>>>>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > > >>>>>> This doesn't seem to be applied yet, is it still waiting on another
> > > >>>>>> ack?  Or maybe I'm looking at the wrong git tree...
> > > >>>>> Am I wrong or shouldn't this go through the PCI tree? Konrad?
> > > >>>> Konrad is away this week but since pull request for Xen tree just went
> > > >>>> out we should probably wait until rc1 anyway (unless something big comes
> > > >>>> up before that).
> > > >>> I assume this should go via the Xen or x86 tree, since that's how most
> > > >>> arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a.
> > > >>> If you think otherwise, let me know.
> > > >> OK, I applied it to Xen tree's for-linus-4.11.
> > > > Hm, we want this (c74fd80f2f4) in stable too, don't we?
> > > 
> > > 
> > > Maybe.
> > > 
> > > Per https://lists.xen.org/archives/html/xen-devel/2017-01/msg00987.html
> > > it may break things on older (4.4-) hypervisors. They are out of
> > > support, which is why this patch went in now but I am not sure this
> > > automatically applies to stable kernels.
> > > 
> > > Stefano?
> > 
> > This is a difficult call. We could just say that all the broken Xen
> > versions are out of support, so let's fix all the Linux kernel stable
> > trees that we can.
> > 
> > Or we could give a look at the release dates. Linux 3.18 is still
> > maintained and was tagged on Dec 7 2014.
> 
> Don't do anything "special" for 3.18 if you have to.  I'm only
> semi-maintaining it because some SoC vendors never upstreamed their
> trees and lots of devices rely on it.  None of them use Xen on their
> platforms, so no need for me to backport any change there.

Thanks Greg, that is good info. Is 4.4 the oldest Linux tree fully
maintained?

If so, I think we should just backport this fix to all Linux trees >=
4.4, given that Linux 4.4 is from Jan 2016.


* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-05-03 23:12                                                       ` Stefano Stabellini
@ 2017-05-03 23:19                                                         ` Greg KH
  0 siblings, 0 replies; 34+ messages in thread
From: Greg KH @ 2017-05-03 23:19 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Boris Ostrovsky, David Woodhouse, Bjorn Helgaas, stable,
	Juergen Gross, linux-pci, linux-kernel, Bjorn Helgaas, xen-devel,
	Dan Streetman, Dan Streetman

On Wed, May 03, 2017 at 04:12:29PM -0700, Stefano Stabellini wrote:
> On Wed, 3 May 2017, Greg KH wrote:
> > On Wed, May 03, 2017 at 03:59:15PM -0700, Stefano Stabellini wrote:
> > > On Wed, 3 May 2017, Boris Ostrovsky wrote:
> > > > On 05/03/2017 02:19 PM, David Woodhouse wrote:
> > > > > On Wed, 2017-02-22 at 10:14 -0500, Boris Ostrovsky wrote:
> > > > >> On 02/22/2017 09:28 AM, Bjorn Helgaas wrote:
> > > > >>> On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote:
> > > > >>>> On 02/21/2017 10:45 AM, Juergen Gross wrote:
> > > > >>>>> On 21/02/17 16:31, Dan Streetman wrote:
> > > > >>>>>> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk
> > > > >>>>>> <konrad.wilk@oracle.com> wrote:
> > > > >>>>>>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote:
> > > > >>>>>>>> Revert the main part of commit:
> > > > >>>>>>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> > > > >>>>>>>>
> > > > >>>>>>>> That commit introduced reading the pci device's msi message data to see
> > > > >>>>>>>> if a pirq was previously configured for the device's msi/msix, and re-use
> > > > >>>>>>>> that pirq.  At the time, that was the correct behavior.  However, a
> > > > >>>>>>>> later change to Qemu caused it to call into the Xen hypervisor to unmap
> > > > >>>>>>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX
> > > > >>>>>>>> vectors; specifically the Qemu commit:
> > > > >>>>>>>> c976437c7dba9c7444fb41df45468968aaa326ad
> > > > >>>>>>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload")
> > > > >>>>>>>>
> > > > >>>>>>>> Once Qemu added this pirq unmapping, it was no longer correct for the
> > > > >>>>>>>> kernel to re-use the pirq number cached in the pci device msi message
> > > > >>>>>>>> data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
> > > > >>>>>>>> pirqs when the pci device disables its MSI/MSIX vectors.
> > > > >>>>>>>>
> > > > >>>>>>>> This bug is causing failures to initialize multiple NVMe controllers
> > > > >>>>>>>> under Xen, because the NVMe driver sets up a single MSIX vector for
> > > > >>>>>>>> each controller (concurrently), and then after using that to talk to
> > > > >>>>>>>> the controller for some configuration data, it disables the single MSIX
> > > > >>>>>>>> vector and re-configures all the MSIX vectors it needs.  So the MSIX
> > > > >>>>>>>> setup code tries to re-use the cached pirq from the first vector
> > > > >>>>>>>> for each controller, but the hypervisor has already given away that
> > > > >>>>>>>> pirq to another controller, and its initialization fails.
> > > > >>>>>>>>
> > > > >>>>>>>> This is discussed in more detail at:
> > > > >>>>>>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
> > > > >>>>>>>>
> > > > >>>>>>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
> > > > >>>>>>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
> > > > >>>>>>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > > > >>>>>> This doesn't seem to be applied yet, is it still waiting on another
> > > > >>>>>> ack?  Or maybe I'm looking at the wrong git tree...
> > > > >>>>> Am I wrong or shouldn't this go through the PCI tree? Konrad?
> > > > >>>> Konrad is away this week but since pull request for Xen tree just went
> > > > >>>> out we should probably wait until rc1 anyway (unless something big comes
> > > > >>>> up before that).
> > > > >>> I assume this should go via the Xen or x86 tree, since that's how most
> > > > >>> arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a.
> > > > >>> If you think otherwise, let me know.
> > > > >> OK, I applied it to Xen tree's for-linus-4.11.
> > > > > Hm, we want this (c74fd80f2f4) in stable too, don't we?
> > > > 
> > > > 
> > > > Maybe.
> > > > 
> > > > Per https://lists.xen.org/archives/html/xen-devel/2017-01/msg00987.html
> > > > it may break things on older (4.4-) hypervisors. They are out of
> > > > support, which is why this patch went in now but I am not sure this
> > > > automatically applies to stable kernels.
> > > > 
> > > > Stefano?
> > > 
> > > This is a difficult call. We could just say that all the broken Xen
> > > versions are out of support, so let's fix all the Linux kernel stable
> > > trees that we can.
> > > 
> > > Or we could give a look at the release dates. Linux 3.18 is still
> > > maintained and was tagged on Dec 7 2014.
> > 
> > Don't do anything "special" for 3.18 if you have to.  I'm only
> > semi-maintaining it because some SoC vendors never upstreamed their
> > trees and lots of devices rely on it.  None of them use Xen on their
> > platforms, so no need for me to backport any change there.
> 
> Thanks Greg, that is good info. Is 4.4 the oldest Linux tree fully
> maintained?

Well, the one that _I_ fully maintain :)

There are some older ones, but you are free to say how far back any
patch should go; you're the maintainers of this code, not me...

thanks,

greg k-h



Thread overview: 34+ messages
2017-01-05 19:28 [PATCH] xen: do not re-use pirq number cached in pci device msi msg data Dan Streetman
2017-01-07  1:06 ` Konrad Rzeszutek Wilk
2017-01-09 14:59   ` Boris Ostrovsky
2017-01-09 15:42     ` [Xen-devel] " Dan Streetman
2017-01-09 15:59       ` Konrad Rzeszutek Wilk
2017-01-09 19:30         ` Stefano Stabellini
2017-01-10 15:57           ` Dan Streetman
2017-01-10 18:41             ` Dan Streetman
2017-01-10 19:03               ` Stefano Stabellini
2017-01-10 21:32                 ` Dan Streetman
2017-01-10 23:28                   ` Boris Ostrovsky
2017-01-11  1:25                   ` Stefano Stabellini
2017-01-11 15:26                     ` Dan Streetman
2017-01-11 18:46                       ` Stefano Stabellini
2017-01-11 23:25                         ` Dan Streetman
2017-01-13 18:31                           ` Dan Streetman
2017-01-13 18:44                             ` Stefano Stabellini
2017-01-13 20:00                               ` Boris Ostrovsky
2017-01-13 20:07                                 ` Dan Streetman
2017-01-13 20:54                                   ` Stefano Stabellini
2017-01-13 21:49                                   ` Boris Ostrovsky
2017-01-13 22:30                                   ` Konrad Rzeszutek Wilk
2017-02-21 15:31                                     ` Dan Streetman
2017-02-21 15:45                                       ` Juergen Gross
2017-02-21 15:58                                         ` Boris Ostrovsky
2017-02-22 14:28                                           ` Bjorn Helgaas
2017-02-22 15:14                                             ` Boris Ostrovsky
2017-05-03 18:19                                               ` [Xen-devel] " David Woodhouse
2017-05-03 18:43                                                 ` Boris Ostrovsky
2017-05-03 22:59                                                   ` Stefano Stabellini
2017-05-03 23:06                                                     ` Greg KH
2017-05-03 23:12                                                       ` Stefano Stabellini
2017-05-03 23:19                                                         ` Greg KH
2017-01-13 20:13                                 ` Dan Streetman
