* [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
@ 2017-01-05 19:28 Dan Streetman
  2017-01-07  1:06 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Streetman @ 2017-01-05 19:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Dan Streetman, Bjorn Helgaas, xen-devel, linux-pci, linux-kernel

Do not read a pci device's msi message data to see if a pirq was
previously configured for the device's msi/msix, as the old pirq was
unmapped and may now be in use by another pci device. The previous
pirq should never be re-used; instead a new pirq should always be
allocated from the hypervisor.

The xen_hvm_setup_msi_irqs() function currently checks the pci device's
msi descriptor message data for each msi/msix vector it sets up, and if
it finds the vector was previously configured with a pirq, and that pirq
is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
from the hypervisor. However, that pirq was unmapped when the pci device
disabled its msi/msix, and it cannot be re-used; it may have been given
to a different pci device.

This exact situation is happening in a Xen guest where multiple NVMe
controllers (pci devices) are present. The NVMe driver configures each
pci device's msi/msix twice; first to configure a single vector (to
talk to the controller for its configuration info), and then it disables
that msi/msix and re-configures with all the msi/msix it needs. When
multiple NVMe controllers are present, this happens concurrently on all
of them, and in the time between controller A calling pci_disable_msix()
and then calling pci_enable_msix_range(), controller B enables its msix
and gets controller A's pirq allocated from the hypervisor. Then when
controller A re-configures its msix, its first vector tries to re-use
the same pirq that it had before; but that pirq was allocated to
controller B, and thus the Xen event channel for controller A's re-used
pirq fails to map its irq to that pirq; the hypervisor already has the
pirq mapped elsewhere.

Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
---
 arch/x86/pci/xen.c | 23 +++++++----------------
 1 file changed, 7 insertions(+), 16 deletions(-)

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index bedfab9..a00a6c0 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 		return 1;
 
 	for_each_pci_msi_entry(msidesc, dev) {
-		__pci_read_msi_msg(msidesc, &msg);
-		pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
-			((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
-		if (msg.data != XEN_PIRQ_MSI_DATA ||
-		    xen_irq_from_pirq(pirq) < 0) {
-			pirq = xen_allocate_pirq_msi(dev, msidesc);
-			if (pirq < 0) {
-				irq = -ENODEV;
-				goto error;
-			}
-			xen_msi_compose_msg(dev, pirq, &msg);
-			__pci_write_msi_msg(msidesc, &msg);
-			dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
-		} else {
-			dev_dbg(&dev->dev,
-				"xen: msi already bound to pirq=%d\n", pirq);
+		pirq = xen_allocate_pirq_msi(dev, msidesc);
+		if (pirq < 0) {
+			irq = -ENODEV;
+			goto error;
 		}
+		xen_msi_compose_msg(dev, pirq, &msg);
+		__pci_write_msi_msg(msidesc, &msg);
+		dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
 		irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
					       (type == PCI_CAP_ID_MSI) ? nvec : 1,
					       (type == PCI_CAP_ID_MSIX) ?
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 34+ messages in thread
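The A/B race described in the commit message can be sketched as a tiny toy model (Python for brevity; the class and method names are invented stand-ins for the hypercalls and are not real Xen or kernel interfaces):

```python
# Toy model of the pirq re-use race -- not kernel code.
# Hypervisor, get_free_pirq, bind_pirq, close_evtchn are illustrative names.

class Hypervisor:
    def __init__(self):
        self.free_pirqs = [54, 55, 56]   # arbitrary pirq numbers
        self.bound = {}                  # pirq -> owning device

    def get_free_pirq(self):             # models PHYSDEVOP_get_free_pirq
        return self.free_pirqs.pop(0)

    def bind_pirq(self, pirq, owner):    # models EVTCHNOP_bind_pirq
        if pirq in self.bound:
            raise RuntimeError("pirq %d already mapped to %s" % (pirq, self.bound[pirq]))
        self.bound[pirq] = owner

    def close_evtchn(self, pirq):        # models EVTCHNOP_close: unmaps AND frees
        del self.bound[pirq]
        self.free_pirqs.insert(0, pirq)

xen = Hypervisor()

# Controller A: first configuration with a single vector.
a_pirq = xen.get_free_pirq()             # pirq number cached in A's MSI msg data
xen.bind_pirq(a_pirq, "A")
xen.close_evtchn(a_pirq)                 # A disables msix; pirq freed inside Xen

# Controller B enables its msix in the window and gets A's old pirq.
b_pirq = xen.get_free_pirq()
xen.bind_pirq(b_pirq, "B")
assert b_pirq == a_pirq

# Controller A re-configures and re-uses the cached pirq -> bind fails.
try:
    xen.bind_pirq(a_pirq, "A")
    failed = False
except RuntimeError:
    failed = True
print(failed)  # True: the event channel cannot map to the stolen pirq
```

With the patch above, controller A would instead call the equivalent of `get_free_pirq()` again on re-configure, sidestepping the collision.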
* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-05 19:28 [PATCH] xen: do not re-use pirq number cached in pci device msi msg data Dan Streetman
@ 2017-01-07  1:06 ` Konrad Rzeszutek Wilk
  2017-01-09 14:59   ` Boris Ostrovsky
  0 siblings, 1 reply; 34+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-01-07  1:06 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Bjorn Helgaas, xen-devel, linux-pci, linux-kernel, boris.ostrovsky

On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
> Do not read a pci device's msi message data to see if a pirq was
> previously configured for the device's msi/msix, as the old pirq was
> unmapped and may now be in use by another pci device. The previous
> pirq should never be re-used; instead a new pirq should always be
> allocated from the hypervisor.

Won't this cause a starvation problem? That is, we will run out of PIRQs
as we are not reusing them?

> 
> The xen_hvm_setup_msi_irqs() function currently checks the pci device's
> msi descriptor message data for each msi/msix vector it sets up, and if
> it finds the vector was previously configured with a pirq, and that pirq
> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
> from the hypervisor. However, that pirq was unmapped when the pci device
> disabled its msi/msix, and it cannot be re-used; it may have been given
> to a different pci device.

Hm, but this implies that we do keep track of it.

while (true)
do
	rmmod nvme
	modprobe nvme
done

Things go boom without this patch. But with this patch does this
still work? As in we don't run out of PIRQs?

Thanks.
> 
> This exact situation is happening in a Xen guest where multiple NVMe
> controllers (pci devices) are present. The NVMe driver configures each
> pci device's msi/msix twice; first to configure a single vector (to
> talk to the controller for its configuration info), and then it disables
> that msi/msix and re-configures with all the msi/msix it needs. When
> multiple NVMe controllers are present, this happens concurrently on all
> of them, and in the time between controller A calling pci_disable_msix()
> and then calling pci_enable_msix_range(), controller B enables its msix
> and gets controller A's pirq allocated from the hypervisor. Then when
> controller A re-configures its msix, its first vector tries to re-use
> the same pirq that it had before; but that pirq was allocated to
> controller B, and thus the Xen event channel for controller A's re-used
> pirq fails to map its irq to that pirq; the hypervisor already has the
> pirq mapped elsewhere.
> 
> Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
> ---
>  arch/x86/pci/xen.c | 23 +++++++----------------
>  1 file changed, 7 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
> index bedfab9..a00a6c0 100644
> --- a/arch/x86/pci/xen.c
> +++ b/arch/x86/pci/xen.c
> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>  		return 1;
>  
>  	for_each_pci_msi_entry(msidesc, dev) {
> -		__pci_read_msi_msg(msidesc, &msg);
> -		pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
> -			((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> -		if (msg.data != XEN_PIRQ_MSI_DATA ||
> -		    xen_irq_from_pirq(pirq) < 0) {
> -			pirq = xen_allocate_pirq_msi(dev, msidesc);
> -			if (pirq < 0) {
> -				irq = -ENODEV;
> -				goto error;
> -			}
> -			xen_msi_compose_msg(dev, pirq, &msg);
> -			__pci_write_msi_msg(msidesc, &msg);
> -			dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
> -		} else {
> -			dev_dbg(&dev->dev,
> -				"xen: msi already bound to pirq=%d\n", pirq);
> +		pirq = xen_allocate_pirq_msi(dev, msidesc);
> +		if (pirq < 0) {
> +			irq = -ENODEV;
> +			goto error;
>  		}
> +		xen_msi_compose_msg(dev, pirq, &msg);
> +		__pci_write_msi_msg(msidesc, &msg);
> +		dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>  		irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
>					       (type == PCI_CAP_ID_MSI) ? nvec : 1,
>					       (type == PCI_CAP_ID_MSIX) ?
> -- 
> 2.9.3
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-07  1:06 ` Konrad Rzeszutek Wilk
@ 2017-01-09 14:59   ` Boris Ostrovsky
  2017-01-09 15:42     ` [Xen-devel] " Dan Streetman
  0 siblings, 1 reply; 34+ messages in thread
From: Boris Ostrovsky @ 2017-01-09 14:59 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Dan Streetman
  Cc: Bjorn Helgaas, xen-devel, linux-pci, linux-kernel

On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
>> Do not read a pci device's msi message data to see if a pirq was
>> previously configured for the device's msi/msix, as the old pirq was
>> unmapped and may now be in use by another pci device. The previous
>> pirq should never be re-used; instead a new pirq should always be
>> allocated from the hypervisor.
> Won't this cause a starvation problem? That is, we will run out of PIRQs
> as we are not reusing them?

Don't we free the pirq when we unmap it?

-boris

>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's
>> msi descriptor message data for each msi/msix vector it sets up, and if
>> it finds the vector was previously configured with a pirq, and that pirq
>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
>> from the hypervisor. However, that pirq was unmapped when the pci device
>> disabled its msi/msix, and it cannot be re-used; it may have been given
>> to a different pci device.
> Hm, but this implies that we do keep track of it.
> 
> 
> while (true)
> do
> rmmod nvme
> modprobe nvme
> done
> 
> Things go boom without this patch. But with this patch does this
> still work? As in we don't run out of PIRQs?
> 
> Thanks.
>> This exact situation is happening in a Xen guest where multiple NVMe
>> controllers (pci devices) are present.
The NVMe driver configures each >> pci device's msi/msix twice; first to configure a single vector (to >> talk to the controller for its configuration info), and then it disables >> that msi/msix and re-configures with all the msi/msix it needs. When >> multiple NVMe controllers are present, this happens concurrently on all >> of them, and in the time between controller A calling pci_disable_msix() >> and then calling pci_enable_msix_range(), controller B enables its msix >> and gets controller A's pirq allocated from the hypervisor. Then when >> controller A re-configures its msix, its first vector tries to re-use >> the same pirq that it had before; but that pirq was allocated to >> controller B, and thus the Xen event channel for controller A's re-used >> pirq fails to map its irq to that pirq; the hypervisor already has the >> pirq mapped elsewhere. >> >> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> >> --- >> arch/x86/pci/xen.c | 23 +++++++---------------- >> 1 file changed, 7 insertions(+), 16 deletions(-) >> >> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c >> index bedfab9..a00a6c0 100644 >> --- a/arch/x86/pci/xen.c >> +++ b/arch/x86/pci/xen.c >> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) >> return 1; >> >> for_each_pci_msi_entry(msidesc, dev) { >> - __pci_read_msi_msg(msidesc, &msg); >> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | >> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); >> - if (msg.data != XEN_PIRQ_MSI_DATA || >> - xen_irq_from_pirq(pirq) < 0) { >> - pirq = xen_allocate_pirq_msi(dev, msidesc); >> - if (pirq < 0) { >> - irq = -ENODEV; >> - goto error; >> - } >> - xen_msi_compose_msg(dev, pirq, &msg); >> - __pci_write_msi_msg(msidesc, &msg); >> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >> - } else { >> - dev_dbg(&dev->dev, >> - "xen: msi already bound to pirq=%d\n", pirq); >> + pirq = xen_allocate_pirq_msi(dev, msidesc); >> + if (pirq < 0) { >> + 
irq = -ENODEV; >> + goto error; >> } >> + xen_msi_compose_msg(dev, pirq, &msg); >> + __pci_write_msi_msg(msidesc, &msg); >> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, >> (type == PCI_CAP_ID_MSI) ? nvec : 1, >> (type == PCI_CAP_ID_MSIX) ? >> -- >> 2.9.3 >> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-09 14:59 ` Boris Ostrovsky
@ 2017-01-09 15:42   ` Dan Streetman
  2017-01-09 15:59     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Streetman @ 2017-01-09 15:42 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Konrad Rzeszutek Wilk, Dan Streetman, Bjorn Helgaas, xen-devel,
	linux-kernel, linux-pci

On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
>> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
>>> Do not read a pci device's msi message data to see if a pirq was
>>> previously configured for the device's msi/msix, as the old pirq was
>>> unmapped and may now be in use by another pci device. The previous
>>> pirq should never be re-used; instead a new pirq should always be
>>> allocated from the hypervisor.
>> Won't this cause a starvation problem? That is, we will run out of PIRQs
>> as we are not reusing them?
>
> Don't we free the pirq when we unmap it?

I think this is actually a bit worse than I initially thought. After
looking a bit closer, I think there's an asymmetry with pirq
allocation:

tl;dr:

pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq
allocated, and reserved in the hypervisor

request_irq() -> an event channel is opened for the specific pirq, and
maps the pirq with the hypervisor

free_irq() -> the event channel is closed, and the pirq is unmapped,
but that unmap function also frees the pirq! The hypervisor can/will
give it away to the next call to get_free_pirq. However, the pci
msi/msix data area still contains the pirq number, and the next call
to request_irq() will re-use the pirq.

pci_disable_msix() -> this has no effect on the pirq in the hypervisor
(it's already been freed), and it also doesn't clear anything from the
msi/msix data area, so the pirq is still cached there.

It seems like the hypervisor needs to be fixed to *not* unmap the pirq
when the event channel is closed - or at least, not to change it to
IRQ_UNBOUND state? And, the pci_disable_msix call should eventually
call into something in the Xen guest kernel that actually does the
pirq unmapping, and clear it from the msi data area (i.e.
pci_write_msi_msg)

Alternately, if the hypervisor doesn't change, then the call into the
hypervisor to actually allocate a pirq needs to move from the
pci_enable_msix_range() call to the request_irq() call? So that when
the msi/msix range is enabled, it doesn't actually reserve any pirq's
for any of the vectors; each request_irq/free_irq pair do the pirq
allocate-and-map/unmap...

longer details:

The chain of function calls starts in the initial call to configure
the msi vectors, which eventually calls __pci_enable_msix_range (or
msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which
either tries to re-use any cached pirq in the MSI data area, or (for
the first time setup) allocates a new pirq from the hypervisor via
PHYSDEVOP_get_free_pirq. That pirq is then reserved from the
hypervisor's perspective, but it's not mapped to anything in the guest
kernel.

Then, the driver calls request_irq to actually start using the irq,
which calls __setup_irq to irq_startup to startup_pirq. The
startup_pirq call actually creates the evtchn and binds it to the
previously allocated pirq via EVTCHNOP_bind_pirq.

At this point, the pirq is bound to a guest kernel evtchn (and irq)
and is in use. But then, when the driver doesn't want it anymore, it
calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that
function closes the evtchn via EVTCHNOP_close.

Inside the hypervisor, in xen/common/event_channel.c in
evtchn_close(), if the channel is type ECS_PIRQ (which our pirq
channel is) then it unmaps the pirq mapping via
unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back
to state IRQ_UNBOUND, which makes it available for the hypervisor to
give away to anyone requesting a new pirq!

>
> -boris
>
>>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's
>>> msi descriptor message data for each msi/msix vector it sets up, and if
>>> it finds the vector was previously configured with a pirq, and that pirq
>>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
>>> from the hypervisor. However, that pirq was unmapped when the pci device
>>> disabled its msi/msix, and it cannot be re-used; it may have been given
>>> to a different pci device.
>> Hm, but this implies that we do keep track of it.
>>
>>
>> while (true)
>> do
>> rmmod nvme
>> modprobe nvme
>> done
>>
>> Things go boom without this patch. But with this patch does this
>> still work? As in we don't run out of PIRQs?
>>
>> Thanks.
>>> This exact situation is happening in a Xen guest where multiple NVMe
>>> controllers (pci devices) are present. The NVMe driver configures each
>>> pci device's msi/msix twice; first to configure a single vector (to
>>> talk to the controller for its configuration info), and then it disables
>>> that msi/msix and re-configures with all the msi/msix it needs. When
>>> multiple NVMe controllers are present, this happens concurrently on all
>>> of them, and in the time between controller A calling pci_disable_msix()
>>> and then calling pci_enable_msix_range(), controller B enables its msix
>>> and gets controller A's pirq allocated from the hypervisor. Then when
>>> controller A re-configures its msix, its first vector tries to re-use
>>> the same pirq that it had before; but that pirq was allocated to
>>> controller B, and thus the Xen event channel for controller A's re-used
>>> pirq fails to map its irq to that pirq; the hypervisor already has the
>>> pirq mapped elsewhere.
>>> >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> >>> --- >>> arch/x86/pci/xen.c | 23 +++++++---------------- >>> 1 file changed, 7 insertions(+), 16 deletions(-) >>> >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c >>> index bedfab9..a00a6c0 100644 >>> --- a/arch/x86/pci/xen.c >>> +++ b/arch/x86/pci/xen.c >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) >>> return 1; >>> >>> for_each_pci_msi_entry(msidesc, dev) { >>> - __pci_read_msi_msg(msidesc, &msg); >>> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | >>> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); >>> - if (msg.data != XEN_PIRQ_MSI_DATA || >>> - xen_irq_from_pirq(pirq) < 0) { >>> - pirq = xen_allocate_pirq_msi(dev, msidesc); >>> - if (pirq < 0) { >>> - irq = -ENODEV; >>> - goto error; >>> - } >>> - xen_msi_compose_msg(dev, pirq, &msg); >>> - __pci_write_msi_msg(msidesc, &msg); >>> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >>> - } else { >>> - dev_dbg(&dev->dev, >>> - "xen: msi already bound to pirq=%d\n", pirq); >>> + pirq = xen_allocate_pirq_msi(dev, msidesc); >>> + if (pirq < 0) { >>> + irq = -ENODEV; >>> + goto error; >>> } >>> + xen_msi_compose_msg(dev, pirq, &msg); >>> + __pci_write_msi_msg(msidesc, &msg); >>> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >>> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, >>> (type == PCI_CAP_ID_MSI) ? nvec : 1, >>> (type == PCI_CAP_ID_MSIX) ? >>> -- >>> 2.9.3 >>> > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 34+ messages in thread
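Dan's tl;dr above can be condensed into a toy state model that makes the allocate/free asymmetry visible (Python for brevity; all class and method names are invented for illustration, none of this is kernel or Xen code):

```python
# Toy model of the guest-side MSI vector lifecycle described above.
# All names are illustrative stand-ins, not real kernel interfaces.

class Hyp:
    """Stand-in for the hypervisor's pirq bookkeeping."""
    def __init__(self):
        self.free = [33]          # pool of free pirqs (one, for simplicity)
        self.mapped = set()       # pirqs currently bound to an evtchn

    def get_free_pirq(self):      # models PHYSDEVOP_get_free_pirq
        return self.free.pop()

    def bind(self, pirq):         # models EVTCHNOP_bind_pirq
        assert pirq not in self.mapped
        self.mapped.add(pirq)

    def unbind_and_free(self, pirq):  # models what EVTCHNOP_close does today
        self.mapped.discard(pirq)
        self.free.append(pirq)    # the pirq returns to the free pool here!

class Vector:
    """One MSI/MSI-X vector as seen from the guest."""
    def __init__(self, hyp):
        self.hyp = hyp
        self.cached_pirq = None   # pirq number cached in the MSI msg data

    def pci_enable_msix_range(self):
        self.cached_pirq = self.hyp.get_free_pirq()  # pirq reserved

    def request_irq(self):
        self.hyp.bind(self.cached_pirq)

    def free_irq(self):
        self.hyp.unbind_and_free(self.cached_pirq)   # frees the pirq...

    def pci_disable_msix(self):
        pass                      # ...but nothing clears cached_pirq here

hyp = Hyp()
v = Vector(hyp)
v.pci_enable_msix_range()
v.request_irq()
v.free_irq()
v.pci_disable_msix()

# The asymmetry: Xen already considers pirq 33 free again, yet the
# guest still has it cached in the MSI message data.
print(33 in hyp.free, v.cached_pirq)   # True 33
```

Any other device that calls `get_free_pirq()` in this state receives pirq 33, while this vector's next `request_irq()` would still try to bind the stale cached value.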
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-09 15:42 ` [Xen-devel] " Dan Streetman
@ 2017-01-09 15:59   ` Konrad Rzeszutek Wilk
  2017-01-09 19:30     ` Stefano Stabellini
  0 siblings, 1 reply; 34+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-01-09 15:59 UTC (permalink / raw)
  To: Dan Streetman, stefano
  Cc: Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel,
	linux-kernel, linux-pci

On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
> On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
> > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
> >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
> >>> Do not read a pci device's msi message data to see if a pirq was
> >>> previously configured for the device's msi/msix, as the old pirq was
> >>> unmapped and may now be in use by another pci device. The previous
> >>> pirq should never be re-used; instead a new pirq should always be
> >>> allocated from the hypervisor.
> >> Won't this cause a starvation problem? That is, we will run out of PIRQs
> >> as we are not reusing them?
> >
> > Don't we free the pirq when we unmap it?
> 
> I think this is actually a bit worse than I initially thought. After
> looking a bit closer, I think there's an asymmetry with pirq
> allocation:

Let's include Stefano.

Thank you for digging into this! This has quite the deja-vu
feeling as I believe I hit this at some point in the past and
posted some possible ways of fixing this. But sadly I can't
find the thread.

> 
> tl;dr:
> 
> pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq
> allocated, and reserved in the hypervisor
> 
> request_irq() -> an event channel is opened for the specific pirq, and
> maps the pirq with the hypervisor
> 
> free_irq() -> the event channel is closed, and the pirq is unmapped,
> but that unmap function also frees the pirq! The hypervisor can/will
> give it away to the next call to get_free_pirq. However, the pci
> msi/msix data area still contains the pirq number, and the next call
> to request_irq() will re-use the pirq.
> 
> pci_disable_msix() -> this has no effect on the pirq in the hypervisor
> (it's already been freed), and it also doesn't clear anything from the
> msi/msix data area, so the pirq is still cached there.
> 
> It seems like the hypervisor needs to be fixed to *not* unmap the pirq
> when the event channel is closed - or at least, not to change it to
> IRQ_UNBOUND state? And, the pci_disable_msix call should eventually
> call into something in the Xen guest kernel that actually does the
> pirq unmapping, and clear it from the msi data area (i.e.
> pci_write_msi_msg)

The problem is that Xen changes have sailed a long long time ago.

> 
> Alternately, if the hypervisor doesn't change, then the call into the
> hypervisor to actually allocate a pirq needs to move from the
> pci_enable_msix_range() call to the request_irq() call? So that when
> the msi/msix range is enabled, it doesn't actually reserve any pirq's
> for any of the vectors; each request_irq/free_irq pair do the pirq
> allocate-and-map/unmap...

Or a third one: We keep a pirq->device lookup and inhibit free_irq()
from actually calling evtchn_close() until the pci_disable_msix() is done?

> 
> longer details:
> 
> The chain of function calls starts in the initial call to configure
> the msi vectors, which eventually calls __pci_enable_msix_range (or
> msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which
> either tries to re-use any cached pirq in the MSI data area, or (for
> the first time setup) allocates a new pirq from the hypervisor via
> PHYSDEVOP_get_free_pirq. That pirq is then reserved from the
> hypervisor's perspective, but it's not mapped to anything in the guest
> kernel.
> > Then, the driver calls request_irq to actually start using the irq, > which calls __setup_irq to irq_startup to startup_pirq. The > startup_pirq call actually creates the evtchn and binds it to the > previously allocated pirq via EVTCHNOP_bind_pirq. > > At this point, the pirq is bound to a guest kernel evtchn (and irq) > and is in use. But then, when the driver doesn't want it anymore, it > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that > function closes the evtchn via EVTCHNOP_close. > > Inside the hypervisor, in xen/common/event_channel.c in > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq > channel is) then it unmaps the pirq mapping via > unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back > to state IRQ_UNBOUND, which makes it available for the hypervisor to > give away to anyone requesting a new pirq! > > > > > > > > > -boris > > > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's > >>> msi descriptor message data for each msi/msix vector it sets up, and if > >>> it finds the vector was previously configured with a pirq, and that pirq > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq > >>> from the hypervisor. However, that pirq was unmapped when the pci device > >>> disabled its msi/msix, and it cannot be re-used; it may have been given > >>> to a different pci device. > >> Hm, but this implies that we do keep track of it. > >> > >> > >> while (true) > >> do > >> rmmod nvme > >> modprobe nvme > >> done > >> > >> Things go boom without this patch. But with this patch does this > >> still work? As in we don't run out of PIRQs? > >> > >> Thanks. > >>> This exact situation is happening in a Xen guest where multiple NVMe > >>> controllers (pci devices) are present. 
The NVMe driver configures each > >>> pci device's msi/msix twice; first to configure a single vector (to > >>> talk to the controller for its configuration info), and then it disables > >>> that msi/msix and re-configures with all the msi/msix it needs. When > >>> multiple NVMe controllers are present, this happens concurrently on all > >>> of them, and in the time between controller A calling pci_disable_msix() > >>> and then calling pci_enable_msix_range(), controller B enables its msix > >>> and gets controller A's pirq allocated from the hypervisor. Then when > >>> controller A re-configures its msix, its first vector tries to re-use > >>> the same pirq that it had before; but that pirq was allocated to > >>> controller B, and thus the Xen event channel for controller A's re-used > >>> pirq fails to map its irq to that pirq; the hypervisor already has the > >>> pirq mapped elsewhere. > >>> > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> > >>> --- > >>> arch/x86/pci/xen.c | 23 +++++++---------------- > >>> 1 file changed, 7 insertions(+), 16 deletions(-) > >>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c > >>> index bedfab9..a00a6c0 100644 > >>> --- a/arch/x86/pci/xen.c > >>> +++ b/arch/x86/pci/xen.c > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) > >>> return 1; > >>> > >>> for_each_pci_msi_entry(msidesc, dev) { > >>> - __pci_read_msi_msg(msidesc, &msg); > >>> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | > >>> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); > >>> - if (msg.data != XEN_PIRQ_MSI_DATA || > >>> - xen_irq_from_pirq(pirq) < 0) { > >>> - pirq = xen_allocate_pirq_msi(dev, msidesc); > >>> - if (pirq < 0) { > >>> - irq = -ENODEV; > >>> - goto error; > >>> - } > >>> - xen_msi_compose_msg(dev, pirq, &msg); > >>> - __pci_write_msi_msg(msidesc, &msg); > >>> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); > >>> - } else { > >>> - dev_dbg(&dev->dev, > 
>>> - "xen: msi already bound to pirq=%d\n", pirq); > >>> + pirq = xen_allocate_pirq_msi(dev, msidesc); > >>> + if (pirq < 0) { > >>> + irq = -ENODEV; > >>> + goto error; > >>> } > >>> + xen_msi_compose_msg(dev, pirq, &msg); > >>> + __pci_write_msi_msg(msidesc, &msg); > >>> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); > >>> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, > >>> (type == PCI_CAP_ID_MSI) ? nvec : 1, > >>> (type == PCI_CAP_ID_MSIX) ? > >>> -- > >>> 2.9.3 > >>> > > > > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xen.org > > https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 34+ messages in thread
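Konrad's "third option" above (keep a pirq->device association and only release the pirq at pci_disable_msix() time, not at free_irq() time) can likewise be sketched with a toy model (Python; purely illustrative names, not real Xen or kernel interfaces):

```python
# Sketch of the deferred-free idea: free_irq() unbinds the evtchn but
# keeps the pirq reserved; only pci_disable_msix() returns it to Xen.
# All names are invented for illustration; this is not kernel code.

class Hyp:
    def __init__(self):
        self.free = [70, 71]      # pool of free pirqs
        self.mapped = {}          # pirq -> owning device

    def get_free_pirq(self):
        return self.free.pop(0)

    def bind(self, pirq, dev):
        assert pirq not in self.mapped
        self.mapped[pirq] = dev

    def unmap(self, pirq, free_it):
        self.mapped.pop(pirq, None)
        if free_it:               # only true on the pci_disable_msix path
            self.free.insert(0, pirq)

hyp = Hyp()

# Device A with the deferred-free behaviour:
a = hyp.get_free_pirq()
hyp.bind(a, "A")
hyp.unmap(a, free_it=False)      # free_irq: mask/unbind, pirq stays reserved

# Device B enabling msix in the window can no longer steal A's pirq.
b = hyp.get_free_pirq()
assert b != a

hyp.bind(a, "A")                 # A's re-enable re-uses its cached pirq safely
hyp.unmap(a, free_it=True)       # pci_disable_msix finally frees it
print(hyp.free)                  # A's pirq is back in the pool only now
```

In this variant, re-using the pirq cached in the MSI message data is safe, because nothing else can be handed that pirq between free_irq() and pci_disable_msix().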
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data
  2017-01-09 15:59 ` Konrad Rzeszutek Wilk
@ 2017-01-09 19:30   ` Stefano Stabellini
  2017-01-10 15:57     ` Dan Streetman
  0 siblings, 1 reply; 34+ messages in thread
From: Stefano Stabellini @ 2017-01-09 19:30 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Dan Streetman, stefano, Boris Ostrovsky, Dan Streetman,
	Bjorn Helgaas, xen-devel, linux-kernel, linux-pci

On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
> > <boris.ostrovsky@oracle.com> wrote:
> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
> > >>> Do not read a pci device's msi message data to see if a pirq was
> > >>> previously configured for the device's msi/msix, as the old pirq was
> > >>> unmapped and may now be in use by another pci device. The previous
> > >>> pirq should never be re-used; instead a new pirq should always be
> > >>> allocated from the hypervisor.
> > >> Won't this cause a starvation problem? That is, we will run out of PIRQs
> > >> as we are not reusing them?
> > >
> > > Don't we free the pirq when we unmap it?
> > 
> > I think this is actually a bit worse than I initially thought. After
> > looking a bit closer, I think there's an asymmetry with pirq
> > allocation:
> 
> Let's include Stefano.
> 
> Thank you for digging into this! This has quite the deja-vu
> feeling as I believe I hit this at some point in the past and
> posted some possible ways of fixing this. But sadly I can't
> find the thread.

This issue seems to be caused by:

commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Date:   Wed Dec 1 14:51:44 2010 +0000

    xen: fix MSI setup and teardown for PV on HVM guests

which was a fix to a bug:

    This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
    trying to enable the same MSI for the second time: the old MSI to pirq
    mapping is still valid at this point but xen_hvm_setup_msi_irqs would
    try to assign a new pirq anyway.
    A simple way to reproduce this bug is to assign an MSI capable network
    card to a PV on HVM guest, if the user brings down the corresponding
    ethernet interface and up again, Linux would fail to enable MSIs on the
    device.

I don't remember any of the details. From the description of this bug,
it seems that Xen changed behavior in the past few years: before it used
to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
old MSI to pirq mapping is still valid at this point", the pirq couldn't
have been completely freed, then reassigned to somebody else the way it
is described in this email.

I think we should identify the changeset or Xen version that introduced
the new behavior. If it is old enough, we might be able to just revert
af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
behavior conditional to the Xen version.

> > tl;dr:
> > 
> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq
> > allocated, and reserved in the hypervisor
> > 
> > request_irq() -> an event channel is opened for the specific pirq, and
> > maps the pirq with the hypervisor
> > 
> > free_irq() -> the event channel is closed, and the pirq is unmapped,
> > but that unmap function also frees the pirq! The hypervisor can/will
> > give it away to the next call to get_free_pirq. However, the pci
> > msi/msix data area still contains the pirq number, and the next call
> > to request_irq() will re-use the pirq.
> > 
> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor
> > (it's already been freed), and it also doesn't clear anything from the
> > msi/msix data area, so the pirq is still cached there.
> > 
> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq
> > when the event channel is closed - or at least, not to change it to
> > IRQ_UNBOUND state? And, the pci_disable_msix call should eventually
> > call into something in the Xen guest kernel that actually does the
> > pirq unmapping, and clear it from the msi data area (i.e.
> > pci_write_msi_msg)
> 
> The problem is that Xen changes have sailed a long long time ago.
> 
> > Alternately, if the hypervisor doesn't change, then the call into the
> > hypervisor to actually allocate a pirq needs to move from the
> > pci_enable_msix_range() call to the request_irq() call? So that when
> > the msi/msix range is enabled, it doesn't actually reserve any pirq's
> > for any of the vectors; each request_irq/free_irq pair do the pirq
> > allocate-and-map/unmap...
> 
> Or a third one: We keep a pirq->device lookup and inhibit free_irq()
> from actually calling evtchn_close() until the pci_disable_msix() is done?

I think that's a reasonable alternative: we mask the evtchn, but do not
call xen_evtchn_close in shutdown_pirq for PV on HVM guests. Something
like (not compiled, untested):

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 137bd0e..3174923 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data)
 		return;
 
 	mask_evtchn(evtchn);
-	xen_evtchn_close(evtchn);
+	if (!xen_hvm_domain())
+		xen_evtchn_close(evtchn);
 	xen_irq_info_cleanup(info);
 }

We want to keep the pirq allocated to the device - not closing the
evtchn seems like the right thing to do.
I suggest we test for the original bug too: enable/disable the network interface of an MSI capable network card. > > longer details: > > > > The chain of function calls starts in the initial call to configure > > the msi vectors, which eventually calls __pci_enable_msix_range (or > > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which > > either tries to re-use any cached pirq in the MSI data area, or (for > > the first time setup) allocates a new pirq from the hypervisor via > > PHYSDEVOP_get_free_pirq. That pirq is then reserved from the > > hypervisor's perspective, but it's not mapped to anything in the guest > > kernel. > > > > Then, the driver calls request_irq to actually start using the irq, > > which calls __setup_irq to irq_startup to startup_pirq. The > > startup_pirq call actually creates the evtchn and binds it to the > > previously allocated pirq via EVTCHNOP_bind_pirq. > > > > At this point, the pirq is bound to a guest kernel evtchn (and irq) > > and is in use. But then, when the driver doesn't want it anymore, it > > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that > > function closes the evtchn via EVTCHNOP_close. > > > > Inside the hypervisor, in xen/common/event_channel.c in > > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq > > channel is) then it unmaps the pirq mapping via > > unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back > > to state IRQ_UNBOUND, which makes it available for the hypervisor to > > give away to anyone requesting a new pirq! > > > > > > > > > > > > > > > > -boris > > > > > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's > > >>> msi descriptor message data for each msi/msix vector it sets up, and if > > >>> it finds the vector was previously configured with a pirq, and that pirq > > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq > > >>> from the hypervisor. 
However, that pirq was unmapped when the pci device > > >>> disabled its msi/msix, and it cannot be re-used; it may have been given > > >>> to a different pci device. > > >> Hm, but this implies that we do keep track of it. > > >> > > >> > > >> while (true) > > >> do > > >> rmmod nvme > > >> modprobe nvme > > >> done > > >> > > >> Things go boom without this patch. But with this patch does this > > >> still work? As in we don't run out of PIRQs? > > >> > > >> Thanks. > > >>> This exact situation is happening in a Xen guest where multiple NVMe > > >>> controllers (pci devices) are present. The NVMe driver configures each > > >>> pci device's msi/msix twice; first to configure a single vector (to > > >>> talk to the controller for its configuration info), and then it disables > > >>> that msi/msix and re-configures with all the msi/msix it needs. When > > >>> multiple NVMe controllers are present, this happens concurrently on all > > >>> of them, and in the time between controller A calling pci_disable_msix() > > >>> and then calling pci_enable_msix_range(), controller B enables its msix > > >>> and gets controller A's pirq allocated from the hypervisor. Then when > > >>> controller A re-configures its msix, its first vector tries to re-use > > >>> the same pirq that it had before; but that pirq was allocated to > > >>> controller B, and thus the Xen event channel for controller A's re-used > > >>> pirq fails to map its irq to that pirq; the hypervisor already has the > > >>> pirq mapped elsewhere. 
> > >>> > > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> > > >>> --- > > >>> arch/x86/pci/xen.c | 23 +++++++---------------- > > >>> 1 file changed, 7 insertions(+), 16 deletions(-) > > >>> > > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c > > >>> index bedfab9..a00a6c0 100644 > > >>> --- a/arch/x86/pci/xen.c > > >>> +++ b/arch/x86/pci/xen.c > > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) > > >>> return 1; > > >>> > > >>> for_each_pci_msi_entry(msidesc, dev) { > > >>> - __pci_read_msi_msg(msidesc, &msg); > > >>> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | > > >>> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); > > >>> - if (msg.data != XEN_PIRQ_MSI_DATA || > > >>> - xen_irq_from_pirq(pirq) < 0) { > > >>> - pirq = xen_allocate_pirq_msi(dev, msidesc); > > >>> - if (pirq < 0) { > > >>> - irq = -ENODEV; > > >>> - goto error; > > >>> - } > > >>> - xen_msi_compose_msg(dev, pirq, &msg); > > >>> - __pci_write_msi_msg(msidesc, &msg); > > >>> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); > > >>> - } else { > > >>> - dev_dbg(&dev->dev, > > >>> - "xen: msi already bound to pirq=%d\n", pirq); > > >>> + pirq = xen_allocate_pirq_msi(dev, msidesc); > > >>> + if (pirq < 0) { > > >>> + irq = -ENODEV; > > >>> + goto error; > > >>> } > > >>> + xen_msi_compose_msg(dev, pirq, &msg); > > >>> + __pci_write_msi_msg(msidesc, &msg); > > >>> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); > > >>> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, > > >>> (type == PCI_CAP_ID_MSI) ? nvec : 1, > > >>> (type == PCI_CAP_ID_MSIX) ? > > >>> -- > > >>> 2.9.3 > > >>> > > > > > > > > > _______________________________________________ > > > Xen-devel mailing list > > > Xen-devel@lists.xen.org > > > https://lists.xen.org/xen-devel > ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-09 19:30 ` Stefano Stabellini @ 2017-01-10 15:57 ` Dan Streetman 2017-01-10 18:41 ` Dan Streetman 0 siblings, 1 reply; 34+ messages in thread From: Dan Streetman @ 2017-01-10 15:57 UTC (permalink / raw) To: Stefano Stabellini Cc: Konrad Rzeszutek Wilk, stefano, Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini <sstabellini@kernel.org> wrote: > On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: >> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: >> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky >> > <boris.ostrovsky@oracle.com> wrote: >> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: >> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: >> > >>> Do not read a pci device's msi message data to see if a pirq was >> > >>> previously configured for the device's msi/msix, as the old pirq was >> > >>> unmapped and may now be in use by another pci device. The previous >> > >>> pirq should never be re-used; instead a new pirq should always be >> > >>> allocated from the hypervisor. >> > >> Won't this cause a starvation problem? That is we will run out of PIRQs >> > >> as we are not reusing them? >> > > >> > > Don't we free the pirq when we unmap it? >> > >> > I think this is actually a bit worse than I initially thought. After >> > looking a bit closer, and I think there's an asymmetry with pirq >> > allocation: >> >> Lets include Stefano, >> >> Thank you for digging in this! This has quite the deja-vu >> feeling as I believe I hit this at some point in the past and >> posted some possible ways of fixing this. But sadly I can't >> find the thread. 
> > This issue seems to be caused by: > > commit af42b8d12f8adec6711cb824549a0edac6a4ae8f > Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> > Date: Wed Dec 1 14:51:44 2010 +0000 > > xen: fix MSI setup and teardown for PV on HVM guests > > which was a fix to a bug: > > This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when > trying to enable the same MSI for the second time: the old MSI to pirq > mapping is still valid at this point but xen_hvm_setup_msi_irqs would > try to assign a new pirq anyway. > A simple way to reproduce this bug is to assign an MSI capable network > card to a PV on HVM guest, if the user brings down the corresponding > ethernet interface and up again, Linux would fail to enable MSIs on the > device. > > I don't remember any of the details. From the description of this bug, > it seems that Xen changed behavior in the past few years: before it used > to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the > old MSI to pirq mapping is still valid at this point", the pirq couldn't > have been completely freed, then reassigned to somebody else the way it > is described in this email. > > I think we should indentify the changeset or Xen version that introduced > the new behavior. If it is old enough, we might be able to just revert > af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the > behavior conditional to the Xen version. Are PT devices the only MSI-capable devices available in a Xen guest? That's where I'm seeing this problem, with PT NVMe devices. I can say that on the Xen guest with NVMe PT devices I'm testing on, with the patch from this thread (which essentially reverts your commit above) as well as some added debug to see the pirq numbers, cycles of 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the hypervisor provides essentially the same pirqs for each modprobe, since they were freed by the rmmod. 
> > >> > tl;dr: >> > >> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq >> > allocated, and reserved in the hypervisor >> > >> > request_irq() -> an event channel is opened for the specific pirq, and >> > maps the pirq with the hypervisor >> > >> > free_irq() -> the event channel is closed, and the pirq is unmapped, >> > but that unmap function also frees the pirq! The hypervisor can/will >> > give it away to the next call to get_free_pirq. However, the pci >> > msi/msix data area still contains the pirq number, and the next call >> > to request_irq() will re-use the pirq. >> > >> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor >> > (it's already been freed), and it also doesn't clear anything from the >> > msi/msix data area, so the pirq is still cached there. >> > >> > >> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq >> > when the event channel is closed - or at least, not to change it to >> > IRQ_UNBOUND state? And, the pci_disable_msix call should eventually >> > call into something in the Xen guest kernel that actually does the >> > pirq unmapping, and clear it from the msi data area (i.e. >> > pci_write_msi_msg) >> >> The problem is that Xen changes have sailed a long long time ago. >> > >> > Alternately, if the hypervisor doesn't change, then the call into the >> > hypervisor to actually allocate a pirq needs to move from the >> > pci_enable_msix_range() call to the request_irq() call? So that when >> > the msi/msix range is enabled, it doesn't actually reserve any pirq's >> > for any of the vectors; each request_irq/free_irq pair do the pirq >> > allocate-and-map/unmap... >> >> >> Or a third one: We keep an pirq->device lookup and inhibit free_irq() >> from actually calling evtchn_close() until the pci_disable_msix() is done? > > I think that's a reasonable alternative: we mask the evtchn, but do not > call xen_evtchn_close in shutdown_pirq for PV on HVM guests. 
> Something like (not compiled, untested):
>
> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
> index 137bd0e..3174923 100644
> --- a/drivers/xen/events/events_base.c
> +++ b/drivers/xen/events/events_base.c
> @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data)
>  		return;
>  
>  	mask_evtchn(evtchn);
> -	xen_evtchn_close(evtchn);
> +	if (!xen_hvm_domain())
> +		xen_evtchn_close(evtchn);

wouldn't we need to also add code to the pci_disable_msix path that
would actually close the evtchn?  Would this leave the evtchn around
forever?

>  	xen_irq_info_cleanup(info);
>  }
>
>
> We want to keep the pirq allocated to the device - not closing the
> evtchn seems like the right thing to do. I suggest we test for the
> original bug too: enable/disable the network interface of an MSI capable
> network card.

I don't have a Xen hypervisor setup myself, I'm just using AWS, but
I'll try to test this on an instance with a SRIOV/PT nic.

>
>
>> > longer details:
>> >
>> > The chain of function calls starts in the initial call to configure
>> > the msi vectors, which eventually calls __pci_enable_msix_range (or
>> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which
>> > either tries to re-use any cached pirq in the MSI data area, or (for
>> > the first time setup) allocates a new pirq from the hypervisor via
>> > PHYSDEVOP_get_free_pirq.  That pirq is then reserved from the
>> > hypervisor's perspective, but it's not mapped to anything in the guest
>> > kernel.
>> >
>> > Then, the driver calls request_irq to actually start using the irq,
>> > which calls __setup_irq to irq_startup to startup_pirq.  The
>> > startup_pirq call actually creates the evtchn and binds it to the
>> > previously allocated pirq via EVTCHNOP_bind_pirq.
>> >
>> > At this point, the pirq is bound to a guest kernel evtchn (and irq)
>> > and is in use.
But then, when the driver doesn't want it anymore, it >> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that >> > function closes the evtchn via EVTCHNOP_close. >> > >> > Inside the hypervisor, in xen/common/event_channel.c in >> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq >> > channel is) then it unmaps the pirq mapping via >> > unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back >> > to state IRQ_UNBOUND, which makes it available for the hypervisor to >> > give away to anyone requesting a new pirq! >> > >> > >> > >> > >> > >> > > >> > > -boris >> > > >> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's >> > >>> msi descriptor message data for each msi/msix vector it sets up, and if >> > >>> it finds the vector was previously configured with a pirq, and that pirq >> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq >> > >>> from the hypervisor. However, that pirq was unmapped when the pci device >> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given >> > >>> to a different pci device. >> > >> Hm, but this implies that we do keep track of it. >> > >> >> > >> >> > >> while (true) >> > >> do >> > >> rmmod nvme >> > >> modprobe nvme >> > >> done >> > >> >> > >> Things go boom without this patch. But with this patch does this >> > >> still work? As in we don't run out of PIRQs? >> > >> >> > >> Thanks. >> > >>> This exact situation is happening in a Xen guest where multiple NVMe >> > >>> controllers (pci devices) are present. The NVMe driver configures each >> > >>> pci device's msi/msix twice; first to configure a single vector (to >> > >>> talk to the controller for its configuration info), and then it disables >> > >>> that msi/msix and re-configures with all the msi/msix it needs. 
When >> > >>> multiple NVMe controllers are present, this happens concurrently on all >> > >>> of them, and in the time between controller A calling pci_disable_msix() >> > >>> and then calling pci_enable_msix_range(), controller B enables its msix >> > >>> and gets controller A's pirq allocated from the hypervisor. Then when >> > >>> controller A re-configures its msix, its first vector tries to re-use >> > >>> the same pirq that it had before; but that pirq was allocated to >> > >>> controller B, and thus the Xen event channel for controller A's re-used >> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the >> > >>> pirq mapped elsewhere. >> > >>> >> > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> >> > >>> --- >> > >>> arch/x86/pci/xen.c | 23 +++++++---------------- >> > >>> 1 file changed, 7 insertions(+), 16 deletions(-) >> > >>> >> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c >> > >>> index bedfab9..a00a6c0 100644 >> > >>> --- a/arch/x86/pci/xen.c >> > >>> +++ b/arch/x86/pci/xen.c >> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) >> > >>> return 1; >> > >>> >> > >>> for_each_pci_msi_entry(msidesc, dev) { >> > >>> - __pci_read_msi_msg(msidesc, &msg); >> > >>> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | >> > >>> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); >> > >>> - if (msg.data != XEN_PIRQ_MSI_DATA || >> > >>> - xen_irq_from_pirq(pirq) < 0) { >> > >>> - pirq = xen_allocate_pirq_msi(dev, msidesc); >> > >>> - if (pirq < 0) { >> > >>> - irq = -ENODEV; >> > >>> - goto error; >> > >>> - } >> > >>> - xen_msi_compose_msg(dev, pirq, &msg); >> > >>> - __pci_write_msi_msg(msidesc, &msg); >> > >>> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >> > >>> - } else { >> > >>> - dev_dbg(&dev->dev, >> > >>> - "xen: msi already bound to pirq=%d\n", pirq); >> > >>> + pirq = xen_allocate_pirq_msi(dev, msidesc); >> > >>> + if (pirq < 0) { >> > 
>>> + irq = -ENODEV; >> > >>> + goto error; >> > >>> } >> > >>> + xen_msi_compose_msg(dev, pirq, &msg); >> > >>> + __pci_write_msi_msg(msidesc, &msg); >> > >>> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >> > >>> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, >> > >>> (type == PCI_CAP_ID_MSI) ? nvec : 1, >> > >>> (type == PCI_CAP_ID_MSIX) ? >> > >>> -- >> > >>> 2.9.3 >> > >>> >> > > >> > > >> > > _______________________________________________ >> > > Xen-devel mailing list >> > > Xen-devel@lists.xen.org >> > > https://lists.xen.org/xen-devel >> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-10 15:57 ` Dan Streetman @ 2017-01-10 18:41 ` Dan Streetman 2017-01-10 19:03 ` Stefano Stabellini 0 siblings, 1 reply; 34+ messages in thread From: Dan Streetman @ 2017-01-10 18:41 UTC (permalink / raw) To: Stefano Stabellini Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini > <sstabellini@kernel.org> wrote: >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky >>> > <boris.ostrovsky@oracle.com> wrote: >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: >>> > >>> Do not read a pci device's msi message data to see if a pirq was >>> > >>> previously configured for the device's msi/msix, as the old pirq was >>> > >>> unmapped and may now be in use by another pci device. The previous >>> > >>> pirq should never be re-used; instead a new pirq should always be >>> > >>> allocated from the hypervisor. >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs >>> > >> as we are not reusing them? >>> > > >>> > > Don't we free the pirq when we unmap it? >>> > >>> > I think this is actually a bit worse than I initially thought. After >>> > looking a bit closer, and I think there's an asymmetry with pirq >>> > allocation: >>> >>> Lets include Stefano, >>> >>> Thank you for digging in this! This has quite the deja-vu >>> feeling as I believe I hit this at some point in the past and >>> posted some possible ways of fixing this. But sadly I can't >>> find the thread. 
>> >> This issue seems to be caused by: >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> >> Date: Wed Dec 1 14:51:44 2010 +0000 >> >> xen: fix MSI setup and teardown for PV on HVM guests >> >> which was a fix to a bug: >> >> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when >> trying to enable the same MSI for the second time: the old MSI to pirq >> mapping is still valid at this point but xen_hvm_setup_msi_irqs would >> try to assign a new pirq anyway. >> A simple way to reproduce this bug is to assign an MSI capable network >> card to a PV on HVM guest, if the user brings down the corresponding >> ethernet interface and up again, Linux would fail to enable MSIs on the >> device. >> >> I don't remember any of the details. From the description of this bug, >> it seems that Xen changed behavior in the past few years: before it used >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the >> old MSI to pirq mapping is still valid at this point", the pirq couldn't >> have been completely freed, then reassigned to somebody else the way it >> is described in this email. >> >> I think we should indentify the changeset or Xen version that introduced >> the new behavior. If it is old enough, we might be able to just revert >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the >> behavior conditional to the Xen version. > > Are PT devices the only MSI-capable devices available in a Xen guest? > That's where I'm seeing this problem, with PT NVMe devices. > > I can say that on the Xen guest with NVMe PT devices I'm testing on, > with the patch from this thread (which essentially reverts your commit > above) as well as some added debug to see the pirq numbers, cycles of > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the > hypervisor provides essentially the same pirqs for each modprobe, > since they were freed by the rmmod. 
> >> >> >>> > tl;dr: >>> > >>> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq >>> > allocated, and reserved in the hypervisor >>> > >>> > request_irq() -> an event channel is opened for the specific pirq, and >>> > maps the pirq with the hypervisor >>> > >>> > free_irq() -> the event channel is closed, and the pirq is unmapped, >>> > but that unmap function also frees the pirq! The hypervisor can/will >>> > give it away to the next call to get_free_pirq. However, the pci >>> > msi/msix data area still contains the pirq number, and the next call >>> > to request_irq() will re-use the pirq. >>> > >>> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor >>> > (it's already been freed), and it also doesn't clear anything from the >>> > msi/msix data area, so the pirq is still cached there. >>> > >>> > >>> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq >>> > when the event channel is closed - or at least, not to change it to >>> > IRQ_UNBOUND state? And, the pci_disable_msix call should eventually >>> > call into something in the Xen guest kernel that actually does the >>> > pirq unmapping, and clear it from the msi data area (i.e. >>> > pci_write_msi_msg) >>> >>> The problem is that Xen changes have sailed a long long time ago. >>> > >>> > Alternately, if the hypervisor doesn't change, then the call into the >>> > hypervisor to actually allocate a pirq needs to move from the >>> > pci_enable_msix_range() call to the request_irq() call? So that when >>> > the msi/msix range is enabled, it doesn't actually reserve any pirq's >>> > for any of the vectors; each request_irq/free_irq pair do the pirq >>> > allocate-and-map/unmap... >>> >>> >>> Or a third one: We keep an pirq->device lookup and inhibit free_irq() >>> from actually calling evtchn_close() until the pci_disable_msix() is done? 
>>
>> I think that's a reasonable alternative: we mask the evtchn, but do not
>> call xen_evtchn_close in shutdown_pirq for PV on HVM guests.
>> Something like (not compiled, untested):
>>
>> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
>> index 137bd0e..3174923 100644
>> --- a/drivers/xen/events/events_base.c
>> +++ b/drivers/xen/events/events_base.c
>> @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data)
>>  		return;
>>  
>>  	mask_evtchn(evtchn);
>> -	xen_evtchn_close(evtchn);
>> +	if (!xen_hvm_domain())
>> +		xen_evtchn_close(evtchn);
>
> wouldn't we need to also add code to the pci_disable_msix path that
> would actually close the evtchn?  Would this leave the evtchn around
> forever?
>
>>  	xen_irq_info_cleanup(info);
>>  }
>>
>>
>> We want to keep the pirq allocated to the device - not closing the
>> evtchn seems like the right thing to do. I suggest we test for the
>> original bug too: enable/disable the network interface of an MSI capable
>> network card.
>
> I don't have a Xen hypervisor setup myself, I'm just using AWS, but
> I'll try to test this on an instance with a SRIOV/PT nic.

I started an instance with SRIOV nics, with the latest upstream kernel
plus this patch and debug to show the pirqs.  Taking an interface down
and back up uses the same pirq, because the driver only does
free_irq/request_irq, which just closes/reopens the evtchn using the
same pirq (which is cached in the struct irq_info) - it doesn't
disable/reenable the MSIX vectors.

Doing a complete rmmod/modprobe of the driver does disable and reenable
the MSIX vectors, but the hypervisor provides the same pirqs from the
get_free_pirq() call that were used before (since nothing else is asking
for free pirqs).
> >> >> >>> > longer details: >>> > >>> > The chain of function calls starts in the initial call to configure >>> > the msi vectors, which eventually calls __pci_enable_msix_range (or >>> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which >>> > either tries to re-use any cached pirq in the MSI data area, or (for >>> > the first time setup) allocates a new pirq from the hypervisor via >>> > PHYSDEVOP_get_free_pirq. That pirq is then reserved from the >>> > hypervisor's perspective, but it's not mapped to anything in the guest >>> > kernel. >>> > >>> > Then, the driver calls request_irq to actually start using the irq, >>> > which calls __setup_irq to irq_startup to startup_pirq. The >>> > startup_pirq call actually creates the evtchn and binds it to the >>> > previously allocated pirq via EVTCHNOP_bind_pirq. >>> > >>> > At this point, the pirq is bound to a guest kernel evtchn (and irq) >>> > and is in use. But then, when the driver doesn't want it anymore, it >>> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that >>> > function closes the evtchn via EVTCHNOP_close. >>> > >>> > Inside the hypervisor, in xen/common/event_channel.c in >>> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq >>> > channel is) then it unmaps the pirq mapping via >>> > unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back >>> > to state IRQ_UNBOUND, which makes it available for the hypervisor to >>> > give away to anyone requesting a new pirq! >>> > >>> > >>> > >>> > >>> > >>> > > >>> > > -boris >>> > > >>> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's >>> > >>> msi descriptor message data for each msi/msix vector it sets up, and if >>> > >>> it finds the vector was previously configured with a pirq, and that pirq >>> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq >>> > >>> from the hypervisor. 
However, that pirq was unmapped when the pci device >>> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given >>> > >>> to a different pci device. >>> > >> Hm, but this implies that we do keep track of it. >>> > >> >>> > >> >>> > >> while (true) >>> > >> do >>> > >> rmmod nvme >>> > >> modprobe nvme >>> > >> done >>> > >> >>> > >> Things go boom without this patch. But with this patch does this >>> > >> still work? As in we don't run out of PIRQs? >>> > >> >>> > >> Thanks. >>> > >>> This exact situation is happening in a Xen guest where multiple NVMe >>> > >>> controllers (pci devices) are present. The NVMe driver configures each >>> > >>> pci device's msi/msix twice; first to configure a single vector (to >>> > >>> talk to the controller for its configuration info), and then it disables >>> > >>> that msi/msix and re-configures with all the msi/msix it needs. When >>> > >>> multiple NVMe controllers are present, this happens concurrently on all >>> > >>> of them, and in the time between controller A calling pci_disable_msix() >>> > >>> and then calling pci_enable_msix_range(), controller B enables its msix >>> > >>> and gets controller A's pirq allocated from the hypervisor. Then when >>> > >>> controller A re-configures its msix, its first vector tries to re-use >>> > >>> the same pirq that it had before; but that pirq was allocated to >>> > >>> controller B, and thus the Xen event channel for controller A's re-used >>> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the >>> > >>> pirq mapped elsewhere. 
>>> > >>> >>> > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> >>> > >>> --- >>> > >>> arch/x86/pci/xen.c | 23 +++++++---------------- >>> > >>> 1 file changed, 7 insertions(+), 16 deletions(-) >>> > >>> >>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c >>> > >>> index bedfab9..a00a6c0 100644 >>> > >>> --- a/arch/x86/pci/xen.c >>> > >>> +++ b/arch/x86/pci/xen.c >>> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) >>> > >>> return 1; >>> > >>> >>> > >>> for_each_pci_msi_entry(msidesc, dev) { >>> > >>> - __pci_read_msi_msg(msidesc, &msg); >>> > >>> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | >>> > >>> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); >>> > >>> - if (msg.data != XEN_PIRQ_MSI_DATA || >>> > >>> - xen_irq_from_pirq(pirq) < 0) { >>> > >>> - pirq = xen_allocate_pirq_msi(dev, msidesc); >>> > >>> - if (pirq < 0) { >>> > >>> - irq = -ENODEV; >>> > >>> - goto error; >>> > >>> - } >>> > >>> - xen_msi_compose_msg(dev, pirq, &msg); >>> > >>> - __pci_write_msi_msg(msidesc, &msg); >>> > >>> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >>> > >>> - } else { >>> > >>> - dev_dbg(&dev->dev, >>> > >>> - "xen: msi already bound to pirq=%d\n", pirq); >>> > >>> + pirq = xen_allocate_pirq_msi(dev, msidesc); >>> > >>> + if (pirq < 0) { >>> > >>> + irq = -ENODEV; >>> > >>> + goto error; >>> > >>> } >>> > >>> + xen_msi_compose_msg(dev, pirq, &msg); >>> > >>> + __pci_write_msi_msg(msidesc, &msg); >>> > >>> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >>> > >>> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, >>> > >>> (type == PCI_CAP_ID_MSI) ? nvec : 1, >>> > >>> (type == PCI_CAP_ID_MSIX) ? 
>>> > >>> -- >>> > >>> 2.9.3 >>> > >>> >>> > > >>> > > >>> > > _______________________________________________ >>> > > Xen-devel mailing list >>> > > Xen-devel@lists.xen.org >>> > > https://lists.xen.org/xen-devel >>> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-10 18:41 ` Dan Streetman @ 2017-01-10 19:03 ` Stefano Stabellini 2017-01-10 21:32 ` Dan Streetman 0 siblings, 1 reply; 34+ messages in thread From: Stefano Stabellini @ 2017-01-10 19:03 UTC (permalink / raw) To: Dan Streetman Cc: Stefano Stabellini, Konrad Rzeszutek Wilk, Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On Tue, 10 Jan 2017, Dan Streetman wrote: > On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: > > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini > > <sstabellini@kernel.org> wrote: > >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: > >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: > >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky > >>> > <boris.ostrovsky@oracle.com> wrote: > >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: > >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: > >>> > >>> Do not read a pci device's msi message data to see if a pirq was > >>> > >>> previously configured for the device's msi/msix, as the old pirq was > >>> > >>> unmapped and may now be in use by another pci device. The previous > >>> > >>> pirq should never be re-used; instead a new pirq should always be > >>> > >>> allocated from the hypervisor. > >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs > >>> > >> as we are not reusing them? > >>> > > > >>> > > Don't we free the pirq when we unmap it? > >>> > > >>> > I think this is actually a bit worse than I initially thought. After > >>> > looking a bit closer, and I think there's an asymmetry with pirq > >>> > allocation: > >>> > >>> Lets include Stefano, > >>> > >>> Thank you for digging in this! This has quite the deja-vu > >>> feeling as I believe I hit this at some point in the past and > >>> posted some possible ways of fixing this. 
But sadly I can't > >>> find the thread. > >> > >> This issue seems to be caused by: > >> > >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f > >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> > >> Date: Wed Dec 1 14:51:44 2010 +0000 > >> > >> xen: fix MSI setup and teardown for PV on HVM guests > >> > >> which was a fix to a bug: > >> > >> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when > >> trying to enable the same MSI for the second time: the old MSI to pirq > >> mapping is still valid at this point but xen_hvm_setup_msi_irqs would > >> try to assign a new pirq anyway. > >> A simple way to reproduce this bug is to assign an MSI capable network > >> card to a PV on HVM guest, if the user brings down the corresponding > >> ethernet interface and up again, Linux would fail to enable MSIs on the > >> device. > >> > >> I don't remember any of the details. From the description of this bug, > >> it seems that Xen changed behavior in the past few years: before it used > >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the > >> old MSI to pirq mapping is still valid at this point", the pirq couldn't > >> have been completely freed, then reassigned to somebody else the way it > >> is described in this email. > >> > >> I think we should identify the changeset or Xen version that introduced > >> the new behavior. If it is old enough, we might be able to just revert > >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the > >> behavior conditional to the Xen version. > > > > Are PT devices the only MSI-capable devices available in a Xen guest? > > That's where I'm seeing this problem, with PT NVMe devices. They are the main ones. It is possible to have emulated virtio devices with emulated MSIs, for example virtio-net. They are not in the Xen Project CI-loop, though, so I wouldn't be surprised if they are broken too.
> > I can say that on the Xen guest with NVMe PT devices I'm testing on, > > with the patch from this thread (which essentially reverts your commit > > above) as well as some added debug to see the pirq numbers, cycles of > > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the > > hypervisor provides essentially the same pirqs for each modprobe, > > since they were freed by the rmmod. I am fine with reverting the old patch, but we need to understand what caused the change in behavior first. Maybe somebody else with a Xen PCI passthrough setup at hand can help with that. In the Xen code I can still see: case ECS_PIRQ: { struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq); if ( !pirq ) break; if ( !is_hvm_domain(d1) ) pirq_guest_unbind(d1, pirq); which means that pirq_guest_unbind should only be called on evtchn_close if the guest is not an HVM guest. > >>> > tl;dr: > >>> > > >>> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq > >>> > allocated, and reserved in the hypervisor > >>> > > >>> > request_irq() -> an event channel is opened for the specific pirq, and > >>> > maps the pirq with the hypervisor > >>> > > >>> > free_irq() -> the event channel is closed, and the pirq is unmapped, > >>> > but that unmap function also frees the pirq! The hypervisor can/will > >>> > give it away to the next call to get_free_pirq. However, the pci > >>> > msi/msix data area still contains the pirq number, and the next call > >>> > to request_irq() will re-use the pirq. > >>> > > >>> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor > >>> > (it's already been freed), and it also doesn't clear anything from the > >>> > msi/msix data area, so the pirq is still cached there. > >>> > > >>> > > >>> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq > >>> > when the event channel is closed - or at least, not to change it to > >>> > IRQ_UNBOUND state? 
And, the pci_disable_msix call should eventually > >>> > call into something in the Xen guest kernel that actually does the > >>> > pirq unmapping, and clear it from the msi data area (i.e. > >>> > pci_write_msi_msg) > >>> > >>> The problem is that Xen changes have sailed a long long time ago. > >>> > > >>> > Alternately, if the hypervisor doesn't change, then the call into the > >>> > hypervisor to actually allocate a pirq needs to move from the > >>> > pci_enable_msix_range() call to the request_irq() call? So that when > >>> > the msi/msix range is enabled, it doesn't actually reserve any pirq's > >>> > for any of the vectors; each request_irq/free_irq pair do the pirq > >>> > allocate-and-map/unmap... > >>> > >>> > >>> Or a third one: We keep an pirq->device lookup and inhibit free_irq() > >>> from actually calling evtchn_close() until the pci_disable_msix() is done? > >> > >> I think that's a reasonable alternative: we mask the evtchn, but do not > >> call xen_evtchn_close in shutdown_pirq for PV on HVM guests. > >> Something like (not compiled, untested): > >> > >> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c > >> index 137bd0e..3174923 100644 > >> --- a/drivers/xen/events/events_base.c > >> +++ b/drivers/xen/events/events_base.c > >> @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data) > >> return; > >> > >> mask_evtchn(evtchn); > >> - xen_evtchn_close(evtchn); > >> + if (!xen_hvm_domain()) > >> + xen_evtchn_close(evtchn); > > > > wouldn't we need to also add code to the pci_disable_msix path that > > would actually close the evtchn? Would this leave the evtchn around > > forever? > > > >> xen_irq_info_cleanup(info); > >> } > >> > >> > >> We want to keep the pirq allocated to the device - not closing the > >> evtchn seems like the right thing to do. I suggest we test for the > >> original bug too: enable/disable the network interface of an MSI capable > >> network card. 
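The behavior Stefano's untested diff aims for can be sketched with a toy model (plain Python, illustrative names only, not kernel or hypervisor code): masking the event channel on shutdown while skipping the close for HVM guests means the hypervisor never returns the pirq to its free pool.

```python
# Toy model: on a PV-on-HVM guest, shutdown_pirq masks the event channel
# but skips EVTCHNOP_close, so the pirq stays bound in the hypervisor and
# get_free_pirq cannot hand it to another device.
class ToyHyp:
    def __init__(self, free_pirqs):
        self.free = list(free_pirqs)   # pool served by get_free_pirq
        self.evtchns = {}              # pirq -> {"owner": ..., "masked": ...}

    def get_free_pirq(self):
        return self.free.pop(0) if self.free else None

    def bind(self, pirq, owner):       # models EVTCHNOP_bind_pirq
        self.evtchns[pirq] = {"owner": owner, "masked": False}

    def close(self, pirq):             # models EVTCHNOP_close: also frees the pirq
        del self.evtchns[pirq]
        self.free.append(pirq)

def shutdown_pirq(hyp, pirq, hvm_domain):
    hyp.evtchns[pirq]["masked"] = True     # mask_evtchn(evtchn)
    if not hvm_domain:
        hyp.close(pirq)                    # xen_evtchn_close(evtchn), PV only

hyp = ToyHyp(free_pirqs=[])
hyp.bind(96, "nic-A")                      # nic-A's MSI vector bound to pirq 96
shutdown_pirq(hyp, 96, hvm_domain=True)    # free_irq path on an HVM guest
assert hyp.get_free_pirq() is None         # pirq 96 cannot be handed to nic-B
assert hyp.evtchns[96]["masked"]           # but the masked evtchn lingers
```

As the reply above asks, something in the pci_disable_msix() path would still have to close that lingering event channel eventually, or it stays around until the device is fully torn down.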
> > > > I don't have a Xen hypervisor setup myself, I'm just using AWS, but > > I'll try to test this on an instance with a SRIOV/PT nic. > > I started an instance with SRIOV nics, with the latest upstream kernel > plus this patch and debug to show the pirqs. Taking an interface down > and back up uses the same pirq, because the driver only does > free_irq/request_irq, which just closes/reopens the evtchn using the > same pirq (which is cached in the struct irq_info) - it doesn't > disable/reenable the MSIX vectors. Doing a complete rmmod/modprobe of > the driver does disable and reenable the MSIX vectors, but the > hypervisor provides the same pirqs from the get_free_pirq() call that > were used before (since nothing else is asking for free pirqs). Good! Thanks for testing, it's very helpful. I believe it should work even if the hypervisor returned a different pirq though. > >>> > longer details: > >>> > > >>> > The chain of function calls starts in the initial call to configure > >>> > the msi vectors, which eventually calls __pci_enable_msix_range (or > >>> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which > >>> > either tries to re-use any cached pirq in the MSI data area, or (for > >>> > the first time setup) allocates a new pirq from the hypervisor via > >>> > PHYSDEVOP_get_free_pirq. That pirq is then reserved from the > >>> > hypervisor's perspective, but it's not mapped to anything in the guest > >>> > kernel. > >>> > > >>> > Then, the driver calls request_irq to actually start using the irq, > >>> > which calls __setup_irq to irq_startup to startup_pirq. The > >>> > startup_pirq call actually creates the evtchn and binds it to the > >>> > previously allocated pirq via EVTCHNOP_bind_pirq. > >>> > > >>> > At this point, the pirq is bound to a guest kernel evtchn (and irq) > >>> > and is in use. 
But then, when the driver doesn't want it anymore, it > >>> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that > >>> > function closes the evtchn via EVTCHNOP_close. > >>> > > >>> > Inside the hypervisor, in xen/common/event_channel.c in > >>> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq > >>> > channel is) then it unmaps the pirq mapping via > >>> > unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back > >>> > to state IRQ_UNBOUND, which makes it available for the hypervisor to > >>> > give away to anyone requesting a new pirq! > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > > >>> > > -boris > >>> > > > >>> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's > >>> > >>> msi descriptor message data for each msi/msix vector it sets up, and if > >>> > >>> it finds the vector was previously configured with a pirq, and that pirq > >>> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq > >>> > >>> from the hypervisor. However, that pirq was unmapped when the pci device > >>> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given > >>> > >>> to a different pci device. > >>> > >> Hm, but this implies that we do keep track of it. > >>> > >> > >>> > >> > >>> > >> while (true) > >>> > >> do > >>> > >> rmmod nvme > >>> > >> modprobe nvme > >>> > >> done > >>> > >> > >>> > >> Things go boom without this patch. But with this patch does this > >>> > >> still work? As in we don't run out of PIRQs? > >>> > >> > >>> > >> Thanks. > >>> > >>> This exact situation is happening in a Xen guest where multiple NVMe > >>> > >>> controllers (pci devices) are present. The NVMe driver configures each > >>> > >>> pci device's msi/msix twice; first to configure a single vector (to > >>> > >>> talk to the controller for its configuration info), and then it disables > >>> > >>> that msi/msix and re-configures with all the msi/msix it needs. 
When > >>> > >>> multiple NVMe controllers are present, this happens concurrently on all > >>> > >>> of them, and in the time between controller A calling pci_disable_msix() > >>> > >>> and then calling pci_enable_msix_range(), controller B enables its msix > >>> > >>> and gets controller A's pirq allocated from the hypervisor. Then when > >>> > >>> controller A re-configures its msix, its first vector tries to re-use > >>> > >>> the same pirq that it had before; but that pirq was allocated to > >>> > >>> controller B, and thus the Xen event channel for controller A's re-used > >>> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the > >>> > >>> pirq mapped elsewhere. > >>> > >>> > >>> > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> > >>> > >>> --- > >>> > >>> arch/x86/pci/xen.c | 23 +++++++---------------- > >>> > >>> 1 file changed, 7 insertions(+), 16 deletions(-) > >>> > >>> > >>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c > >>> > >>> index bedfab9..a00a6c0 100644 > >>> > >>> --- a/arch/x86/pci/xen.c > >>> > >>> +++ b/arch/x86/pci/xen.c > >>> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) > >>> > >>> return 1; > >>> > >>> > >>> > >>> for_each_pci_msi_entry(msidesc, dev) { > >>> > >>> - __pci_read_msi_msg(msidesc, &msg); > >>> > >>> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | > >>> > >>> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); > >>> > >>> - if (msg.data != XEN_PIRQ_MSI_DATA || > >>> > >>> - xen_irq_from_pirq(pirq) < 0) { > >>> > >>> - pirq = xen_allocate_pirq_msi(dev, msidesc); > >>> > >>> - if (pirq < 0) { > >>> > >>> - irq = -ENODEV; > >>> > >>> - goto error; > >>> > >>> - } > >>> > >>> - xen_msi_compose_msg(dev, pirq, &msg); > >>> > >>> - __pci_write_msi_msg(msidesc, &msg); > >>> > >>> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); > >>> > >>> - } else { > >>> > >>> - dev_dbg(&dev->dev, > >>> > >>> - "xen: msi already 
bound to pirq=%d\n", pirq); > >>> > >>> + pirq = xen_allocate_pirq_msi(dev, msidesc); > >>> > >>> + if (pirq < 0) { > >>> > >>> + irq = -ENODEV; > >>> > >>> + goto error; > >>> > >>> } > >>> > >>> + xen_msi_compose_msg(dev, pirq, &msg); > >>> > >>> + __pci_write_msi_msg(msidesc, &msg); > >>> > >>> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); > >>> > >>> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, > >>> > >>> (type == PCI_CAP_ID_MSI) ? nvec : 1, > >>> > >>> (type == PCI_CAP_ID_MSIX) ? > >>> > >>> -- > >>> > >>> 2.9.3 > >>> > >>> > >>> > > > >>> > > > >>> > > _______________________________________________ > >>> > > Xen-devel mailing list > >>> > > Xen-devel@lists.xen.org > >>> > > https://lists.xen.org/xen-devel > >>> > ^ permalink raw reply [flat|nested] 34+ messages in thread
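The allocate/bind/close asymmetry and the resulting race described in the tl;dr above can be condensed into a toy model (plain Python, not Xen code; all names are illustrative): the hypervisor frees the pirq on event-channel close, but the guest keeps the stale number cached in the MSI message data.

```python
# Toy model of the race: the hypervisor hands out pirqs from a free pool,
# and the guest caches the pirq number in the device's MSI message data.
class ToyHypervisor:
    def __init__(self, pirqs):
        self.free = list(pirqs)      # pirqs available via get_free_pirq
        self.bound = {}              # pirq -> owner, set by EVTCHNOP_bind_pirq

    def get_free_pirq(self):
        return self.free.pop(0)

    def bind_pirq(self, pirq, owner):
        if pirq in self.bound:
            return False             # already mapped elsewhere: bind fails
        self.bound[pirq] = owner
        return True

    def close_evtchn(self, pirq):    # EVTCHNOP_close: unbinds AND frees the pirq
        del self.bound[pirq]
        self.free.append(pirq)

def setup_msi(hv, dev, reuse_cached=True):
    """Models xen_hvm_setup_msi_irqs: optionally re-use the pirq cached
    in the MSI message data instead of asking for a fresh one."""
    if reuse_cached and dev["cached_pirq"] is not None:
        pirq = dev["cached_pirq"]
    else:
        pirq = hv.get_free_pirq()
        dev["cached_pirq"] = pirq    # models __pci_write_msi_msg
    return pirq, hv.bind_pirq(pirq, dev["name"])

hv = ToyHypervisor(pirqs=[101])
a = {"name": "nvme-A", "cached_pirq": None}
b = {"name": "nvme-B", "cached_pirq": None}

pirq_a, ok = setup_msi(hv, a)        # A gets pirq 101 and binds it
assert ok
hv.close_evtchn(pirq_a)              # A's free_irq: 101 returns to the pool
pirq_b, ok = setup_msi(hv, b)        # B's concurrent enable grabs 101
assert ok and pirq_b == pirq_a
# A re-enables MSI-X and re-uses its cached pirq: the bind fails, because
# the hypervisor already has pirq 101 mapped to B.
_, ok = setup_msi(hv, a, reuse_cached=True)
assert not ok
```

With the patch in this thread the reuse branch is gone: every setup asks the hypervisor for a fresh pirq, which sidesteps the stale cache at the cost of the starvation question raised earlier in the thread.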
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-10 19:03 ` Stefano Stabellini @ 2017-01-10 21:32 ` Dan Streetman 2017-01-10 23:28 ` Boris Ostrovsky 2017-01-11 1:25 ` Stefano Stabellini 0 siblings, 2 replies; 34+ messages in thread From: Dan Streetman @ 2017-01-10 21:32 UTC (permalink / raw) To: Stefano Stabellini Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini <sstabellini@kernel.org> wrote: > On Tue, 10 Jan 2017, Dan Streetman wrote: >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini >> > <sstabellini@kernel.org> wrote: >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky >> >>> > <boris.ostrovsky@oracle.com> wrote: >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was >> >>> > >>> unmapped and may now be in use by another pci device. The previous >> >>> > >>> pirq should never be re-used; instead a new pirq should always be >> >>> > >>> allocated from the hypervisor. >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs >> >>> > >> as we are not reusing them? >> >>> > > >> >>> > > Don't we free the pirq when we unmap it? >> >>> > >> >>> > I think this is actually a bit worse than I initially thought. After >> >>> > looking a bit closer, and I think there's an asymmetry with pirq >> >>> > allocation: >> >>> >> >>> Lets include Stefano, >> >>> >> >>> Thank you for digging in this! 
This has quite the deja-vu >> >>> feeling as I believe I hit this at some point in the past and >> >>> posted some possible ways of fixing this. But sadly I can't >> >>> find the thread. >> >> >> >> This issue seems to be caused by: >> >> >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f >> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> >> >> Date: Wed Dec 1 14:51:44 2010 +0000 >> >> >> >> xen: fix MSI setup and teardown for PV on HVM guests >> >> >> >> which was a fix to a bug: >> >> >> >> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when >> >> trying to enable the same MSI for the second time: the old MSI to pirq >> >> mapping is still valid at this point but xen_hvm_setup_msi_irqs would >> >> try to assign a new pirq anyway. >> >> A simple way to reproduce this bug is to assign an MSI capable network >> >> card to a PV on HVM guest, if the user brings down the corresponding >> >> ethernet interface and up again, Linux would fail to enable MSIs on the >> >> device. >> >> >> >> I don't remember any of the details. From the description of this bug, >> >> it seems that Xen changed behavior in the past few years: before it used >> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't >> >> have been completely freed, then reassigned to somebody else the way it >> >> is described in this email. >> >> >> >> I think we should identify the changeset or Xen version that introduced >> >> the new behavior. If it is old enough, we might be able to just revert >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the >> >> behavior conditional to the Xen version. >> > >> > Are PT devices the only MSI-capable devices available in a Xen guest? >> > That's where I'm seeing this problem, with PT NVMe devices. > > They are the main ones.
It is possible to have emulated virtio devices > with emulated MSIs, for example virtio-net. They are not in > the Xen Project CI-loop, though, so I wouldn't be surprised if they are broken > too. > > >> > I can say that on the Xen guest with NVMe PT devices I'm testing on, >> > with the patch from this thread (which essentially reverts your commit >> > above) as well as some added debug to see the pirq numbers, cycles of >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the >> > hypervisor provides essentially the same pirqs for each modprobe, >> > since they were freed by the rmmod. > > I am fine with reverting the old patch, but we need to understand what > caused the change in behavior first. Maybe somebody else with a Xen PCI > passthrough setup at hand can help with that. > > In the Xen code I can still see: > > case ECS_PIRQ: { > struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq); > > if ( !pirq ) > break; > if ( !is_hvm_domain(d1) ) > pirq_guest_unbind(d1, pirq); > > which means that pirq_guest_unbind should only be called on evtchn_close > if the guest is not an HVM guest. I tried an experiment to call get_free_pirq on both sides of an evtchn_close hcall, using two SRIOV nics. When I rmmod/modprobe, I see something interesting; each nic uses 3 MSIs, and it looks like when they shut down, each nic's 3 pirqs are not available until after the nic is done shutting down, so it seems like closing the evtchn isn't what makes the pirq free.
[3697700.390188] xen:events: creating evtchn using pirq 101 irq 290 [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291 [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292 [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293 [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294 [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295 [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98. [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1, xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() == 0, xen_pvh_domain() == 0 just to be sure, a bit of dbg so I know what domain this is :-) [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93 [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295 [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92 [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91 [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294 [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90 [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89 [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293 [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88 nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but none of those become available for get_free_pirq. [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98 aha, now nic 4 has fully finished shutting down, and nic 3 has started shutdown; we see those pirqs from nic 4 are now available. So it must not be evtchn closing that frees the pirqs. 
[3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292 [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97 [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96 [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291 [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87 [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86 [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290 [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85 so, since pci_disable_msix eventually calls xen_teardown_msi_irq() which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq() (and recompiled/rebooted) and found the pirqs have already been freed before that is called: [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295 [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294 [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293 [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100 [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293 [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99 [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98 [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294 [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97 [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96 [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295 [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95 I'm still following thru the kernel code, but it's not immediately obvious what exactly is telling the hypervisor to free the pirqs; any idea? 
From the hypervisor code, it seems that the pirq is "available" via is_free_pirq(): return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) || pirq->arch.hvm.emuirq == IRQ_UNBOUND)); when the evtchn is closed, it does: if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 ) unmap_domain_pirq_emuirq(d1, pirq->pirq); and that call to unmap_domain_pirq_emuirq does: info->arch.hvm.emuirq = IRQ_UNBOUND; so, the only thing left is to clear pirq->arch.irq, but the only place I can find that does that is clear_domain_irq_pirq(), which is only called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not seeing where either of those would be called when all the kernel is doing is disabling a pci device. > > >> >>> > tl;dr: >> >>> > >> >>> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq >> >>> > allocated, and reserved in the hypervisor >> >>> > >> >>> > request_irq() -> an event channel is opened for the specific pirq, and >> >>> > maps the pirq with the hypervisor >> >>> > >> >>> > free_irq() -> the event channel is closed, and the pirq is unmapped, >> >>> > but that unmap function also frees the pirq! The hypervisor can/will >> >>> > give it away to the next call to get_free_pirq. However, the pci >> >>> > msi/msix data area still contains the pirq number, and the next call >> >>> > to request_irq() will re-use the pirq. >> >>> > >> >>> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor >> >>> > (it's already been freed), and it also doesn't clear anything from the >> >>> > msi/msix data area, so the pirq is still cached there. >> >>> > >> >>> > >> >>> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq >> >>> > when the event channel is closed - or at least, not to change it to >> >>> > IRQ_UNBOUND state? And, the pci_disable_msix call should eventually >> >>> > call into something in the Xen guest kernel that actually does the >> >>> > pirq unmapping, and clear it from the msi data area (i.e.
>> >>> > pci_write_msi_msg) >> >>> >> >>> The problem is that Xen changes have sailed a long long time ago. >> >>> > >> >>> > Alternately, if the hypervisor doesn't change, then the call into the >> >>> > hypervisor to actually allocate a pirq needs to move from the >> >>> > pci_enable_msix_range() call to the request_irq() call? So that when >> >>> > the msi/msix range is enabled, it doesn't actually reserve any pirq's >> >>> > for any of the vectors; each request_irq/free_irq pair do the pirq >> >>> > allocate-and-map/unmap... >> >>> >> >>> >> >>> Or a third one: We keep an pirq->device lookup and inhibit free_irq() >> >>> from actually calling evtchn_close() until the pci_disable_msix() is done? >> >> >> >> I think that's a reasonable alternative: we mask the evtchn, but do not >> >> call xen_evtchn_close in shutdown_pirq for PV on HVM guests. >> >> Something like (not compiled, untested): >> >> >> >> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c >> >> index 137bd0e..3174923 100644 >> >> --- a/drivers/xen/events/events_base.c >> >> +++ b/drivers/xen/events/events_base.c >> >> @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data) >> >> return; >> >> >> >> mask_evtchn(evtchn); >> >> - xen_evtchn_close(evtchn); >> >> + if (!xen_hvm_domain()) >> >> + xen_evtchn_close(evtchn); >> > >> > wouldn't we need to also add code to the pci_disable_msix path that >> > would actually close the evtchn? Would this leave the evtchn around >> > forever? >> > >> >> xen_irq_info_cleanup(info); >> >> } >> >> >> >> >> >> We want to keep the pirq allocated to the device - not closing the >> >> evtchn seems like the right thing to do. I suggest we test for the >> >> original bug too: enable/disable the network interface of an MSI capable >> >> network card. >> > >> > I don't have a Xen hypervisor setup myself, I'm just using AWS, but >> > I'll try to test this on an instance with a SRIOV/PT nic. 
>> >> I started an instance with SRIOV nics, with the latest upstream kernel >> plus this patch and debug to show the pirqs. Taking an interface down >> and back up uses the same pirq, because the driver only does >> free_irq/request_irq, which just closes/reopens the evtchn using the >> same pirq (which is cached in the struct irq_info) - it doesn't >> disable/reenable the MSIX vectors. Doing a complete rmmod/modprobe of >> the driver does disable and reenable the MSIX vectors, but the >> hypervisor provides the same pirqs from the get_free_pirq() call that >> were used before (since nothing else is asking for free pirqs). > > Good! Thanks for testing, it's very helpful. I believe it should work > even if the hypervisor returned a different pirq though. > > >> >>> > longer details: >> >>> > >> >>> > The chain of function calls starts in the initial call to configure >> >>> > the msi vectors, which eventually calls __pci_enable_msix_range (or >> >>> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which >> >>> > either tries to re-use any cached pirq in the MSI data area, or (for >> >>> > the first time setup) allocates a new pirq from the hypervisor via >> >>> > PHYSDEVOP_get_free_pirq. That pirq is then reserved from the >> >>> > hypervisor's perspective, but it's not mapped to anything in the guest >> >>> > kernel. >> >>> > >> >>> > Then, the driver calls request_irq to actually start using the irq, >> >>> > which calls __setup_irq to irq_startup to startup_pirq. The >> >>> > startup_pirq call actually creates the evtchn and binds it to the >> >>> > previously allocated pirq via EVTCHNOP_bind_pirq. >> >>> > >> >>> > At this point, the pirq is bound to a guest kernel evtchn (and irq) >> >>> > and is in use. But then, when the driver doesn't want it anymore, it >> >>> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that >> >>> > function closes the evtchn via EVTCHNOP_close. 
>> >>> > >> >>> > Inside the hypervisor, in xen/common/event_channel.c in >> >>> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq >> >>> > channel is) then it unmaps the pirq mapping via >> >>> > unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back >> >>> > to state IRQ_UNBOUND, which makes it available for the hypervisor to >> >>> > give away to anyone requesting a new pirq! >> >>> > >> >>> > >> >>> > >> >>> > >> >>> > >> >>> > > >> >>> > > -boris >> >>> > > >> >>> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's >> >>> > >>> msi descriptor message data for each msi/msix vector it sets up, and if >> >>> > >>> it finds the vector was previously configured with a pirq, and that pirq >> >>> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq >> >>> > >>> from the hypervisor. However, that pirq was unmapped when the pci device >> >>> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given >> >>> > >>> to a different pci device. >> >>> > >> Hm, but this implies that we do keep track of it. >> >>> > >> >> >>> > >> >> >>> > >> while (true) >> >>> > >> do >> >>> > >> rmmod nvme >> >>> > >> modprobe nvme >> >>> > >> done >> >>> > >> >> >>> > >> Things go boom without this patch. But with this patch does this >> >>> > >> still work? As in we don't run out of PIRQs? >> >>> > >> >> >>> > >> Thanks. >> >>> > >>> This exact situation is happening in a Xen guest where multiple NVMe >> >>> > >>> controllers (pci devices) are present. The NVMe driver configures each >> >>> > >>> pci device's msi/msix twice; first to configure a single vector (to >> >>> > >>> talk to the controller for its configuration info), and then it disables >> >>> > >>> that msi/msix and re-configures with all the msi/msix it needs. 
When >> >>> > >>> multiple NVMe controllers are present, this happens concurrently on all >> >>> > >>> of them, and in the time between controller A calling pci_disable_msix() >> >>> > >>> and then calling pci_enable_msix_range(), controller B enables its msix >> >>> > >>> and gets controller A's pirq allocated from the hypervisor. Then when >> >>> > >>> controller A re-configures its msix, its first vector tries to re-use >> >>> > >>> the same pirq that it had before; but that pirq was allocated to >> >>> > >>> controller B, and thus the Xen event channel for controller A's re-used >> >>> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the >> >>> > >>> pirq mapped elsewhere. >> >>> > >>> >> >>> > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> >> >>> > >>> --- >> >>> > >>> arch/x86/pci/xen.c | 23 +++++++---------------- >> >>> > >>> 1 file changed, 7 insertions(+), 16 deletions(-) >> >>> > >>> >> >>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c >> >>> > >>> index bedfab9..a00a6c0 100644 >> >>> > >>> --- a/arch/x86/pci/xen.c >> >>> > >>> +++ b/arch/x86/pci/xen.c >> >>> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) >> >>> > >>> return 1; >> >>> > >>> >> >>> > >>> for_each_pci_msi_entry(msidesc, dev) { >> >>> > >>> - __pci_read_msi_msg(msidesc, &msg); >> >>> > >>> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | >> >>> > >>> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); >> >>> > >>> - if (msg.data != XEN_PIRQ_MSI_DATA || >> >>> > >>> - xen_irq_from_pirq(pirq) < 0) { >> >>> > >>> - pirq = xen_allocate_pirq_msi(dev, msidesc); >> >>> > >>> - if (pirq < 0) { >> >>> > >>> - irq = -ENODEV; >> >>> > >>> - goto error; >> >>> > >>> - } >> >>> > >>> - xen_msi_compose_msg(dev, pirq, &msg); >> >>> > >>> - __pci_write_msi_msg(msidesc, &msg); >> >>> > >>> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >> >>> > >>> - } else { >> >>> > >>> - 
dev_dbg(&dev->dev, >> >>> > >>> - "xen: msi already bound to pirq=%d\n", pirq); >> >>> > >>> + pirq = xen_allocate_pirq_msi(dev, msidesc); >> >>> > >>> + if (pirq < 0) { >> >>> > >>> + irq = -ENODEV; >> >>> > >>> + goto error; >> >>> > >>> } >> >>> > >>> + xen_msi_compose_msg(dev, pirq, &msg); >> >>> > >>> + __pci_write_msi_msg(msidesc, &msg); >> >>> > >>> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >> >>> > >>> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, >> >>> > >>> (type == PCI_CAP_ID_MSI) ? nvec : 1, >> >>> > >>> (type == PCI_CAP_ID_MSIX) ? >> >>> > >>> -- >> >>> > >>> 2.9.3 >> >>> > >>> >> >>> > > >> >>> > > >> >>> > > _______________________________________________ >> >>> > > Xen-devel mailing list >> >>> > > Xen-devel@lists.xen.org >> >>> > > https://lists.xen.org/xen-devel >> >>> >> ^ permalink raw reply [flat|nested] 34+ messages in thread
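The reading of is_free_pirq() in the message above can be captured in a few lines (a Python paraphrase of the quoted hypervisor check, not the real code): clearing emuirq on event-channel close should not, by itself, make a passthrough pirq free while arch.irq is still set.

```python
IRQ_UNBOUND = -1  # stand-in for Xen's constant

def is_free_pirq(pirq, hvm_domain=True):
    """Paraphrase of the quoted check:
    !pirq || (!pirq->arch.irq &&
              (!is_hvm_domain(d) || pirq->arch.hvm.emuirq == IRQ_UNBOUND))"""
    return pirq is None or (
        pirq["arch_irq"] == 0
        and (not hvm_domain or pirq["emuirq"] == IRQ_UNBOUND)
    )

# A bound passthrough MSI pirq in an HVM guest: host irq and emuirq both set.
p = {"arch_irq": 33, "emuirq": 9}
assert not is_free_pirq(p)
# evtchn_close -> unmap_domain_pirq_emuirq(): only emuirq is cleared...
p["emuirq"] = IRQ_UNBOUND
assert not is_free_pirq(p)   # ...so by this check the pirq is still held
# Only once arch.irq is also cleared (clear_domain_irq_pirq) is it free.
p["arch_irq"] = 0
assert is_free_pirq(p)
```

If this paraphrase is right, closing the evtchn alone should not free a passthrough pirq, which matches the first debug log above and leaves open exactly the question asked here: what clears arch.irq before xen_destroy_irq() runs.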
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-10 21:32 ` Dan Streetman @ 2017-01-10 23:28 ` Boris Ostrovsky 2017-01-11 1:25 ` Stefano Stabellini 1 sibling, 0 replies; 34+ messages in thread From: Boris Ostrovsky @ 2017-01-10 23:28 UTC (permalink / raw) To: Dan Streetman, Stefano Stabellini Cc: Konrad Rzeszutek Wilk, Dan Streetman, Bjorn Helgaas, xen-devel, linux-pci, linux-kernel

>> In the Xen code I can still see:
>>
>>     case ECS_PIRQ: {
>>         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
>>
>>         if ( !pirq )
>>             break;
>>         if ( !is_hvm_domain(d1) )
>>             pirq_guest_unbind(d1, pirq);
>>
>> which means that pirq_guest_unbind should only be called on evtchn_close
>> if the guest is not an HVM guest.

A few lines further down we have:

    pirq->evtchn = 0;
    pirq_cleanup_check(pirq, d1);
    unlink_pirq_port(chn1, d1->vcpu[chn1->notify_vcpu_id]);
#ifdef CONFIG_X86
    if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
        unmap_domain_pirq_emuirq(d1, pirq->pirq);
#endif

which is where we free up the pirq.

-boris

^ permalink raw reply [flat|nested] 34+ messages in thread
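To make the branching in the hypervisor snippets above easier to follow, here is a compressed, self-contained sketch of the ECS_PIRQ close path. This is not the real Xen code: the struct fields and the helper name are simplified stand-ins, and the real path goes through pirq_guest_unbind()/unmap_domain_pirq_emuirq().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define IRQ_UNBOUND (-1)

/* Stand-in for the relevant bits of Xen's struct pirq; the field
 * names are illustrative, not the real hypervisor layout. */
struct pirq_stub {
    int evtchn;   /* bound event channel, 0 if none */
    int irq;      /* mirrors pirq->arch.irq; 0 when unbound */
    int emuirq;   /* mirrors pirq->arch.hvm.emuirq */
};

/* Sketch of the ECS_PIRQ branch of evtchn_close quoted above: for an
 * HVM domain the close path only unmaps the emulated irq; it does not
 * clear pirq->arch.irq (pirq_guest_unbind runs only for non-HVM). */
static void evtchn_close_pirq(struct pirq_stub *p, bool hvm_domain)
{
    if (!hvm_domain) {
        /* pirq_guest_unbind(d1, pirq) would run here for PV guests */
        p->irq = 0;
    }
    p->evtchn = 0;
    if (hvm_domain && p->irq > 0)
        p->emuirq = IRQ_UNBOUND;  /* unmap_domain_pirq_emuirq() */
}
```

The asymmetry this models is the one debated later in the thread: after evtchn_close an HVM guest's pirq still has arch.irq set, so something else must clear it before the pirq counts as free.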
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-10 21:32 ` Dan Streetman 2017-01-10 23:28 ` Boris Ostrovsky @ 2017-01-11 1:25 ` Stefano Stabellini 2017-01-11 15:26 ` Dan Streetman 1 sibling, 1 reply; 34+ messages in thread From: Stefano Stabellini @ 2017-01-11 1:25 UTC (permalink / raw) To: Dan Streetman Cc: Stefano Stabellini, Konrad Rzeszutek Wilk, Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On Tue, 10 Jan 2017, Dan Streetman wrote: > On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini > <sstabellini@kernel.org> wrote: > > On Tue, 10 Jan 2017, Dan Streetman wrote: > >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: > >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini > >> > <sstabellini@kernel.org> wrote: > >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: > >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: > >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky > >> >>> > <boris.ostrovsky@oracle.com> wrote: > >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: > >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: > >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was > >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was > >> >>> > >>> unmapped and may now be in use by another pci device. The previous > >> >>> > >>> pirq should never be re-used; instead a new pirq should always be > >> >>> > >>> allocated from the hypervisor. > >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs > >> >>> > >> as we are not reusing them? > >> >>> > > > >> >>> > > Don't we free the pirq when we unmap it? > >> >>> > > >> >>> > I think this is actually a bit worse than I initially thought. 
After > >> >>> > looking a bit closer, and I think there's an asymmetry with pirq > >> >>> > allocation: > >> >>> > >> >>> Let's include Stefano, > >> >>> > >> >>> Thank you for digging into this! This has quite the deja-vu > >> >>> feeling as I believe I hit this at some point in the past and > >> >>> posted some possible ways of fixing this. But sadly I can't > >> >>> find the thread. > >> >> > >> >> This issue seems to be caused by: > >> >> > >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f > >> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> > >> >> Date: Wed Dec 1 14:51:44 2010 +0000 > >> >> > >> >> xen: fix MSI setup and teardown for PV on HVM guests > >> >> > >> >> which was a fix to a bug: > >> >> > >> >> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when > >> >> trying to enable the same MSI for the second time: the old MSI to pirq > >> >> mapping is still valid at this point but xen_hvm_setup_msi_irqs would > >> >> try to assign a new pirq anyway. > >> >> A simple way to reproduce this bug is to assign an MSI capable network > >> >> card to a PV on HVM guest, if the user brings down the corresponding > >> >> ethernet interface and up again, Linux would fail to enable MSIs on the > >> >> device. > >> >> > >> >> I don't remember any of the details. From the description of this bug, > >> >> it seems that Xen changed behavior in the past few years: before it used > >> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the > >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't > >> >> have been completely freed, then reassigned to somebody else the way it > >> >> is described in this email. > >> >> > >> >> I think we should identify the changeset or Xen version that introduced > >> >> the new behavior. If it is old enough, we might be able to just revert > >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the > >> >> behavior conditional to the Xen version.
> >> > > >> > Are PT devices the only MSI-capable devices available in a Xen guest? > >> > That's where I'm seeing this problem, with PT NVMe devices. > > > > They are the main ones. It is possible to have emulated virtio devices > > with emulated MSIs, for example virtio-net. Although they are not in > > the Xen Project CI-loop, so I wouldn't be surprised if they are broken > > too. > > > > > >> > I can say that on the Xen guest with NVMe PT devices I'm testing on, > >> > with the patch from this thread (which essentially reverts your commit > >> > above) as well as some added debug to see the pirq numbers, cycles of > >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the > >> > hypervisor provides essentially the same pirqs for each modprobe, > >> > since they were freed by the rmmod. > > > > I am fine with reverting the old patch, but we need to understand what > > caused the change in behavior first. Maybe somebody else with a Xen PCI > > passthrough setup at hand can help with that. > > > > In the Xen code I can still see: > > > > case ECS_PIRQ: { > > struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq); > > > > if ( !pirq ) > > break; > > if ( !is_hvm_domain(d1) ) > > pirq_guest_unbind(d1, pirq); > > > > which means that pirq_guest_unbind should only be called on evtchn_close > > if the guest is not an HVM guest. > > I tried an experiment to call get_free_pirq on both sides of a > evtchn_close hcall, using two SRIOV nics. When I rmmod/modprobe, I > see something interesting; each nic uses 3 MSIs, and it looks like > when they shut down, each nic's 3 pirqs are not available until after > the nic is done shutting down, so it seems like closing the evtchn > isn't what makes the pirq free.
> > [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290 > [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291 > [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292 > [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps > [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293 > [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294 > [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295 > [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps > > nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98. > > [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1, > xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() == > 0, xen_pvh_domain() == 0 > > just to be sure, a bit of dbg so I know what domain this is :-) > > [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93 > [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295 > [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92 > [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91 > [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294 > [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90 > [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89 > [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293 > [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88 > > nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but > none of those become available for get_free_pirq. > > [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98 > > aha, now nic 4 has fully finished shutting down, and nic 3 has started > shutdown; we see those pirqs from nic 4 are now available. So it must > not be evtchn closing that frees the pirqs. 
> > [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292 > [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97 > [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96 > [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291 > [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87 > [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86 > [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290 > [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85 > > > so, since pci_disable_msix eventually calls xen_teardown_msi_irq() > which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq() > (and recompiled/rebooted) and found the pirqs have already been freed > before that is called: > > [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295 > [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294 > [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293 > [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100 > [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293 > [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99 > [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98 > [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294 > [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97 > [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96 > [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295 > [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95 > > > I'm still following thru the kernel code, but it's not immediately > obvious what exactly is telling the hypervisor to free the pirqs; any > idea? 
> > From the hypervisor code, it seems that the pirq is "available" via > is_free_pirq(): > return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) || > pirq->arch.hvm.emuirq == IRQ_UNBOUND)); > > when the evtchn is closed, it does: > if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 ) > unmap_domain_pirq_emuirq(d1, pirq->pirq); > > and that call to unmap_domain_pirq_emuirq does: > info->arch.hvm.emuirq = IRQ_UNBOUND; > > so, the only thing left is to clear pirq->arch.irq, but the only place > I can find that does that is clear_domain_irq_pirq(), which is only > called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not > seeing where either of those would be called when all the kernel is > doing is disabling a pci device. Thanks for the info. I think I know what causes the pirq to be unmapped: when Linux disables msi or msix on the device, using the regular pci config space based method, QEMU (which emulates the PCI config space) tells Xen to unmap the pirq. If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs to be called a second time without msis being disabled first, then I think we can revert the patch. ^ permalink raw reply [flat|nested] 34+ messages in thread
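The availability check quoted above can be captured in a tiny stand-alone predicate. This is a sketch with stand-in types (the struct layout and helper name are made up), mirroring only the is_free_pirq() logic cited in the message:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define IRQ_UNBOUND (-1)

/* Simplified stand-in for Xen's struct pirq state; the field names
 * are illustrative, not the real hypervisor layout. */
struct pirq_state {
    int arch_irq;     /* pirq->arch.irq; 0 when unbound */
    int hvm_emuirq;   /* pirq->arch.hvm.emuirq */
};

/* Mirrors the quoted is_free_pirq() logic: for an HVM domain a pirq
 * is free only if BOTH arch.irq is clear AND the emuirq is
 * IRQ_UNBOUND; closing the evtchn only takes care of the latter. */
static bool is_free_pirq_stub(const struct pirq_state *p, bool hvm_domain)
{
    return p == NULL ||
           (!p->arch_irq &&
            (!hvm_domain || p->hvm_emuirq == IRQ_UNBOUND));
}
```

This is exactly why the debug logs above show the pirqs becoming free only after the full teardown, not at evtchn close: unmap_domain_pirq_emuirq() clears the emuirq half of the condition, while arch.irq stays set until unmap_domain_pirq() runs.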
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-11 1:25 ` Stefano Stabellini @ 2017-01-11 15:26 ` Dan Streetman 2017-01-11 18:46 ` Stefano Stabellini 0 siblings, 1 reply; 34+ messages in thread From: Dan Streetman @ 2017-01-11 15:26 UTC (permalink / raw) To: Stefano Stabellini Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini <sstabellini@kernel.org> wrote: > On Tue, 10 Jan 2017, Dan Streetman wrote: >> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini >> <sstabellini@kernel.org> wrote: >> > On Tue, 10 Jan 2017, Dan Streetman wrote: >> >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: >> >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini >> >> > <sstabellini@kernel.org> wrote: >> >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: >> >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: >> >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky >> >> >>> > <boris.ostrovsky@oracle.com> wrote: >> >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: >> >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: >> >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was >> >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was >> >> >>> > >>> unmapped and may now be in use by another pci device. The previous >> >> >>> > >>> pirq should never be re-used; instead a new pirq should always be >> >> >>> > >>> allocated from the hypervisor. >> >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs >> >> >>> > >> as we are not reusing them? >> >> >>> > > >> >> >>> > > Don't we free the pirq when we unmap it? >> >> >>> > >> >> >>> > I think this is actually a bit worse than I initially thought. 
>> [earlier messages and debug logs quoted in full; trimmed]
>> [Dan's analysis of the hypervisor code quoted in full; trimmed]
>
> Thanks for the info. I think I know what causes the pirq to be unmapped:
> when Linux disables msi or msix on the device, using the regular pci
> config space based method, QEMU (which emulates the PCI config space)
> tells Xen to unmap the pirq.

aha, via XEN_DOMCTL_unbind_pt_irq, maybe? Well that makes more sense then.

> If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs
> to be called a second time without msis being disabled first, then I
> think we can revert the patch.

It doesn't seem possible to call it twice from a correctly-behaved driver, but in case of a driver bug that does try to enable msi/msix multiple times without disabling, __pci_enable_msix() only does WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does WARN_ON(!!dev->msi_enabled); they both will continue. Maybe that should be changed to warn and also return an error, to prevent re-configuring msi/msix if already configured? Or maybe the warning is enough - the worst thing that reverting the patch does is use extra pirqs, right?

^ permalink raw reply [flat|nested] 34+ messages in thread
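The "warn and also return error" idea would look roughly like the sketch below. It is hypothetical: WARN_ON is stubbed for userspace here, the struct and function names are made up, and the real __pci_enable_msix() does considerably more than this check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's WARN_ON(): report the condition
 * and return whether it fired. */
#define WARN_ON(cond) \
    ((cond) ? (fprintf(stderr, "WARN_ON(%s)\n", #cond), true) : false)

struct pci_dev_stub {
    bool msix_enabled;
};

/* Sketch of the proposed guard: today the real code warns but
 * continues; this variant refuses to reconfigure MSI-X that a buggy
 * driver left enabled. */
static int enable_msix_guarded(struct pci_dev_stub *dev)
{
    if (WARN_ON(dev->msix_enabled))
        return -EINVAL;
    dev->msix_enabled = true;
    return 0;
}
```

The trade-off discussed in the thread: with the guard, a double-enable bug fails loudly; without it, the only cost after the revert is that a buggy driver burns extra pirqs.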
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-11 15:26 ` Dan Streetman @ 2017-01-11 18:46 ` Stefano Stabellini 2017-01-11 23:25 ` Dan Streetman 0 siblings, 1 reply; 34+ messages in thread From: Stefano Stabellini @ 2017-01-11 18:46 UTC (permalink / raw) To: Dan Streetman Cc: Stefano Stabellini, Konrad Rzeszutek Wilk, Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On Wed, 11 Jan 2017, Dan Streetman wrote: > On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini > <sstabellini@kernel.org> wrote: > > On Tue, 10 Jan 2017, Dan Streetman wrote: > >> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini > >> <sstabellini@kernel.org> wrote: > >> > On Tue, 10 Jan 2017, Dan Streetman wrote: > >> >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: > >> >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini > >> >> > <sstabellini@kernel.org> wrote: > >> >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: > >> >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: > >> >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky > >> >> >>> > <boris.ostrovsky@oracle.com> wrote: > >> >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: > >> >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: > >> >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was > >> >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was > >> >> >>> > >>> unmapped and may now be in use by another pci device. The previous > >> >> >>> > >>> pirq should never be re-used; instead a new pirq should always be > >> >> >>> > >>> allocated from the hypervisor. > >> >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs > >> >> >>> > >> as we are not reusing them? > >> >> >>> > > > >> >> >>> > > Don't we free the pirq when we unmap it? 
> >> >> >>> > > >> >> >>> > I think this is actually a bit worse than I initially thought. After > >> >> >>> > looking a bit closer, and I think there's an asymmetry with pirq > >> >> >>> > allocation: > >> >> >>> > >> >> >>> Lets include Stefano, > >> >> >>> > >> >> >>> Thank you for digging in this! This has quite the deja-vu > >> >> >>> feeling as I believe I hit this at some point in the past and > >> >> >>> posted some possible ways of fixing this. But sadly I can't > >> >> >>> find the thread. > >> >> >> > >> >> >> This issue seems to be caused by: > >> >> >> > >> >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f > >> >> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> > >> >> >> Date: Wed Dec 1 14:51:44 2010 +0000 > >> >> >> > >> >> >> xen: fix MSI setup and teardown for PV on HVM guests > >> >> >> > >> >> >> which was a fix to a bug: > >> >> >> > >> >> >> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when > >> >> >> trying to enable the same MSI for the second time: the old MSI to pirq > >> >> >> mapping is still valid at this point but xen_hvm_setup_msi_irqs would > >> >> >> try to assign a new pirq anyway. > >> >> >> A simple way to reproduce this bug is to assign an MSI capable network > >> >> >> card to a PV on HVM guest, if the user brings down the corresponding > >> >> >> ethernet interface and up again, Linux would fail to enable MSIs on the > >> >> >> device. > >> >> >> > >> >> >> I don't remember any of the details. From the description of this bug, > >> >> >> it seems that Xen changed behavior in the past few years: before it used > >> >> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the > >> >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't > >> >> >> have been completely freed, then reassigned to somebody else the way it > >> >> >> is described in this email. 
> >> >> >> > >> >> >> I think we should indentify the changeset or Xen version that introduced > >> >> >> the new behavior. If it is old enough, we might be able to just revert > >> >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the > >> >> >> behavior conditional to the Xen version. > >> >> > > >> >> > Are PT devices the only MSI-capable devices available in a Xen guest? > >> >> > That's where I'm seeing this problem, with PT NVMe devices. > >> > > >> > They are the main ones. It is possible to have emulated virtio devices > >> > with emulated MSIs, for example virtio-net. Althought they are not in > >> > the Xen Project CI-loop, so I wouldn't be surprised if they are broken > >> > too. > >> > > >> > > >> >> > I can say that on the Xen guest with NVMe PT devices I'm testing on, > >> >> > with the patch from this thread (which essentially reverts your commit > >> >> > above) as well as some added debug to see the pirq numbers, cycles of > >> >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the > >> >> > hypervisor provides essentially the same pirqs for each modprobe, > >> >> > since they were freed by the rmmod. > >> > > >> > I am fine with reverting the old patch, but we need to understand what > >> > caused the change in behavior first. Maybe somebody else with a Xen PCI > >> > passthrough setup at hand can help with that. > >> > > >> > In the Xen code I can still see: > >> > > >> > case ECS_PIRQ: { > >> > struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq); > >> > > >> > if ( !pirq ) > >> > break; > >> > if ( !is_hvm_domain(d1) ) > >> > pirq_guest_unbind(d1, pirq); > >> > > >> > which means that pirq_guest_unbind should only be called on evtchn_close > >> > if the guest is not an HVM guest. > >> > >> I tried an experiment to call get_free_pirq on both sides of a > >> evtchn_close hcall, using two SRIOV nics. 
When I rmmod/modprobe, I > >> see something interesting; each nic uses 3 MSIs, and it looks like > >> when they shut down, each nic's 3 pirqs are not available until after > >> the nic is done shutting down, so it seems like closing the evtchn > >> isn't what makes the pirq free. > >> > >> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290 > >> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291 > >> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292 > >> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps > >> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293 > >> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294 > >> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295 > >> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps > >> > >> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98. > >> > >> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1, > >> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() == > >> 0, xen_pvh_domain() == 0 > >> > >> just to be sure, a bit of dbg so I know what domain this is :-) > >> > >> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93 > >> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295 > >> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92 > >> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91 > >> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294 > >> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90 > >> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89 > >> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293 > >> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88 > >> > >> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but > >> none of 
those become available for get_free_pirq. > >> > >> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98 > >> > >> aha, now nic 4 has fully finished shutting down, and nic 3 has started > >> shutdown; we see those pirqs from nic 4 are now available. So it must > >> not be evtchn closing that frees the pirqs. > >> > >> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292 > >> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97 > >> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96 > >> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291 > >> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87 > >> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86 > >> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290 > >> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85 > >> > >> > >> so, since pci_disable_msix eventually calls xen_teardown_msi_irq() > >> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq() > >> (and recompiled/rebooted) and found the pirqs have already been freed > >> before that is called: > >> > >> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295 > >> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294 > >> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293 > >> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100 > >> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293 > >> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99 > >> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98 > >> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294 > >> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97 > >> [3700084.775653] 
xen:events: xen_destroy_irq: get_free_pirq returned pirq 96 > >> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295 > >> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95 > >> > >> > >> I'm still following thru the kernel code, but it's not immediately > >> obvious what exactly is telling the hypervisor to free the pirqs; any > >> idea? > >> > >> >From the hypervisor code, it seems that the pirq is "available" via > >> is_free_pirq(): > >> return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) || > >> pirq->arch.hvm.emuirq == IRQ_UNBOUND)); > >> > >> when the evtchn is closed, it does: > >> if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 ) > >> unmap_domain_pirq_emuirq(d1, pirq->pirq); > >> > >> and that call to unmap_domain_pirq_emuirq does: > >> info->arch.hvm.emuirq = IRQ_UNBOUND; > >> > >> so, the only thing left is to clear pirq->arch.irq,but the only place > >> I can find that does that is clear_domain_irq_pirq(), which is only > >> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not > >> seeing where either of those would be called when all the kernel is > >> doing is disabling a pci device. > > > > Thanks for the info. I think I know what causes the pirq to be unmapped: > > when Linux disables msi or msix on the device, using the regular pci > > config space based method, QEMU (which emulates the PCI config space) > > tells Xen to unmap the pirq. > > aha, via a XEN_DOMCTL_unbind_pt_irq, maybe? Well that makes more sense then. > > > > > If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs > > to be called a second time without msis being disabled first, then I > > think we can revert the patch. 
> > It doesn't seem possible to call it twice from a correctly-behaved > driver, but in case of a driver bug that does try to enable msi/msix > multiple times without disabling, __pci_enable_msix() only does > WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does > WARN_ON(!!dev->msi_enabled); they both will continue. Maybe that > should be changed to warn and also return error, to prevent > re-configuring msi/msix if already configured? Or, maybe the warning > is enough - the worst thing that reverting the patch does is use extra > pirqs, right? I think the warning is enough. Can you confirm that with af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also ifconfig eth0 down; ifconfig eth0 up still work as expected, no warnings? It looks like the patch that changed hypervisor (QEMU actually) behavior is: commit c976437c7dba9c7444fb41df45468968aaa326ad Author: Zhenzhong Duan <zhenzhong.duan@oracle.com> Date: Wed May 7 13:41:48 2014 +0000 qemu-xen: free all the pirqs for msi/msix when driver unload From this commit onward, QEMU started unmapping pirqs when MSIs are disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8, 4.7, 4.6, 4.5. The newest release without the commit is Xen 4.4. If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4 and older. Given that Xen 4.4 is out of support, I think we should go ahead with it. Opinions?
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-11 18:46 ` Stefano Stabellini @ 2017-01-11 23:25 ` Dan Streetman 2017-01-13 18:31 ` Dan Streetman 0 siblings, 1 reply; 34+ messages in thread From: Dan Streetman @ 2017-01-11 23:25 UTC (permalink / raw) To: Stefano Stabellini Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On Wed, Jan 11, 2017 at 1:46 PM, Stefano Stabellini <sstabellini@kernel.org> wrote: > On Wed, 11 Jan 2017, Dan Streetman wrote: >> On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini >> <sstabellini@kernel.org> wrote: >> > On Tue, 10 Jan 2017, Dan Streetman wrote: >> >> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini >> >> <sstabellini@kernel.org> wrote: >> >> > On Tue, 10 Jan 2017, Dan Streetman wrote: >> >> >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: >> >> >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini >> >> >> > <sstabellini@kernel.org> wrote: >> >> >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: >> >> >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: >> >> >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky >> >> >> >>> > <boris.ostrovsky@oracle.com> wrote: >> >> >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: >> >> >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: >> >> >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was >> >> >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was >> >> >> >>> > >>> unmapped and may now be in use by another pci device. The previous >> >> >> >>> > >>> pirq should never be re-used; instead a new pirq should always be >> >> >> >>> > >>> allocated from the hypervisor. >> >> >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs >> >> >> >>> > >> as we are not reusing them? 
>> >> >> >>> > > >> >> >> >>> > > Don't we free the pirq when we unmap it? >> >> >> >>> > >> >> >> >>> > I think this is actually a bit worse than I initially thought. After >> >> >> >>> > looking a bit closer, and I think there's an asymmetry with pirq >> >> >> >>> > allocation: >> >> >> >>> >> >> >> >>> Lets include Stefano, >> >> >> >>> >> >> >> >>> Thank you for digging in this! This has quite the deja-vu >> >> >> >>> feeling as I believe I hit this at some point in the past and >> >> >> >>> posted some possible ways of fixing this. But sadly I can't >> >> >> >>> find the thread. >> >> >> >> >> >> >> >> This issue seems to be caused by: >> >> >> >> >> >> >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f >> >> >> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> >> >> >> >> Date: Wed Dec 1 14:51:44 2010 +0000 >> >> >> >> >> >> >> >> xen: fix MSI setup and teardown for PV on HVM guests >> >> >> >> >> >> >> >> which was a fix to a bug: >> >> >> >> >> >> >> >> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when >> >> >> >> trying to enable the same MSI for the second time: the old MSI to pirq >> >> >> >> mapping is still valid at this point but xen_hvm_setup_msi_irqs would >> >> >> >> try to assign a new pirq anyway. >> >> >> >> A simple way to reproduce this bug is to assign an MSI capable network >> >> >> >> card to a PV on HVM guest, if the user brings down the corresponding >> >> >> >> ethernet interface and up again, Linux would fail to enable MSIs on the >> >> >> >> device. >> >> >> >> >> >> >> >> I don't remember any of the details. From the description of this bug, >> >> >> >> it seems that Xen changed behavior in the past few years: before it used >> >> >> >> to keep the pirq-MSI mapping, while today it doesn't. 
If I wrote "the >> >> >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't >> >> >> >> have been completely freed, then reassigned to somebody else the way it >> >> >> >> is described in this email. >> >> >> >> >> >> >> >> I think we should indentify the changeset or Xen version that introduced >> >> >> >> the new behavior. If it is old enough, we might be able to just revert >> >> >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the >> >> >> >> behavior conditional to the Xen version. >> >> >> > >> >> >> > Are PT devices the only MSI-capable devices available in a Xen guest? >> >> >> > That's where I'm seeing this problem, with PT NVMe devices. >> >> > >> >> > They are the main ones. It is possible to have emulated virtio devices >> >> > with emulated MSIs, for example virtio-net. Althought they are not in >> >> > the Xen Project CI-loop, so I wouldn't be surprised if they are broken >> >> > too. >> >> > >> >> > >> >> >> > I can say that on the Xen guest with NVMe PT devices I'm testing on, >> >> >> > with the patch from this thread (which essentially reverts your commit >> >> >> > above) as well as some added debug to see the pirq numbers, cycles of >> >> >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the >> >> >> > hypervisor provides essentially the same pirqs for each modprobe, >> >> >> > since they were freed by the rmmod. >> >> > >> >> > I am fine with reverting the old patch, but we need to understand what >> >> > caused the change in behavior first. Maybe somebody else with a Xen PCI >> >> > passthrough setup at hand can help with that. 
>> >> > >> >> > In the Xen code I can still see: >> >> > >> >> > case ECS_PIRQ: { >> >> > struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq); >> >> > >> >> > if ( !pirq ) >> >> > break; >> >> > if ( !is_hvm_domain(d1) ) >> >> > pirq_guest_unbind(d1, pirq); >> >> > >> >> > which means that pirq_guest_unbind should only be called on evtchn_close >> >> > if the guest is not an HVM guest. >> >> >> >> I tried an experiment to call get_free_pirq on both sides of a >> >> evtchn_close hcall, using two SRIOV nics. When I rmmod/modprobe, I >> >> see something interesting; each nic uses 3 MSIs, and it looks like >> >> when they shut down, each nic's 3 pirqs are not available until after >> >> the nic is done shutting down, so it seems like closing the evtchn >> >> isn't what makes the pirq free. >> >> >> >> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290 >> >> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291 >> >> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292 >> >> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps >> >> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293 >> >> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294 >> >> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295 >> >> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps >> >> >> >> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98. 
>> >> >> >> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1, >> >> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() == >> >> 0, xen_pvh_domain() == 0 >> >> >> >> just to be sure, a bit of dbg so I know what domain this is :-) >> >> >> >> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93 >> >> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295 >> >> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92 >> >> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91 >> >> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294 >> >> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90 >> >> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89 >> >> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293 >> >> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88 >> >> >> >> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but >> >> none of those become available for get_free_pirq. >> >> >> >> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98 >> >> >> >> aha, now nic 4 has fully finished shutting down, and nic 3 has started >> >> shutdown; we see those pirqs from nic 4 are now available. So it must >> >> not be evtchn closing that frees the pirqs. 
>> >> >> >> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292 >> >> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97 >> >> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96 >> >> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291 >> >> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87 >> >> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86 >> >> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290 >> >> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85 >> >> >> >> >> >> so, since pci_disable_msix eventually calls xen_teardown_msi_irq() >> >> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq() >> >> (and recompiled/rebooted) and found the pirqs have already been freed >> >> before that is called: >> >> >> >> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295 >> >> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294 >> >> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293 >> >> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100 >> >> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293 >> >> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99 >> >> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98 >> >> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294 >> >> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97 >> >> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96 >> >> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295 >> >> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95 >> >> >> >> >> >> I'm still following thru the kernel code, but it's not immediately >> >> obvious 
what exactly is telling the hypervisor to free the pirqs; any >> >> idea? >> >> >> >> >From the hypervisor code, it seems that the pirq is "available" via >> >> is_free_pirq(): >> >> return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) || >> >> pirq->arch.hvm.emuirq == IRQ_UNBOUND)); >> >> >> >> when the evtchn is closed, it does: >> >> if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 ) >> >> unmap_domain_pirq_emuirq(d1, pirq->pirq); >> >> >> >> and that call to unmap_domain_pirq_emuirq does: >> >> info->arch.hvm.emuirq = IRQ_UNBOUND; >> >> >> >> so, the only thing left is to clear pirq->arch.irq,but the only place >> >> I can find that does that is clear_domain_irq_pirq(), which is only >> >> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not >> >> seeing where either of those would be called when all the kernel is >> >> doing is disabling a pci device. >> > >> > Thanks for the info. I think I know what causes the pirq to be unmapped: >> > when Linux disables msi or msix on the device, using the regular pci >> > config space based method, QEMU (which emulates the PCI config space) >> > tells Xen to unmap the pirq. >> >> aha, via a XEN_DOMCTL_unbind_pt_irq, maybe? Well that makes more sense then. >> >> > >> > If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs >> > to be called a second time without msis being disabled first, then I >> > think we can revert the patch. >> >> It doesn't seem possible to call it twice from a correctly-behaved >> driver, but in case of a driver bug that does try to enable msi/msix >> multiple times without disabling, __pci_enable_msix() only does >> WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does >> WARN_ON(!!dev->msi_enabled); they both will continue. Maybe that >> should be changed to warn and also return error, to prevent >> re-configuring msi/msix if already configured? 
Or, maybe the warning >> is enough - the worst thing that reverting the patch does is use extra >> pirqs, right? > > I think the warning is enough. Can you confirm that with > af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also > > ifconfig eth0 down; ifconfig eth0 up > > still work as expected, no warnings? yes, with the patch that started this thread - which essentially reverts af42b8d12f8adec6711cb824549a0edac6a4ae8f - there are no warnings and ifconfig down ; ifconfig up work as expected. > > > It looks like the patch that changed hypervisor (QEMU actually) behavior > is: > > commit c976437c7dba9c7444fb41df45468968aaa326ad > Author: Zhenzhong Duan <zhenzhong.duan@oracle.com> > Date: Wed May 7 13:41:48 2014 +0000 > > qemu-xen: free all the pirqs for msi/msix when driver unload > > From this commit onward, QEMU started unmapping pirqs when MSIs are > disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8, > 4.7, 4.6, 4.5. The newest release without the commit is Xen 4.4. > > If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on > all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4 > and older. Given that Xen 4.4 is out of support, I think we should go > ahead with it. Opinions?
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-11 23:25 ` Dan Streetman @ 2017-01-13 18:31 ` Dan Streetman 2017-01-13 18:44 ` Stefano Stabellini 0 siblings, 1 reply; 34+ messages in thread From: Dan Streetman @ 2017-01-13 18:31 UTC (permalink / raw) To: Stefano Stabellini Cc: Konrad Rzeszutek Wilk, Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On Wed, Jan 11, 2017 at 6:25 PM, Dan Streetman <ddstreet@ieee.org> wrote: > On Wed, Jan 11, 2017 at 1:46 PM, Stefano Stabellini > <sstabellini@kernel.org> wrote: >> On Wed, 11 Jan 2017, Dan Streetman wrote: >>> On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini >>> <sstabellini@kernel.org> wrote: >>> > On Tue, 10 Jan 2017, Dan Streetman wrote: >>> >> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini >>> >> <sstabellini@kernel.org> wrote: >>> >> > On Tue, 10 Jan 2017, Dan Streetman wrote: >>> >> >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: >>> >> >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini >>> >> >> > <sstabellini@kernel.org> wrote: >>> >> >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: >>> >> >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: >>> >> >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky >>> >> >> >>> > <boris.ostrovsky@oracle.com> wrote: >>> >> >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: >>> >> >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: >>> >> >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was >>> >> >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was >>> >> >> >>> > >>> unmapped and may now be in use by another pci device. The previous >>> >> >> >>> > >>> pirq should never be re-used; instead a new pirq should always be >>> >> >> >>> > >>> allocated from the hypervisor. >>> >> >> >>> > >> Won't this cause a starvation problem? 
That is we will run out of PIRQs >>> >> >> >>> > >> as we are not reusing them? >>> >> >> >>> > > >>> >> >> >>> > > Don't we free the pirq when we unmap it? >>> >> >> >>> > >>> >> >> >>> > I think this is actually a bit worse than I initially thought. After >>> >> >> >>> > looking a bit closer, and I think there's an asymmetry with pirq >>> >> >> >>> > allocation: >>> >> >> >>> >>> >> >> >>> Lets include Stefano, >>> >> >> >>> >>> >> >> >>> Thank you for digging in this! This has quite the deja-vu >>> >> >> >>> feeling as I believe I hit this at some point in the past and >>> >> >> >>> posted some possible ways of fixing this. But sadly I can't >>> >> >> >>> find the thread. >>> >> >> >> >>> >> >> >> This issue seems to be caused by: >>> >> >> >> >>> >> >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f >>> >> >> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> >>> >> >> >> Date: Wed Dec 1 14:51:44 2010 +0000 >>> >> >> >> >>> >> >> >> xen: fix MSI setup and teardown for PV on HVM guests >>> >> >> >> >>> >> >> >> which was a fix to a bug: >>> >> >> >> >>> >> >> >> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when >>> >> >> >> trying to enable the same MSI for the second time: the old MSI to pirq >>> >> >> >> mapping is still valid at this point but xen_hvm_setup_msi_irqs would >>> >> >> >> try to assign a new pirq anyway. >>> >> >> >> A simple way to reproduce this bug is to assign an MSI capable network >>> >> >> >> card to a PV on HVM guest, if the user brings down the corresponding >>> >> >> >> ethernet interface and up again, Linux would fail to enable MSIs on the >>> >> >> >> device. >>> >> >> >> >>> >> >> >> I don't remember any of the details. From the description of this bug, >>> >> >> >> it seems that Xen changed behavior in the past few years: before it used >>> >> >> >> to keep the pirq-MSI mapping, while today it doesn't. 
If I wrote "the >>> >> >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't >>> >> >> >> have been completely freed, then reassigned to somebody else the way it >>> >> >> >> is described in this email. >>> >> >> >> >>> >> >> >> I think we should indentify the changeset or Xen version that introduced >>> >> >> >> the new behavior. If it is old enough, we might be able to just revert >>> >> >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the >>> >> >> >> behavior conditional to the Xen version. >>> >> >> > >>> >> >> > Are PT devices the only MSI-capable devices available in a Xen guest? >>> >> >> > That's where I'm seeing this problem, with PT NVMe devices. >>> >> > >>> >> > They are the main ones. It is possible to have emulated virtio devices >>> >> > with emulated MSIs, for example virtio-net. Althought they are not in >>> >> > the Xen Project CI-loop, so I wouldn't be surprised if they are broken >>> >> > too. >>> >> > >>> >> > >>> >> >> > I can say that on the Xen guest with NVMe PT devices I'm testing on, >>> >> >> > with the patch from this thread (which essentially reverts your commit >>> >> >> > above) as well as some added debug to see the pirq numbers, cycles of >>> >> >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the >>> >> >> > hypervisor provides essentially the same pirqs for each modprobe, >>> >> >> > since they were freed by the rmmod. >>> >> > >>> >> > I am fine with reverting the old patch, but we need to understand what >>> >> > caused the change in behavior first. Maybe somebody else with a Xen PCI >>> >> > passthrough setup at hand can help with that. 
>>> >> > >>> >> > In the Xen code I can still see: >>> >> > >>> >> > case ECS_PIRQ: { >>> >> > struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq); >>> >> > >>> >> > if ( !pirq ) >>> >> > break; >>> >> > if ( !is_hvm_domain(d1) ) >>> >> > pirq_guest_unbind(d1, pirq); >>> >> > >>> >> > which means that pirq_guest_unbind should only be called on evtchn_close >>> >> > if the guest is not an HVM guest. >>> >> >>> >> I tried an experiment to call get_free_pirq on both sides of a >>> >> evtchn_close hcall, using two SRIOV nics. When I rmmod/modprobe, I >>> >> see something interesting; each nic uses 3 MSIs, and it looks like >>> >> when they shut down, each nic's 3 pirqs are not available until after >>> >> the nic is done shutting down, so it seems like closing the evtchn >>> >> isn't what makes the pirq free. >>> >> >>> >> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290 >>> >> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291 >>> >> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292 >>> >> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps >>> >> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293 >>> >> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294 >>> >> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295 >>> >> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps >>> >> >>> >> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98. 
>>> >> >>> >> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1, >>> >> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() == >>> >> 0, xen_pvh_domain() == 0 >>> >> >>> >> just to be sure, a bit of dbg so I know what domain this is :-) >>> >> >>> >> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93 >>> >> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295 >>> >> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92 >>> >> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91 >>> >> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294 >>> >> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90 >>> >> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89 >>> >> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293 >>> >> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88 >>> >> >>> >> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but >>> >> none of those become available for get_free_pirq. >>> >> >>> >> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98 >>> >> >>> >> aha, now nic 4 has fully finished shutting down, and nic 3 has started >>> >> shutdown; we see those pirqs from nic 4 are now available. So it must >>> >> not be evtchn closing that frees the pirqs. 
>>> >> >>> >> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292 >>> >> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97 >>> >> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96 >>> >> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291 >>> >> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87 >>> >> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86 >>> >> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290 >>> >> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85 >>> >> >>> >> >>> >> so, since pci_disable_msix eventually calls xen_teardown_msi_irq() >>> >> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq() >>> >> (and recompiled/rebooted) and found the pirqs have already been freed >>> >> before that is called: >>> >> >>> >> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295 >>> >> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294 >>> >> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293 >>> >> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100 >>> >> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293 >>> >> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99 >>> >> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98 >>> >> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294 >>> >> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97 >>> >> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96 >>> >> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295 >>> >> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95 >>> >> >>> >> >>> >> I'm still following thru the kernel code, but it's 
not immediately >>> >> obvious what exactly is telling the hypervisor to free the pirqs; any >>> >> idea? >>> >> >>> >> >From the hypervisor code, it seems that the pirq is "available" via >>> >> is_free_pirq(): >>> >> return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) || >>> >> pirq->arch.hvm.emuirq == IRQ_UNBOUND)); >>> >> >>> >> when the evtchn is closed, it does: >>> >> if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 ) >>> >> unmap_domain_pirq_emuirq(d1, pirq->pirq); >>> >> >>> >> and that call to unmap_domain_pirq_emuirq does: >>> >> info->arch.hvm.emuirq = IRQ_UNBOUND; >>> >> >>> >> so, the only thing left is to clear pirq->arch.irq,but the only place >>> >> I can find that does that is clear_domain_irq_pirq(), which is only >>> >> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not >>> >> seeing where either of those would be called when all the kernel is >>> >> doing is disabling a pci device. >>> > >>> > Thanks for the info. I think I know what causes the pirq to be unmapped: >>> > when Linux disables msi or msix on the device, using the regular pci >>> > config space based method, QEMU (which emulates the PCI config space) >>> > tells Xen to unmap the pirq. >>> >>> aha, via a XEN_DOMCTL_unbind_pt_irq, maybe? Well that makes more sense then. >>> >>> > >>> > If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs >>> > to be called a second time without msis being disabled first, then I >>> > think we can revert the patch. >>> >>> It doesn't seem possible to call it twice from a correctly-behaved >>> driver, but in case of a driver bug that does try to enable msi/msix >>> multiple times without disabling, __pci_enable_msix() only does >>> WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does >>> WARN_ON(!!dev->msi_enabled); they both will continue. Maybe that >>> should be changed to warn and also return error, to prevent >>> re-configuring msi/msix if already configured? 
Or, maybe the warning >>> is enough - the worst thing that reverting the patch does is use extra >>> pirqs, right? >> >> I think the warning is enough. Can you confirm that with >> af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also >> >> ifconfig eth0 down; ifconfig eth0 up >> >> still work as expected, no warnings? > > yes, with the patch that started this thread - which essentially > reverts af42b8d12f8adec6711cb824549a0edac6a4ae8f - there are no > warnings and ifconfig down ; ifconfig up work as expected. > >> >> >> It looks like the patch that changed hypervisor (QEMU actually) behavior >> is: >> >> commit c976437c7dba9c7444fb41df45468968aaa326ad >> Author: Zhenzhong Duan <zhenzhong.duan@oracle.com> >> Date: Wed May 7 13:41:48 2014 +0000 >> >> qemu-xen: free all the pirqs for msi/msix when driver unload >> >> From this commit onward, QEMU started unmapping pirqs when MSIs are >> disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8, >> 4.7, 4.6, 4.5. The newest release without the commit is Xen 4.4. >> >> If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on >> all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4 >> and older. Given that Xen 4.4 is out of support, I think we should go >> ahead with it. Opinions? Looks like there are no complaints; is my patch from the start of this thread ok to use, or can you craft a patch to use? My patch's description could use updating to add some of the info/background from this discussion...
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-13 18:31 ` Dan Streetman @ 2017-01-13 18:44 ` Stefano Stabellini 2017-01-13 20:00 ` Boris Ostrovsky 0 siblings, 1 reply; 34+ messages in thread From: Stefano Stabellini @ 2017-01-13 18:44 UTC (permalink / raw) To: jgross, boris.ostrovsky Cc: Stefano Stabellini, Konrad Rzeszutek Wilk, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci, ddstreet On Fri, 13 Jan 2017, Dan Streetman wrote: > On Wed, Jan 11, 2017 at 6:25 PM, Dan Streetman <ddstreet@ieee.org> wrote: > > On Wed, Jan 11, 2017 at 1:46 PM, Stefano Stabellini > > <sstabellini@kernel.org> wrote: > >> On Wed, 11 Jan 2017, Dan Streetman wrote: > >>> On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini > >>> <sstabellini@kernel.org> wrote: > >>> > On Tue, 10 Jan 2017, Dan Streetman wrote: > >>> >> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini > >>> >> <sstabellini@kernel.org> wrote: > >>> >> > On Tue, 10 Jan 2017, Dan Streetman wrote: > >>> >> >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: > >>> >> >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini > >>> >> >> > <sstabellini@kernel.org> wrote: > >>> >> >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: > >>> >> >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: > >>> >> >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky > >>> >> >> >>> > <boris.ostrovsky@oracle.com> wrote: > >>> >> >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: > >>> >> >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: > >>> >> >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was > >>> >> >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was > >>> >> >> >>> > >>> unmapped and may now be in use by another pci device. 
The previous > >>> >> >> >>> > >>> pirq should never be re-used; instead a new pirq should always be > >>> >> >> >>> > >>> allocated from the hypervisor. > >>> >> >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs > >>> >> >> >>> > >> as we are not reusing them? > >>> >> >> >>> > > > >>> >> >> >>> > > Don't we free the pirq when we unmap it? > >>> >> >> >>> > > >>> >> >> >>> > I think this is actually a bit worse than I initially thought. After > >>> >> >> >>> > looking a bit closer, and I think there's an asymmetry with pirq > >>> >> >> >>> > allocation: > >>> >> >> >>> > >>> >> >> >>> Lets include Stefano, > >>> >> >> >>> > >>> >> >> >>> Thank you for digging in this! This has quite the deja-vu > >>> >> >> >>> feeling as I believe I hit this at some point in the past and > >>> >> >> >>> posted some possible ways of fixing this. But sadly I can't > >>> >> >> >>> find the thread. > >>> >> >> >> > >>> >> >> >> This issue seems to be caused by: > >>> >> >> >> > >>> >> >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f > >>> >> >> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> > >>> >> >> >> Date: Wed Dec 1 14:51:44 2010 +0000 > >>> >> >> >> > >>> >> >> >> xen: fix MSI setup and teardown for PV on HVM guests > >>> >> >> >> > >>> >> >> >> which was a fix to a bug: > >>> >> >> >> > >>> >> >> >> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when > >>> >> >> >> trying to enable the same MSI for the second time: the old MSI to pirq > >>> >> >> >> mapping is still valid at this point but xen_hvm_setup_msi_irqs would > >>> >> >> >> try to assign a new pirq anyway. > >>> >> >> >> A simple way to reproduce this bug is to assign an MSI capable network > >>> >> >> >> card to a PV on HVM guest, if the user brings down the corresponding > >>> >> >> >> ethernet interface and up again, Linux would fail to enable MSIs on the > >>> >> >> >> device. 
> >>> >> >> >> > >>> >> >> >> I don't remember any of the details. From the description of this bug, > >>> >> >> >> it seems that Xen changed behavior in the past few years: before it used > >>> >> >> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the > >>> >> >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't > >>> >> >> >> have been completely freed, then reassigned to somebody else the way it > >>> >> >> >> is described in this email. > >>> >> >> >> > >>> >> >> >> I think we should identify the changeset or Xen version that introduced > >>> >> >> >> the new behavior. If it is old enough, we might be able to just revert > >>> >> >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the > >>> >> >> >> behavior conditional on the Xen version. > >>> >> >> > > >>> >> >> > Are PT devices the only MSI-capable devices available in a Xen guest? > >>> >> >> > That's where I'm seeing this problem, with PT NVMe devices. > >>> >> > > >>> >> > They are the main ones. It is possible to have emulated virtio devices > >>> >> > with emulated MSIs, for example virtio-net. They are not in > >>> >> > the Xen Project CI-loop, though, so I wouldn't be surprised if they are broken > >>> >> > too. > >>> >> > > >>> >> > > >>> >> >> > I can say that on the Xen guest with NVMe PT devices I'm testing on, > >>> >> >> > with the patch from this thread (which essentially reverts your commit > >>> >> >> > above) as well as some added debug to see the pirq numbers, cycles of > >>> >> >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the > >>> >> >> > hypervisor provides essentially the same pirqs for each modprobe, > >>> >> >> > since they were freed by the rmmod. > >>> >> > > >>> >> > I am fine with reverting the old patch, but we need to understand what > >>> >> > caused the change in behavior first. Maybe somebody else with a Xen PCI > >>> >> > passthrough setup at hand can help with that. 
> >>> >> > > >>> >> > In the Xen code I can still see: > >>> >> > > >>> >> > case ECS_PIRQ: { > >>> >> > struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq); > >>> >> > > >>> >> > if ( !pirq ) > >>> >> > break; > >>> >> > if ( !is_hvm_domain(d1) ) > >>> >> > pirq_guest_unbind(d1, pirq); > >>> >> > > >>> >> > which means that pirq_guest_unbind should only be called on evtchn_close > >>> >> > if the guest is not an HVM guest. > >>> >> > >>> >> I tried an experiment to call get_free_pirq on both sides of a > >>> >> evtchn_close hcall, using two SRIOV nics. When I rmmod/modprobe, I > >>> >> see something interesting; each nic uses 3 MSIs, and it looks like > >>> >> when they shut down, each nic's 3 pirqs are not available until after > >>> >> the nic is done shutting down, so it seems like closing the evtchn > >>> >> isn't what makes the pirq free. > >>> >> > >>> >> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290 > >>> >> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291 > >>> >> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292 > >>> >> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps > >>> >> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293 > >>> >> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294 > >>> >> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295 > >>> >> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps > >>> >> > >>> >> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98. 
> >>> >> > >>> >> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1, > >>> >> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() == > >>> >> 0, xen_pvh_domain() == 0 > >>> >> > >>> >> just to be sure, a bit of dbg so I know what domain this is :-) > >>> >> > >>> >> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93 > >>> >> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295 > >>> >> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92 > >>> >> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91 > >>> >> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294 > >>> >> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90 > >>> >> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89 > >>> >> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293 > >>> >> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88 > >>> >> > >>> >> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but > >>> >> none of those become available for get_free_pirq. > >>> >> > >>> >> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98 > >>> >> > >>> >> aha, now nic 4 has fully finished shutting down, and nic 3 has started > >>> >> shutdown; we see those pirqs from nic 4 are now available. So it must > >>> >> not be evtchn closing that frees the pirqs. 
> >>> >> > >>> >> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292 > >>> >> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97 > >>> >> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96 > >>> >> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291 > >>> >> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87 > >>> >> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86 > >>> >> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290 > >>> >> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85 > >>> >> > >>> >> > >>> >> so, since pci_disable_msix eventually calls xen_teardown_msi_irq() > >>> >> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq() > >>> >> (and recompiled/rebooted) and found the pirqs have already been freed > >>> >> before that is called: > >>> >> > >>> >> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295 > >>> >> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294 > >>> >> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293 > >>> >> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100 > >>> >> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293 > >>> >> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99 > >>> >> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98 > >>> >> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294 > >>> >> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97 > >>> >> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96 > >>> >> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295 > >>> >> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95 > >>> >> > >>> >> 
> >>> >> I'm still following through the kernel code, but it's not immediately > >>> >> obvious what exactly is telling the hypervisor to free the pirqs; any > >>> >> idea? > >>> >> > >>> >> From the hypervisor code, it seems that the pirq is "available" via > >>> >> is_free_pirq(): > >>> >> return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) || > >>> >> pirq->arch.hvm.emuirq == IRQ_UNBOUND)); > >>> >> > >>> >> when the evtchn is closed, it does: > >>> >> if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 ) > >>> >> unmap_domain_pirq_emuirq(d1, pirq->pirq); > >>> >> > >>> >> and that call to unmap_domain_pirq_emuirq does: > >>> >> info->arch.hvm.emuirq = IRQ_UNBOUND; > >>> >> > >>> >> so, the only thing left is to clear pirq->arch.irq, but the only place > >>> >> I can find that does that is clear_domain_irq_pirq(), which is only > >>> >> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not > >>> >> seeing where either of those would be called when all the kernel is > >>> >> doing is disabling a pci device. > >>> > > >>> > Thanks for the info. I think I know what causes the pirq to be unmapped: > >>> > when Linux disables msi or msix on the device, using the regular pci > >>> > config space based method, QEMU (which emulates the PCI config space) > >>> > tells Xen to unmap the pirq. > >>> > >>> aha, via a XEN_DOMCTL_unbind_pt_irq, maybe? Well that makes more sense then. > >>> > >>> > > >>> > If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs > >>> > to be called a second time without msis being disabled first, then I > >>> > think we can revert the patch. > >>> > >>> It doesn't seem possible to call it twice from a correctly-behaved > >>> driver, but in case of a driver bug that does try to enable msi/msix > >>> multiple times without disabling, __pci_enable_msix() only does > >>> WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does > >>> WARN_ON(!!dev->msi_enabled); they both will continue. 
Maybe that > >>> should be changed to warn and also return error, to prevent > >>> re-configuring msi/msix if already configured? Or, maybe the warning > >>> is enough - the worst thing that reverting the patch does is use extra > >>> pirqs, right? > >> > >> I think the warning is enough. Can you confirm that with > >> af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also > >> > >> ifconfig eth0 down; ifconfig eth0 up > >> > >> still work as expected, no warnings? > > > > yes, with the patch that started this thread - which essentially > > reverts af42b8d12f8adec6711cb824549a0edac6a4ae8f - there are no > > warnings and ifconfig down ; ifconfig up work as expected. > > > >> > >> > >> It looks like the patch that changed hypervisor (QEMU actually) behavior > >> is: > >> > >> commit c976437c7dba9c7444fb41df45468968aaa326ad > >> Author: Zhenzhong Duan <zhenzhong.duan@oracle.com> > >> Date: Wed May 7 13:41:48 2014 +0000 > >> > >> qemu-xen: free all the pirqs for msi/msix when driver unload > >> > >> From this commit onward, QEMU started unmapping pirqs when MSIs are > >> disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8, > >> 4.7, 4.6, 4.5. The newest release without the commit is Xen 4.4. > >> > >> If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on > >> all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4 > >> and older. Given that Xen 4.4 is out of support, I think we should go > >> ahead with it. Opinions? > > Looks like there's no complaints; is my patch from the start of this > thread ok to use, or can you craft a patch to use? My patch's > description could use updating to add some of the info/background from > this discussion... Hi Dan, I would like an explicit Ack from the other maintainers, Boris and Juergen. Let me place them in To: to make it more obvious. ^ permalink raw reply [flat|nested] 34+ messages in thread
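[Editor's note: the two-stage freeing observed in the debug logs above — pirqs not returned by get_free_pirq right after evtchn close, only after the device finishes shutting down — matches the `is_free_pirq()` condition quoted in the thread. A reduced, compilable model of that condition for an HVM domain (field and function names simplified from the Xen source):]

```c
/*
 * Reduced model (not the Xen source) of is_free_pirq() for an HVM
 * domain: closing the event channel only unbinds the emuirq; the
 * pirq becomes free only after the later unmap clears arch.irq too.
 */
#include <assert.h>
#include <stdbool.h>

#define IRQ_UNBOUND (-1)

struct pirq_model {
	int arch_irq;    /* nonzero while the pirq is mapped to an IRQ */
	int hvm_emuirq;  /* IRQ_UNBOUND once the emuirq is unmapped    */
};

/* is_free_pirq() for an HVM domain: both fields must be cleared */
static bool is_free_pirq_hvm(const struct pirq_model *p)
{
	return p->arch_irq == 0 && p->hvm_emuirq == IRQ_UNBOUND;
}

/* evtchn close on HVM calls unmap_domain_pirq_emuirq() only */
static void evtchn_close_hvm(struct pirq_model *p)
{
	p->hvm_emuirq = IRQ_UNBOUND;
}

/* the pirq mapping itself is torn down later (unmap_domain_pirq path) */
static void unmap_pirq(struct pirq_model *p)
{
	p->arch_irq = 0;
}
```

This is why closing the event channel alone did not make the pirq available in the experiment: both the emuirq unbind and the pirq unmap (triggered by QEMU when MSI is disabled) must have happened.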
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-13 18:44 ` Stefano Stabellini @ 2017-01-13 20:00 ` Boris Ostrovsky 2017-01-13 20:07 ` Dan Streetman 2017-01-13 20:13 ` Dan Streetman 0 siblings, 2 replies; 34+ messages in thread From: Boris Ostrovsky @ 2017-01-13 20:00 UTC (permalink / raw) To: Stefano Stabellini, jgross, Konrad Rzeszutek Wilk Cc: Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci, ddstreet On 01/13/2017 01:44 PM, Stefano Stabellini wrote: > On Fri, 13 Jan 2017, Dan Streetman wrote: >> On Wed, Jan 11, 2017 at 6:25 PM, Dan Streetman <ddstreet@ieee.org> wrote: >>> On Wed, Jan 11, 2017 at 1:46 PM, Stefano Stabellini >>> <sstabellini@kernel.org> wrote: >>>> On Wed, 11 Jan 2017, Dan Streetman wrote: >>>>> On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini >>>>> <sstabellini@kernel.org> wrote: >>>>>> On Tue, 10 Jan 2017, Dan Streetman wrote: >>>>>>> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini >>>>>>> <sstabellini@kernel.org> wrote: >>>>>>>> On Tue, 10 Jan 2017, Dan Streetman wrote: >>>>>>>>> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: >>>>>>>>>> On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini >>>>>>>>>> <sstabellini@kernel.org> wrote: >>>>>>>>>>> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: >>>>>>>>>>>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: >>>>>>>>>>>>> On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky >>>>>>>>>>>>> <boris.ostrovsky@oracle.com> wrote: >>>>>>>>>>>>>> On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: >>>>>>>>>>>>>>> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: >>>>>>>>>>>>>>>> Do not read a pci device's msi message data to see if a pirq was >>>>>>>>>>>>>>>> previously configured for the device's msi/msix, as the old pirq was >>>>>>>>>>>>>>>> unmapped and may now be in use by another pci device. 
The previous >>>>>>>>>>>>>>>> pirq should never be re-used; instead a new pirq should always be >>>>>>>>>>>>>>>> allocated from the hypervisor. >>>>>>>>>>>>>>> Won't this cause a starvation problem? That is we will run out of PIRQs >>>>>>>>>>>>>>> as we are not reusing them? >>>>>>>>>>>>>> Don't we free the pirq when we unmap it? >>>>>>>>>>>>> I think this is actually a bit worse than I initially thought. After >>>>>>>>>>>>> looking a bit closer, and I think there's an asymmetry with pirq >>>>>>>>>>>>> allocation: >>>>>>>>>>>> Lets include Stefano, >>>>>>>>>>>> >>>>>>>>>>>> Thank you for digging in this! This has quite the deja-vu >>>>>>>>>>>> feeling as I believe I hit this at some point in the past and >>>>>>>>>>>> posted some possible ways of fixing this. But sadly I can't >>>>>>>>>>>> find the thread. >>>>>>>>>>> This issue seems to be caused by: >>>>>>>>>>> >>>>>>>>>>> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f >>>>>>>>>>> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> >>>>>>>>>>> Date: Wed Dec 1 14:51:44 2010 +0000 >>>>>>>>>>> >>>>>>>>>>> xen: fix MSI setup and teardown for PV on HVM guests >>>>>>>>>>> >>>>>>>>>>> which was a fix to a bug: >>>>>>>>>>> >>>>>>>>>>> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when >>>>>>>>>>> trying to enable the same MSI for the second time: the old MSI to pirq >>>>>>>>>>> mapping is still valid at this point but xen_hvm_setup_msi_irqs would >>>>>>>>>>> try to assign a new pirq anyway. >>>>>>>>>>> A simple way to reproduce this bug is to assign an MSI capable network >>>>>>>>>>> card to a PV on HVM guest, if the user brings down the corresponding >>>>>>>>>>> ethernet interface and up again, Linux would fail to enable MSIs on the >>>>>>>>>>> device. >>>>>>>>>>> >>>>>>>>>>> I don't remember any of the details. 
From the description of this bug, >>>>>>>>>>> it seems that Xen changed behavior in the past few years: before it used >>>>>>>>>>> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the >>>>>>>>>>> old MSI to pirq mapping is still valid at this point", the pirq couldn't >>>>>>>>>>> have been completely freed, then reassigned to somebody else the way it >>>>>>>>>>> is described in this email. >>>>>>>>>>> >>>>>>>>>>> I think we should identify the changeset or Xen version that introduced >>>>>>>>>>> the new behavior. If it is old enough, we might be able to just revert >>>>>>>>>>> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the >>>>>>>>>>> behavior conditional on the Xen version. >>>>>>>>>> Are PT devices the only MSI-capable devices available in a Xen guest? >>>>>>>>>> That's where I'm seeing this problem, with PT NVMe devices. >>>>>>>> They are the main ones. It is possible to have emulated virtio devices >>>>>>>> with emulated MSIs, for example virtio-net. They are not in >>>>>>>> the Xen Project CI-loop, though, so I wouldn't be surprised if they are broken >>>>>>>> too. >>>>>>>> >>>>>>>> >>>>>>>>>> I can say that on the Xen guest with NVMe PT devices I'm testing on, >>>>>>>>>> with the patch from this thread (which essentially reverts your commit >>>>>>>>>> above) as well as some added debug to see the pirq numbers, cycles of >>>>>>>>>> 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the >>>>>>>>>> hypervisor provides essentially the same pirqs for each modprobe, >>>>>>>>>> since they were freed by the rmmod. >>>>>>>> I am fine with reverting the old patch, but we need to understand what >>>>>>>> caused the change in behavior first. Maybe somebody else with a Xen PCI >>>>>>>> passthrough setup at hand can help with that. 
>>>>>>>> >>>>>>>> In the Xen code I can still see: >>>>>>>> >>>>>>>> case ECS_PIRQ: { >>>>>>>> struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq); >>>>>>>> >>>>>>>> if ( !pirq ) >>>>>>>> break; >>>>>>>> if ( !is_hvm_domain(d1) ) >>>>>>>> pirq_guest_unbind(d1, pirq); >>>>>>>> >>>>>>>> which means that pirq_guest_unbind should only be called on evtchn_close >>>>>>>> if the guest is not an HVM guest. >>>>>>> I tried an experiment to call get_free_pirq on both sides of a >>>>>>> evtchn_close hcall, using two SRIOV nics. When I rmmod/modprobe, I >>>>>>> see something interesting; each nic uses 3 MSIs, and it looks like >>>>>>> when they shut down, each nic's 3 pirqs are not available until after >>>>>>> the nic is done shutting down, so it seems like closing the evtchn >>>>>>> isn't what makes the pirq free. >>>>>>> >>>>>>> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290 >>>>>>> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291 >>>>>>> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292 >>>>>>> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps >>>>>>> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293 >>>>>>> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294 >>>>>>> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295 >>>>>>> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps >>>>>>> >>>>>>> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98. 
>>>>>>> >>>>>>> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1, >>>>>>> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() == >>>>>>> 0, xen_pvh_domain() == 0 >>>>>>> >>>>>>> just to be sure, a bit of dbg so I know what domain this is :-) >>>>>>> >>>>>>> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93 >>>>>>> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295 >>>>>>> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92 >>>>>>> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91 >>>>>>> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294 >>>>>>> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90 >>>>>>> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89 >>>>>>> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293 >>>>>>> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88 >>>>>>> >>>>>>> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but >>>>>>> none of those become available for get_free_pirq. >>>>>>> >>>>>>> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98 >>>>>>> >>>>>>> aha, now nic 4 has fully finished shutting down, and nic 3 has started >>>>>>> shutdown; we see those pirqs from nic 4 are now available. So it must >>>>>>> not be evtchn closing that frees the pirqs. 
>>>>>>> >>>>>>> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292 >>>>>>> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97 >>>>>>> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96 >>>>>>> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291 >>>>>>> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87 >>>>>>> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86 >>>>>>> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290 >>>>>>> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85 >>>>>>> >>>>>>> >>>>>>> so, since pci_disable_msix eventually calls xen_teardown_msi_irq() >>>>>>> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq() >>>>>>> (and recompiled/rebooted) and found the pirqs have already been freed >>>>>>> before that is called: >>>>>>> >>>>>>> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295 >>>>>>> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294 >>>>>>> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293 >>>>>>> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100 >>>>>>> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293 >>>>>>> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99 >>>>>>> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98 >>>>>>> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294 >>>>>>> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97 >>>>>>> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96 >>>>>>> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295 >>>>>>> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95 >>>>>>> >>>>>>> >>>>>>> I'm still following 
through the kernel code, but it's not immediately >>>>>>> obvious what exactly is telling the hypervisor to free the pirqs; any >>>>>>> idea? >>>>>>> >>>>>>> From the hypervisor code, it seems that the pirq is "available" via >>>>>>> is_free_pirq(): >>>>>>> return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) || >>>>>>> pirq->arch.hvm.emuirq == IRQ_UNBOUND)); >>>>>>> >>>>>>> when the evtchn is closed, it does: >>>>>>> if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 ) >>>>>>> unmap_domain_pirq_emuirq(d1, pirq->pirq); >>>>>>> >>>>>>> and that call to unmap_domain_pirq_emuirq does: >>>>>>> info->arch.hvm.emuirq = IRQ_UNBOUND; >>>>>>> >>>>>>> so, the only thing left is to clear pirq->arch.irq, but the only place >>>>>>> I can find that does that is clear_domain_irq_pirq(), which is only >>>>>>> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not >>>>>>> seeing where either of those would be called when all the kernel is >>>>>>> doing is disabling a pci device. >>>>>> Thanks for the info. I think I know what causes the pirq to be unmapped: >>>>>> when Linux disables msi or msix on the device, using the regular pci >>>>>> config space based method, QEMU (which emulates the PCI config space) >>>>>> tells Xen to unmap the pirq. >>>>> aha, via a XEN_DOMCTL_unbind_pt_irq, maybe? Well that makes more sense then. >>>>> >>>>>> If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs >>>>>> to be called a second time without msis being disabled first, then I >>>>>> think we can revert the patch. >>>>> It doesn't seem possible to call it twice from a correctly-behaved >>>>> driver, but in case of a driver bug that does try to enable msi/msix >>>>> multiple times without disabling, __pci_enable_msix() only does >>>>> WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does >>>>> WARN_ON(!!dev->msi_enabled); they both will continue. 
Maybe that >>>>> should be changed to warn and also return error, to prevent >>>>> re-configuring msi/msix if already configured? Or, maybe the warning >>>>> is enough - the worst thing that reverting the patch does is use extra >>>>> pirqs, right? >>>> I think the warning is enough. Can you confirm that with >>>> af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also >>>> >>>> ifconfig eth0 down; ifconfig eth0 up >>>> >>>> still work as expected, no warnings? >>> yes, with the patch that started this thread - which essentially >>> reverts af42b8d12f8adec6711cb824549a0edac6a4ae8f - there are no >>> warnings and ifconfig down ; ifconfig up work as expected. >>> >>>> >>>> It looks like the patch that changed hypervisor (QEMU actually) behavior >>>> is: >>>> >>>> commit c976437c7dba9c7444fb41df45468968aaa326ad >>>> Author: Zhenzhong Duan <zhenzhong.duan@oracle.com> >>>> Date: Wed May 7 13:41:48 2014 +0000 >>>> >>>> qemu-xen: free all the pirqs for msi/msix when driver unload >>>> >>>> From this commit onward, QEMU started unmapping pirqs when MSIs are >>>> disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8, >>>> 4.7, 4.6, 4.5. The newest release without the commit is Xen 4.4. >>>> >>>> If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on >>>> all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4 >>>> and older. Given that Xen 4.4 is out of support, I think we should go >>>> ahead with it. Opinions? >> Looks like there's no complaints; is my patch from the start of this >> thread ok to use, or can you craft a patch to use? My patch's >> description could use updating to add some of the info/background from >> this discussion... > Hi Dan, I would like an explicit Ack from the other maintainers, Boris > and Juergen. Let me place them in To: to make it more obvious. Where is the patch? I don't think 'git revert' will work. And Konrad will need to ack it too as he is Xen-PCI maintainer. 
-boris ^ permalink raw reply [flat|nested] 34+ messages in thread
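[Editor's note: the discussion quoted above points out that `__pci_enable_msix()` and `__pci_enable_msi_range()` only `WARN_ON` a double enable and then continue. A hedged sketch of the stricter guard floated there — warn and also return an error — using an illustrative device model, not the actual PCI core code:]

```c
/*
 * Sketch (not the PCI core) of the stricter guard discussed in the
 * thread: refuse a second MSI-X enable instead of warning and
 * continuing, so a buggy driver cannot reconfigure vectors while
 * they are already enabled.
 */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct dev_model {
	bool msix_enabled;
};

static int enable_msix_guarded(struct dev_model *d)
{
	if (d->msix_enabled) {
		fprintf(stderr, "warning: MSI-X already enabled\n");
		return -EINVAL;          /* refuse, rather than reconfigure */
	}
	d->msix_enabled = true;
	return 0;
}

static void disable_msix(struct dev_model *d)
{
	d->msix_enabled = false;
}
```

The thread concluded the warning alone is probably enough once pirqs are always freshly allocated, since the worst case is only some extra pirq consumption; this sketch just shows what the stricter alternative would look like.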
* [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-13 20:00 ` Boris Ostrovsky @ 2017-01-13 20:07 ` Dan Streetman 2017-01-13 20:54 ` Stefano Stabellini ` (2 more replies) 2017-01-13 20:13 ` Dan Streetman 1 sibling, 3 replies; 34+ messages in thread From: Dan Streetman @ 2017-01-13 20:07 UTC (permalink / raw) To: Stefano Stabellini, jgross, Konrad Rzeszutek Wilk, Boris Ostrovsky Cc: Dan Streetman, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci Revert the main part of commit: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") That commit introduced reading the pci device's msi message data to see if a pirq was previously configured for the device's msi/msix, and re-use that pirq. At the time, that was the correct behavior. However, a later change to Qemu caused it to call into the Xen hypervisor to unmap all pirqs for a pci device, when the pci device disables its MSI/MSIX vectors; specifically the Qemu commit: c976437c7dba9c7444fb41df45468968aaa326ad ("qemu-xen: free all the pirqs for msi/msix when driver unload") Once Qemu added this pirq unmapping, it was no longer correct for the kernel to re-use the pirq number cached in the pci device msi message data. All Qemu releases since 2.1.0 contain the patch that unmaps the pirqs when the pci device disables its MSI/MSIX vectors. This bug is causing failures to initialize multiple NVMe controllers under Xen, because the NVMe driver sets up a single MSIX vector for each controller (concurrently), and then after using that to talk to the controller for some configuration data, it disables the single MSIX vector and re-configures all the MSIX vectors it needs. So the MSIX setup code tries to re-use the cached pirq from the first vector for each controller, but the hypervisor has already given away that pirq to another controller, and its initialization fails. 
This is discussed in more detail at: https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") Signed-off-by: Dan Streetman <dan.streetman@canonical.com> --- arch/x86/pci/xen.c | 23 +++++++---------------- 1 file changed, 7 insertions(+), 16 deletions(-) diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c index bedfab9..a00a6c0 100644 --- a/arch/x86/pci/xen.c +++ b/arch/x86/pci/xen.c @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) return 1; for_each_pci_msi_entry(msidesc, dev) { - __pci_read_msi_msg(msidesc, &msg); - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); - if (msg.data != XEN_PIRQ_MSI_DATA || - xen_irq_from_pirq(pirq) < 0) { - pirq = xen_allocate_pirq_msi(dev, msidesc); - if (pirq < 0) { - irq = -ENODEV; - goto error; - } - xen_msi_compose_msg(dev, pirq, &msg); - __pci_write_msi_msg(msidesc, &msg); - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); - } else { - dev_dbg(&dev->dev, - "xen: msi already bound to pirq=%d\n", pirq); + pirq = xen_allocate_pirq_msi(dev, msidesc); + if (pirq < 0) { + irq = -ENODEV; + goto error; } + xen_msi_compose_msg(dev, pirq, &msg); + __pci_write_msi_msg(msidesc, &msg); + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, (type == PCI_CAP_ID_MSI) ? nvec : 1, (type == PCI_CAP_ID_MSIX) ? -- 2.9.3 ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-13 20:07 ` Dan Streetman @ 2017-01-13 20:54 ` Stefano Stabellini 2017-01-13 21:49 ` Boris Ostrovsky 2017-01-13 22:30 ` Konrad Rzeszutek Wilk 2 siblings, 0 replies; 34+ messages in thread From: Stefano Stabellini @ 2017-01-13 20:54 UTC (permalink / raw) To: Dan Streetman Cc: Stefano Stabellini, jgross, Konrad Rzeszutek Wilk, Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On Fri, 13 Jan 2017, Dan Streetman wrote: > Revert the main part of commit: > af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > > That commit introduced reading the pci device's msi message data to see > if a pirq was previously configured for the device's msi/msix, and re-use > that pirq. At the time, that was the correct behavior. However, a > later change to Qemu caused it to call into the Xen hypervisor to unmap > all pirqs for a pci device, when the pci device disables its MSI/MSIX > vectors; specifically the Qemu commit: > c976437c7dba9c7444fb41df45468968aaa326ad > ("qemu-xen: free all the pirqs for msi/msix when driver unload") > > Once Qemu added this pirq unmapping, it was no longer correct for the > kernel to re-use the pirq number cached in the pci device msi message > data. All Qemu releases since 2.1.0 contain the patch that unmaps the > pirqs when the pci device disables its MSI/MSIX vectors. > > This bug is causing failures to initialize multiple NVMe controllers > under Xen, because the NVMe driver sets up a single MSIX vector for > each controller (concurrently), and then after using that to talk to > the controller for some configuration data, it disables the single MSIX > vector and re-configures all the MSIX vectors it needs. So the MSIX > setup code tries to re-use the cached pirq from the first vector > for each controller, but the hypervisor has already given away that > pirq to another controller, and its initialization fails. 
> > This is discussed in more detail at: > https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html > > Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > Signed-off-by: Dan Streetman <dan.streetman@canonical.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> > --- > arch/x86/pci/xen.c | 23 +++++++---------------- > 1 file changed, 7 insertions(+), 16 deletions(-) > > diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c > index bedfab9..a00a6c0 100644 > --- a/arch/x86/pci/xen.c > +++ b/arch/x86/pci/xen.c > @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) > return 1; > > for_each_pci_msi_entry(msidesc, dev) { > - __pci_read_msi_msg(msidesc, &msg); > - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | > - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); > - if (msg.data != XEN_PIRQ_MSI_DATA || > - xen_irq_from_pirq(pirq) < 0) { > - pirq = xen_allocate_pirq_msi(dev, msidesc); > - if (pirq < 0) { > - irq = -ENODEV; > - goto error; > - } > - xen_msi_compose_msg(dev, pirq, &msg); > - __pci_write_msi_msg(msidesc, &msg); > - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); > - } else { > - dev_dbg(&dev->dev, > - "xen: msi already bound to pirq=%d\n", pirq); > + pirq = xen_allocate_pirq_msi(dev, msidesc); > + if (pirq < 0) { > + irq = -ENODEV; > + goto error; > } > + xen_msi_compose_msg(dev, pirq, &msg); > + __pci_write_msi_msg(msidesc, &msg); > + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); > irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, > (type == PCI_CAP_ID_MSI) ? nvec : 1, > (type == PCI_CAP_ID_MSIX) ? > -- > 2.9.3 > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-13 20:07 ` Dan Streetman 2017-01-13 20:54 ` Stefano Stabellini @ 2017-01-13 21:49 ` Boris Ostrovsky 2017-01-13 22:30 ` Konrad Rzeszutek Wilk 2 siblings, 0 replies; 34+ messages in thread From: Boris Ostrovsky @ 2017-01-13 21:49 UTC (permalink / raw) To: Dan Streetman, Stefano Stabellini, jgross, Konrad Rzeszutek Wilk Cc: Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On 01/13/2017 03:07 PM, Dan Streetman wrote: > Revert the main part of commit: > af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > > That commit introduced reading the pci device's msi message data to see > if a pirq was previously configured for the device's msi/msix, and re-use > that pirq. At the time, that was the correct behavior. However, a > later change to Qemu caused it to call into the Xen hypervisor to unmap > all pirqs for a pci device, when the pci device disables its MSI/MSIX > vectors; specifically the Qemu commit: > c976437c7dba9c7444fb41df45468968aaa326ad > ("qemu-xen: free all the pirqs for msi/msix when driver unload") > > Once Qemu added this pirq unmapping, it was no longer correct for the > kernel to re-use the pirq number cached in the pci device msi message > data. All Qemu releases since 2.1.0 contain the patch that unmaps the > pirqs when the pci device disables its MSI/MSIX vectors. > > This bug is causing failures to initialize multiple NVMe controllers > under Xen, because the NVMe driver sets up a single MSIX vector for > each controller (concurrently), and then after using that to talk to > the controller for some configuration data, it disables the single MSIX > vector and re-configures all the MSIX vectors it needs. So the MSIX > setup code tries to re-use the cached pirq from the first vector > for each controller, but the hypervisor has already given away that > pirq to another controller, and its initialization fails. 
> > This is discussed in more detail at: > https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html > > Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > Signed-off-by: Dan Streetman <dan.streetman@canonical.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> > --- > arch/x86/pci/xen.c | 23 +++++++---------------- > 1 file changed, 7 insertions(+), 16 deletions(-) > > diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c > index bedfab9..a00a6c0 100644 > --- a/arch/x86/pci/xen.c > +++ b/arch/x86/pci/xen.c > @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) > return 1; > > for_each_pci_msi_entry(msidesc, dev) { > - __pci_read_msi_msg(msidesc, &msg); > - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | > - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); > - if (msg.data != XEN_PIRQ_MSI_DATA || > - xen_irq_from_pirq(pirq) < 0) { > - pirq = xen_allocate_pirq_msi(dev, msidesc); > - if (pirq < 0) { > - irq = -ENODEV; > - goto error; > - } > - xen_msi_compose_msg(dev, pirq, &msg); > - __pci_write_msi_msg(msidesc, &msg); > - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); > - } else { > - dev_dbg(&dev->dev, > - "xen: msi already bound to pirq=%d\n", pirq); > + pirq = xen_allocate_pirq_msi(dev, msidesc); > + if (pirq < 0) { > + irq = -ENODEV; > + goto error; > } > + xen_msi_compose_msg(dev, pirq, &msg); > + __pci_write_msi_msg(msidesc, &msg); > + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); > irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, > (type == PCI_CAP_ID_MSI) ? nvec : 1, > (type == PCI_CAP_ID_MSIX) ? ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-13 20:07 ` Dan Streetman 2017-01-13 20:54 ` Stefano Stabellini 2017-01-13 21:49 ` Boris Ostrovsky @ 2017-01-13 22:30 ` Konrad Rzeszutek Wilk 2017-02-21 15:31 ` Dan Streetman 2 siblings, 1 reply; 34+ messages in thread From: Konrad Rzeszutek Wilk @ 2017-01-13 22:30 UTC (permalink / raw) To: Dan Streetman Cc: Stefano Stabellini, jgross, Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote: > Revert the main part of commit: > af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > > That commit introduced reading the pci device's msi message data to see > if a pirq was previously configured for the device's msi/msix, and re-use > that pirq. At the time, that was the correct behavior. However, a > later change to Qemu caused it to call into the Xen hypervisor to unmap > all pirqs for a pci device, when the pci device disables its MSI/MSIX > vectors; specifically the Qemu commit: > c976437c7dba9c7444fb41df45468968aaa326ad > ("qemu-xen: free all the pirqs for msi/msix when driver unload") > > Once Qemu added this pirq unmapping, it was no longer correct for the > kernel to re-use the pirq number cached in the pci device msi message > data. All Qemu releases since 2.1.0 contain the patch that unmaps the > pirqs when the pci device disables its MSI/MSIX vectors. > > This bug is causing failures to initialize multiple NVMe controllers > under Xen, because the NVMe driver sets up a single MSIX vector for > each controller (concurrently), and then after using that to talk to > the controller for some configuration data, it disables the single MSIX > vector and re-configures all the MSIX vectors it needs. 
So the MSIX > setup code tries to re-use the cached pirq from the first vector > for each controller, but the hypervisor has already given away that > pirq to another controller, and its initialization fails. > > This is discussed in more detail at: > https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html > > Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > Signed-off-by: Dan Streetman <dan.streetman@canonical.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-13 22:30 ` Konrad Rzeszutek Wilk @ 2017-02-21 15:31 ` Dan Streetman 2017-02-21 15:45 ` Juergen Gross 0 siblings, 1 reply; 34+ messages in thread From: Dan Streetman @ 2017-02-21 15:31 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Stefano Stabellini, jgross, Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote: > On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote: >> Revert the main part of commit: >> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") >> >> That commit introduced reading the pci device's msi message data to see >> if a pirq was previously configured for the device's msi/msix, and re-use >> that pirq. At the time, that was the correct behavior. However, a >> later change to Qemu caused it to call into the Xen hypervisor to unmap >> all pirqs for a pci device, when the pci device disables its MSI/MSIX >> vectors; specifically the Qemu commit: >> c976437c7dba9c7444fb41df45468968aaa326ad >> ("qemu-xen: free all the pirqs for msi/msix when driver unload") >> >> Once Qemu added this pirq unmapping, it was no longer correct for the >> kernel to re-use the pirq number cached in the pci device msi message >> data. All Qemu releases since 2.1.0 contain the patch that unmaps the >> pirqs when the pci device disables its MSI/MSIX vectors. >> >> This bug is causing failures to initialize multiple NVMe controllers >> under Xen, because the NVMe driver sets up a single MSIX vector for >> each controller (concurrently), and then after using that to talk to >> the controller for some configuration data, it disables the single MSIX >> vector and re-configures all the MSIX vectors it needs. 
So the MSIX >> setup code tries to re-use the cached pirq from the first vector >> for each controller, but the hypervisor has already given away that >> pirq to another controller, and its initialization fails. >> >> This is discussed in more detail at: >> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html >> >> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") >> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> > > Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> This doesn't seem to be applied yet, is it still waiting on another ack? Or maybe I'm looking at the wrong git tree... ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-02-21 15:31 ` Dan Streetman @ 2017-02-21 15:45 ` Juergen Gross 2017-02-21 15:58 ` Boris Ostrovsky 0 siblings, 1 reply; 34+ messages in thread From: Juergen Gross @ 2017-02-21 15:45 UTC (permalink / raw) To: Dan Streetman, Konrad Rzeszutek Wilk Cc: Stefano Stabellini, Boris Ostrovsky, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On 21/02/17 16:31, Dan Streetman wrote: > On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk > <konrad.wilk@oracle.com> wrote: >> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote: >>> Revert the main part of commit: >>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") >>> >>> That commit introduced reading the pci device's msi message data to see >>> if a pirq was previously configured for the device's msi/msix, and re-use >>> that pirq. At the time, that was the correct behavior. However, a >>> later change to Qemu caused it to call into the Xen hypervisor to unmap >>> all pirqs for a pci device, when the pci device disables its MSI/MSIX >>> vectors; specifically the Qemu commit: >>> c976437c7dba9c7444fb41df45468968aaa326ad >>> ("qemu-xen: free all the pirqs for msi/msix when driver unload") >>> >>> Once Qemu added this pirq unmapping, it was no longer correct for the >>> kernel to re-use the pirq number cached in the pci device msi message >>> data. All Qemu releases since 2.1.0 contain the patch that unmaps the >>> pirqs when the pci device disables its MSI/MSIX vectors. >>> >>> This bug is causing failures to initialize multiple NVMe controllers >>> under Xen, because the NVMe driver sets up a single MSIX vector for >>> each controller (concurrently), and then after using that to talk to >>> the controller for some configuration data, it disables the single MSIX >>> vector and re-configures all the MSIX vectors it needs. 
So the MSIX >>> setup code tries to re-use the cached pirq from the first vector >>> for each controller, but the hypervisor has already given away that >>> pirq to another controller, and its initialization fails. >>> >>> This is discussed in more detail at: >>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html >>> >>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> >> >> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > This doesn't seem to be applied yet, is it still waiting on another > ack? Or maybe I'm looking at the wrong git tree... Am I wrong or shouldn't this go through the PCI tree? Konrad? Juergen ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-02-21 15:45 ` Juergen Gross @ 2017-02-21 15:58 ` Boris Ostrovsky 2017-02-22 14:28 ` Bjorn Helgaas 0 siblings, 1 reply; 34+ messages in thread From: Boris Ostrovsky @ 2017-02-21 15:58 UTC (permalink / raw) To: Juergen Gross, Dan Streetman, Konrad Rzeszutek Wilk Cc: Stefano Stabellini, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On 02/21/2017 10:45 AM, Juergen Gross wrote: > On 21/02/17 16:31, Dan Streetman wrote: >> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk >> <konrad.wilk@oracle.com> wrote: >>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote: >>>> Revert the main part of commit: >>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") >>>> >>>> That commit introduced reading the pci device's msi message data to see >>>> if a pirq was previously configured for the device's msi/msix, and re-use >>>> that pirq. At the time, that was the correct behavior. However, a >>>> later change to Qemu caused it to call into the Xen hypervisor to unmap >>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX >>>> vectors; specifically the Qemu commit: >>>> c976437c7dba9c7444fb41df45468968aaa326ad >>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload") >>>> >>>> Once Qemu added this pirq unmapping, it was no longer correct for the >>>> kernel to re-use the pirq number cached in the pci device msi message >>>> data. All Qemu releases since 2.1.0 contain the patch that unmaps the >>>> pirqs when the pci device disables its MSI/MSIX vectors. >>>> >>>> This bug is causing failures to initialize multiple NVMe controllers >>>> under Xen, because the NVMe driver sets up a single MSIX vector for >>>> each controller (concurrently), and then after using that to talk to >>>> the controller for some configuration data, it disables the single MSIX >>>> vector and re-configures all the MSIX vectors it needs. 
So the MSIX >>>> setup code tries to re-use the cached pirq from the first vector >>>> for each controller, but the hypervisor has already given away that >>>> pirq to another controller, and its initialization fails. >>>> >>>> This is discussed in more detail at: >>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html >>>> >>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") >>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> >>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >> This doesn't seem to be applied yet, is it still waiting on another >> ack? Or maybe I'm looking at the wrong git tree... > Am I wrong or shouldn't this go through the PCI tree? Konrad? Konrad is away this week but since pull request for Xen tree just went out we should probably wait until rc1 anyway (unless something big comes up before that). -boris ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-02-21 15:58 ` Boris Ostrovsky @ 2017-02-22 14:28 ` Bjorn Helgaas 2017-02-22 15:14 ` Boris Ostrovsky 0 siblings, 1 reply; 34+ messages in thread From: Bjorn Helgaas @ 2017-02-22 14:28 UTC (permalink / raw) To: Boris Ostrovsky Cc: Juergen Gross, Dan Streetman, Konrad Rzeszutek Wilk, Stefano Stabellini, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote: > On 02/21/2017 10:45 AM, Juergen Gross wrote: > > On 21/02/17 16:31, Dan Streetman wrote: > >> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk > >> <konrad.wilk@oracle.com> wrote: > >>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote: > >>>> Revert the main part of commit: > >>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > >>>> > >>>> That commit introduced reading the pci device's msi message data to see > >>>> if a pirq was previously configured for the device's msi/msix, and re-use > >>>> that pirq. At the time, that was the correct behavior. However, a > >>>> later change to Qemu caused it to call into the Xen hypervisor to unmap > >>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX > >>>> vectors; specifically the Qemu commit: > >>>> c976437c7dba9c7444fb41df45468968aaa326ad > >>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload") > >>>> > >>>> Once Qemu added this pirq unmapping, it was no longer correct for the > >>>> kernel to re-use the pirq number cached in the pci device msi message > >>>> data. All Qemu releases since 2.1.0 contain the patch that unmaps the > >>>> pirqs when the pci device disables its MSI/MSIX vectors. 
> >>>> > >>>> This bug is causing failures to initialize multiple NVMe controllers > >>>> under Xen, because the NVMe driver sets up a single MSIX vector for > >>>> each controller (concurrently), and then after using that to talk to > >>>> the controller for some configuration data, it disables the single MSIX > >>>> vector and re-configures all the MSIX vectors it needs. So the MSIX > >>>> setup code tries to re-use the cached pirq from the first vector > >>>> for each controller, but the hypervisor has already given away that > >>>> pirq to another controller, and its initialization fails. > >>>> > >>>> This is discussed in more detail at: > >>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html > >>>> > >>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > >>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> > >>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > >> This doesn't seem to be applied yet, is it still waiting on another > >> ack? Or maybe I'm looking at the wrong git tree... > > Am I wrong or shouldn't this go through the PCI tree? Konrad? > > Konrad is away this week but since pull request for Xen tree just went > out we should probably wait until rc1 anyway (unless something big comes > up before that). I assume this should go via the Xen or x86 tree, since that's how most arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a. If you think otherwise, let me know. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-02-22 14:28 ` Bjorn Helgaas @ 2017-02-22 15:14 ` Boris Ostrovsky 2017-05-03 18:19 ` [Xen-devel] " David Woodhouse 0 siblings, 1 reply; 34+ messages in thread From: Boris Ostrovsky @ 2017-02-22 15:14 UTC (permalink / raw) To: Bjorn Helgaas Cc: Juergen Gross, Dan Streetman, Konrad Rzeszutek Wilk, Stefano Stabellini, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On 02/22/2017 09:28 AM, Bjorn Helgaas wrote: > On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote: >> On 02/21/2017 10:45 AM, Juergen Gross wrote: >>> On 21/02/17 16:31, Dan Streetman wrote: >>>> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk >>>> <konrad.wilk@oracle.com> wrote: >>>>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote: >>>>>> Revert the main part of commit: >>>>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") >>>>>> >>>>>> That commit introduced reading the pci device's msi message data to see >>>>>> if a pirq was previously configured for the device's msi/msix, and re-use >>>>>> that pirq. At the time, that was the correct behavior. However, a >>>>>> later change to Qemu caused it to call into the Xen hypervisor to unmap >>>>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX >>>>>> vectors; specifically the Qemu commit: >>>>>> c976437c7dba9c7444fb41df45468968aaa326ad >>>>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload") >>>>>> >>>>>> Once Qemu added this pirq unmapping, it was no longer correct for the >>>>>> kernel to re-use the pirq number cached in the pci device msi message >>>>>> data. All Qemu releases since 2.1.0 contain the patch that unmaps the >>>>>> pirqs when the pci device disables its MSI/MSIX vectors. 
>>>>>> >>>>>> This bug is causing failures to initialize multiple NVMe controllers >>>>>> under Xen, because the NVMe driver sets up a single MSIX vector for >>>>>> each controller (concurrently), and then after using that to talk to >>>>>> the controller for some configuration data, it disables the single MSIX >>>>>> vector and re-configures all the MSIX vectors it needs. So the MSIX >>>>>> setup code tries to re-use the cached pirq from the first vector >>>>>> for each controller, but the hypervisor has already given away that >>>>>> pirq to another controller, and its initialization fails. >>>>>> >>>>>> This is discussed in more detail at: >>>>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html >>>>>> >>>>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") >>>>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> >>>>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >>>> This doesn't seem to be applied yet, is it still waiting on another >>>> ack? Or maybe I'm looking at the wrong git tree... >>> Am I wrong or shouldn't this go through the PCI tree? Konrad? >> Konrad is away this week but since pull request for Xen tree just went >> out we should probably wait until rc1 anyway (unless something big comes >> up before that). > I assume this should go via the Xen or x86 tree, since that's how most > arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a. > If you think otherwise, let me know. OK, I applied it to Xen tree's for-linus-4.11. -boris ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-02-22 15:14 ` Boris Ostrovsky @ 2017-05-03 18:19 ` David Woodhouse 2017-05-03 18:43 ` Boris Ostrovsky 0 siblings, 1 reply; 34+ messages in thread From: David Woodhouse @ 2017-05-03 18:19 UTC (permalink / raw) To: Boris Ostrovsky, Bjorn Helgaas, stable Cc: Juergen Gross, Stefano Stabellini, linux-pci, linux-kernel, Bjorn Helgaas, xen-devel, Dan Streetman, Dan Streetman [-- Attachment #1: Type: text/plain, Size: 3429 bytes --] On Wed, 2017-02-22 at 10:14 -0500, Boris Ostrovsky wrote: > On 02/22/2017 09:28 AM, Bjorn Helgaas wrote: > > > > On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote: > > > > > > On 02/21/2017 10:45 AM, Juergen Gross wrote: > > > > > > > > On 21/02/17 16:31, Dan Streetman wrote: > > > > > > > > > > On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk > > > > > <konrad.wilk@oracle.com> wrote: > > > > > > > > > > > > On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote: > > > > > > > > > > > > > > Revert the main part of commit: > > > > > > > af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > > > > > > > > > > > > > > That commit introduced reading the pci device's msi message data to see > > > > > > > if a pirq was previously configured for the device's msi/msix, and re-use > > > > > > > that pirq. At the time, that was the correct behavior. However, a > > > > > > > later change to Qemu caused it to call into the Xen hypervisor to unmap > > > > > > > all pirqs for a pci device, when the pci device disables its MSI/MSIX > > > > > > > vectors; specifically the Qemu commit: > > > > > > > c976437c7dba9c7444fb41df45468968aaa326ad > > > > > > > ("qemu-xen: free all the pirqs for msi/msix when driver unload") > > > > > > > > > > > > > > Once Qemu added this pirq unmapping, it was no longer correct for the > > > > > > > kernel to re-use the pirq number cached in the pci device msi message > > > > > > > data. 
All Qemu releases since 2.1.0 contain the patch that unmaps the > > > > > > > pirqs when the pci device disables its MSI/MSIX vectors. > > > > > > > > > > > > > > This bug is causing failures to initialize multiple NVMe controllers > > > > > > > under Xen, because the NVMe driver sets up a single MSIX vector for > > > > > > > each controller (concurrently), and then after using that to talk to > > > > > > > the controller for some configuration data, it disables the single MSIX > > > > > > > vector and re-configures all the MSIX vectors it needs. So the MSIX > > > > > > > setup code tries to re-use the cached pirq from the first vector > > > > > > > for each controller, but the hypervisor has already given away that > > > > > > > pirq to another controller, and its initialization fails. > > > > > > > > > > > > > > This is discussed in more detail at: > > > > > > > https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html > > > > > > > > > > > > > > Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > > > > > > > Signed-off-by: Dan Streetman <dan.streetman@canonical.com> > > > > > > Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > > > > This doesn't seem to be applied yet, is it still waiting on another > > > > > ack? Or maybe I'm looking at the wrong git tree... > > > > Am I wrong or shouldn't this go through the PCI tree? Konrad? > > > Konrad is away this week but since pull request for Xen tree just went > > > out we should probably wait until rc1 anyway (unless something big comes > > > up before that). > > I assume this should go via the Xen or x86 tree, since that's how most > > arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a. > > If you think otherwise, let me know. > > OK, I applied it to Xen tree's for-linus-4.11. Hm, we want this (c74fd80f2f4) in stable too, don't we? 
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-05-03 18:19 ` [Xen-devel] " David Woodhouse @ 2017-05-03 18:43 ` Boris Ostrovsky 2017-05-03 22:59 ` Stefano Stabellini 0 siblings, 1 reply; 34+ messages in thread From: Boris Ostrovsky @ 2017-05-03 18:43 UTC (permalink / raw) To: David Woodhouse, Bjorn Helgaas, stable, Stefano Stabellini Cc: Juergen Gross, linux-pci, linux-kernel, Bjorn Helgaas, xen-devel, Dan Streetman, Dan Streetman On 05/03/2017 02:19 PM, David Woodhouse wrote: > On Wed, 2017-02-22 at 10:14 -0500, Boris Ostrovsky wrote: >> On 02/22/2017 09:28 AM, Bjorn Helgaas wrote: >>> On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote: >>>> On 02/21/2017 10:45 AM, Juergen Gross wrote: >>>>> On 21/02/17 16:31, Dan Streetman wrote: >>>>>> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk >>>>>> <konrad.wilk@oracle.com> wrote: >>>>>>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote: >>>>>>>> Revert the main part of commit: >>>>>>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") >>>>>>>> >>>>>>>> That commit introduced reading the pci device's msi message data to see >>>>>>>> if a pirq was previously configured for the device's msi/msix, and re-use >>>>>>>> that pirq. At the time, that was the correct behavior. However, a >>>>>>>> later change to Qemu caused it to call into the Xen hypervisor to unmap >>>>>>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX >>>>>>>> vectors; specifically the Qemu commit: >>>>>>>> c976437c7dba9c7444fb41df45468968aaa326ad >>>>>>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload") >>>>>>>> >>>>>>>> Once Qemu added this pirq unmapping, it was no longer correct for the >>>>>>>> kernel to re-use the pirq number cached in the pci device msi message >>>>>>>> data. All Qemu releases since 2.1.0 contain the patch that unmaps the >>>>>>>> pirqs when the pci device disables its MSI/MSIX vectors. 
>>>>>>>> >>>>>>>> This bug is causing failures to initialize multiple NVMe controllers >>>>>>>> under Xen, because the NVMe driver sets up a single MSIX vector for >>>>>>>> each controller (concurrently), and then after using that to talk to >>>>>>>> the controller for some configuration data, it disables the single MSIX >>>>>>>> vector and re-configures all the MSIX vectors it needs. So the MSIX >>>>>>>> setup code tries to re-use the cached pirq from the first vector >>>>>>>> for each controller, but the hypervisor has already given away that >>>>>>>> pirq to another controller, and its initialization fails. >>>>>>>> >>>>>>>> This is discussed in more detail at: >>>>>>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html >>>>>>>> >>>>>>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") >>>>>>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> >>>>>>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >>>>>> This doesn't seem to be applied yet, is it still waiting on another >>>>>> ack? Or maybe I'm looking at the wrong git tree... >>>>> Am I wrong or shouldn't this go through the PCI tree? Konrad? >>>> Konrad is away this week but since pull request for Xen tree just went >>>> out we should probably wait until rc1 anyway (unless something big comes >>>> up before that). >>> I assume this should go via the Xen or x86 tree, since that's how most >>> arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a. >>> If you think otherwise, let me know. >> OK, I applied it to Xen tree's for-linus-4.11. > Hm, we want this (c74fd80f2f4) in stable too, don't we? Maybe. Per https://lists.xen.org/archives/html/xen-devel/2017-01/msg00987.html it may break things on older (4.4-) hypervisors. They are out of support, which is why this patch went in now but I am not sure this automatically applies to stable kernels. Stefano? -boris ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-05-03 18:43 ` Boris Ostrovsky @ 2017-05-03 22:59 ` Stefano Stabellini 2017-05-03 23:06 ` Greg KH 0 siblings, 1 reply; 34+ messages in thread From: Stefano Stabellini @ 2017-05-03 22:59 UTC (permalink / raw) To: Boris Ostrovsky Cc: David Woodhouse, Bjorn Helgaas, stable, Stefano Stabellini, Juergen Gross, linux-pci, linux-kernel, Bjorn Helgaas, xen-devel, Dan Streetman, Dan Streetman On Wed, 3 May 2017, Boris Ostrovsky wrote: > On 05/03/2017 02:19 PM, David Woodhouse wrote: > > On Wed, 2017-02-22 at 10:14 -0500, Boris Ostrovsky wrote: > >> On 02/22/2017 09:28 AM, Bjorn Helgaas wrote: > >>> On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote: > >>>> On 02/21/2017 10:45 AM, Juergen Gross wrote: > >>>>> On 21/02/17 16:31, Dan Streetman wrote: > >>>>>> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk > >>>>>> <konrad.wilk@oracle.com> wrote: > >>>>>>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote: > >>>>>>>> Revert the main part of commit: > >>>>>>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > >>>>>>>> > >>>>>>>> That commit introduced reading the pci device's msi message data to see > >>>>>>>> if a pirq was previously configured for the device's msi/msix, and re-use > >>>>>>>> that pirq. At the time, that was the correct behavior. However, a > >>>>>>>> later change to Qemu caused it to call into the Xen hypervisor to unmap > >>>>>>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX > >>>>>>>> vectors; specifically the Qemu commit: > >>>>>>>> c976437c7dba9c7444fb41df45468968aaa326ad > >>>>>>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload") > >>>>>>>> > >>>>>>>> Once Qemu added this pirq unmapping, it was no longer correct for the > >>>>>>>> kernel to re-use the pirq number cached in the pci device msi message > >>>>>>>> data. 
All Qemu releases since 2.1.0 contain the patch that unmaps the > >>>>>>>> pirqs when the pci device disables its MSI/MSIX vectors. > >>>>>>>> > >>>>>>>> This bug is causing failures to initialize multiple NVMe controllers > >>>>>>>> under Xen, because the NVMe driver sets up a single MSIX vector for > >>>>>>>> each controller (concurrently), and then after using that to talk to > >>>>>>>> the controller for some configuration data, it disables the single MSIX > >>>>>>>> vector and re-configures all the MSIX vectors it needs. So the MSIX > >>>>>>>> setup code tries to re-use the cached pirq from the first vector > >>>>>>>> for each controller, but the hypervisor has already given away that > >>>>>>>> pirq to another controller, and its initialization fails. > >>>>>>>> > >>>>>>>> This is discussed in more detail at: > >>>>>>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html > >>>>>>>> > >>>>>>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > >>>>>>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> > >>>>>>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > >>>>>> This doesn't seem to be applied yet, is it still waiting on another > >>>>>> ack? Or maybe I'm looking at the wrong git tree... > >>>>> Am I wrong or shouldn't this go through the PCI tree? Konrad? > >>>> Konrad is away this week but since pull request for Xen tree just went > >>>> out we should probably wait until rc1 anyway (unless something big comes > >>>> up before that). > >>> I assume this should go via the Xen or x86 tree, since that's how most > >>> arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a. > >>> If you think otherwise, let me know. > >> OK, I applied it to Xen tree's for-linus-4.11. > > Hm, we want this (c74fd80f2f4) in stable too, don't we? > > > Maybe. > > Per https://lists.xen.org/archives/html/xen-devel/2017-01/msg00987.html > it may break things on older (4.4-) hypervisors. 
They are out of > support, which is why this patch went in now but I am not sure this > automatically applies to stable kernels. > > Stefano? This is a difficult call. We could just say that all the broken Xen versions are out of support, so let's fix all the Linux kernel stable trees that we can. Or we could give a look at the release dates. Linux 3.18 is still maintained and was tagged on Dec 7 2014. Xen 4.4 was tagged on Mar 10 2014. We could ask a backport for [4.5+], which was released 2 years after Xen 4.4. Of course, it is still arbitrary but aims at minimizing breakages. What do you think? ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-05-03 22:59 ` Stefano Stabellini @ 2017-05-03 23:06 ` Greg KH 2017-05-03 23:12 ` Stefano Stabellini 0 siblings, 1 reply; 34+ messages in thread From: Greg KH @ 2017-05-03 23:06 UTC (permalink / raw) To: Stefano Stabellini Cc: Boris Ostrovsky, David Woodhouse, Bjorn Helgaas, stable, Juergen Gross, linux-pci, linux-kernel, Bjorn Helgaas, xen-devel, Dan Streetman, Dan Streetman On Wed, May 03, 2017 at 03:59:15PM -0700, Stefano Stabellini wrote: > On Wed, 3 May 2017, Boris Ostrovsky wrote: > > On 05/03/2017 02:19 PM, David Woodhouse wrote: > > > On Wed, 2017-02-22 at 10:14 -0500, Boris Ostrovsky wrote: > > >> On 02/22/2017 09:28 AM, Bjorn Helgaas wrote: > > >>> On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote: > > >>>> On 02/21/2017 10:45 AM, Juergen Gross wrote: > > >>>>> On 21/02/17 16:31, Dan Streetman wrote: > > >>>>>> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk > > >>>>>> <konrad.wilk@oracle.com> wrote: > > >>>>>>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote: > > >>>>>>>> Revert the main part of commit: > > >>>>>>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > > >>>>>>>> > > >>>>>>>> That commit introduced reading the pci device's msi message data to see > > >>>>>>>> if a pirq was previously configured for the device's msi/msix, and re-use > > >>>>>>>> that pirq. At the time, that was the correct behavior. 
However, a > > >>>>>>>> later change to Qemu caused it to call into the Xen hypervisor to unmap > > >>>>>>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX > > >>>>>>>> vectors; specifically the Qemu commit: > > >>>>>>>> c976437c7dba9c7444fb41df45468968aaa326ad > > >>>>>>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload") > > >>>>>>>> > > >>>>>>>> Once Qemu added this pirq unmapping, it was no longer correct for the > > >>>>>>>> kernel to re-use the pirq number cached in the pci device msi message > > >>>>>>>> data. All Qemu releases since 2.1.0 contain the patch that unmaps the > > >>>>>>>> pirqs when the pci device disables its MSI/MSIX vectors. > > >>>>>>>> > > >>>>>>>> This bug is causing failures to initialize multiple NVMe controllers > > >>>>>>>> under Xen, because the NVMe driver sets up a single MSIX vector for > > >>>>>>>> each controller (concurrently), and then after using that to talk to > > >>>>>>>> the controller for some configuration data, it disables the single MSIX > > >>>>>>>> vector and re-configures all the MSIX vectors it needs. So the MSIX > > >>>>>>>> setup code tries to re-use the cached pirq from the first vector > > >>>>>>>> for each controller, but the hypervisor has already given away that > > >>>>>>>> pirq to another controller, and its initialization fails. > > >>>>>>>> > > >>>>>>>> This is discussed in more detail at: > > >>>>>>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html > > >>>>>>>> > > >>>>>>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > > >>>>>>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> > > >>>>>>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > >>>>>> This doesn't seem to be applied yet, is it still waiting on another > > >>>>>> ack? Or maybe I'm looking at the wrong git tree... > > >>>>> Am I wrong or shouldn't this go through the PCI tree? Konrad? 
> > >>>> Konrad is away this week but since pull request for Xen tree just went > > >>>> out we should probably wait until rc1 anyway (unless something big comes > > >>>> up before that). > > >>> I assume this should go via the Xen or x86 tree, since that's how most > > >>> arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a. > > >>> If you think otherwise, let me know. > > >> OK, I applied it to Xen tree's for-linus-4.11. > > > Hm, we want this (c74fd80f2f4) in stable too, don't we? > > > > > > Maybe. > > > > Per https://lists.xen.org/archives/html/xen-devel/2017-01/msg00987.html > > it may break things on older (4.4-) hypervisors. They are out of > > support, which is why this patch went in now but I am not sure this > > automatically applies to stable kernels. > > > > Stefano? > > This is a difficult call. We could just say that all the broken Xen > versions are out of support, so let's fix all the Linux kernel stable > trees that we can. > > Or we could give a look at the release dates. Linux 3.18 is still > maintained and was tagged on Dec 7 2014. Don't do anything "special" for 3.18 if you have to. I'm only semi-maintaining it because some SoC vendors never upstreamed their trees and lots of devices rely on it. None of them use Xen on their platforms, so no need for me to backport any change there. thanks, greg k-h ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-05-03 23:06 ` Greg KH @ 2017-05-03 23:12 ` Stefano Stabellini 2017-05-03 23:19 ` Greg KH 0 siblings, 1 reply; 34+ messages in thread From: Stefano Stabellini @ 2017-05-03 23:12 UTC (permalink / raw) To: Greg KH Cc: Stefano Stabellini, Boris Ostrovsky, David Woodhouse, Bjorn Helgaas, stable, Juergen Gross, linux-pci, linux-kernel, Bjorn Helgaas, xen-devel, Dan Streetman, Dan Streetman On Wed, 3 May 2017, Greg KH wrote: > On Wed, May 03, 2017 at 03:59:15PM -0700, Stefano Stabellini wrote: > > On Wed, 3 May 2017, Boris Ostrovsky wrote: > > > On 05/03/2017 02:19 PM, David Woodhouse wrote: > > > > On Wed, 2017-02-22 at 10:14 -0500, Boris Ostrovsky wrote: > > > >> On 02/22/2017 09:28 AM, Bjorn Helgaas wrote: > > > >>> On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote: > > > >>>> On 02/21/2017 10:45 AM, Juergen Gross wrote: > > > >>>>> On 21/02/17 16:31, Dan Streetman wrote: > > > >>>>>> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk > > > >>>>>> <konrad.wilk@oracle.com> wrote: > > > >>>>>>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote: > > > >>>>>>>> Revert the main part of commit: > > > >>>>>>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > > > >>>>>>>> > > > >>>>>>>> That commit introduced reading the pci device's msi message data to see > > > >>>>>>>> if a pirq was previously configured for the device's msi/msix, and re-use > > > >>>>>>>> that pirq. At the time, that was the correct behavior. 
However, a > > > >>>>>>>> later change to Qemu caused it to call into the Xen hypervisor to unmap > > > >>>>>>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX > > > >>>>>>>> vectors; specifically the Qemu commit: > > > >>>>>>>> c976437c7dba9c7444fb41df45468968aaa326ad > > > >>>>>>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload") > > > >>>>>>>> > > > >>>>>>>> Once Qemu added this pirq unmapping, it was no longer correct for the > > > >>>>>>>> kernel to re-use the pirq number cached in the pci device msi message > > > >>>>>>>> data. All Qemu releases since 2.1.0 contain the patch that unmaps the > > > >>>>>>>> pirqs when the pci device disables its MSI/MSIX vectors. > > > >>>>>>>> > > > >>>>>>>> This bug is causing failures to initialize multiple NVMe controllers > > > >>>>>>>> under Xen, because the NVMe driver sets up a single MSIX vector for > > > >>>>>>>> each controller (concurrently), and then after using that to talk to > > > >>>>>>>> the controller for some configuration data, it disables the single MSIX > > > >>>>>>>> vector and re-configures all the MSIX vectors it needs. So the MSIX > > > >>>>>>>> setup code tries to re-use the cached pirq from the first vector > > > >>>>>>>> for each controller, but the hypervisor has already given away that > > > >>>>>>>> pirq to another controller, and its initialization fails. > > > >>>>>>>> > > > >>>>>>>> This is discussed in more detail at: > > > >>>>>>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html > > > >>>>>>>> > > > >>>>>>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > > > >>>>>>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> > > > >>>>>>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > > >>>>>> This doesn't seem to be applied yet, is it still waiting on another > > > >>>>>> ack? Or maybe I'm looking at the wrong git tree... > > > >>>>> Am I wrong or shouldn't this go through the PCI tree? 
Konrad? > > > >>>> Konrad is away this week but since pull request for Xen tree just went > > > >>>> out we should probably wait until rc1 anyway (unless something big comes > > > >>>> up before that). > > > >>> I assume this should go via the Xen or x86 tree, since that's how most > > > >>> arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a. > > > >>> If you think otherwise, let me know. > > > >> OK, I applied it to Xen tree's for-linus-4.11. > > > > Hm, we want this (c74fd80f2f4) in stable too, don't we? > > > > > > > > > Maybe. > > > > > > Per https://lists.xen.org/archives/html/xen-devel/2017-01/msg00987.html > > > it may break things on older (4.4-) hypervisors. They are out of > > > support, which is why this patch went in now but I am not sure this > > > automatically applies to stable kernels. > > > > > > Stefano? > > > > This is a difficult call. We could just say that all the broken Xen > > versions are out of support, so let's fix all the Linux kernel stable > > trees that we can. > > > > Or we could give a look at the release dates. Linux 3.18 is still > > maintained and was tagged on Dec 7 2014. > > Don't do anything "special" for 3.18 if you have to. I'm only > semi-maintaining it because some SoC vendors never upstreamed their > trees and lots of devices rely on it. None of them use Xen on their > platforms, so no need for me to backport any change there. Thanks Greg, that is good info. Is 4.4 the oldest Linux tree fully maintained? If so, I think we should just backport this fix to all Linux trees >= 4.4, given that Linux 4.4 is from Jan 2016. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-05-03 23:12 ` Stefano Stabellini @ 2017-05-03 23:19 ` Greg KH 0 siblings, 0 replies; 34+ messages in thread From: Greg KH @ 2017-05-03 23:19 UTC (permalink / raw) To: Stefano Stabellini Cc: Boris Ostrovsky, David Woodhouse, Bjorn Helgaas, stable, Juergen Gross, linux-pci, linux-kernel, Bjorn Helgaas, xen-devel, Dan Streetman, Dan Streetman On Wed, May 03, 2017 at 04:12:29PM -0700, Stefano Stabellini wrote: > On Wed, 3 May 2017, Greg KH wrote: > > On Wed, May 03, 2017 at 03:59:15PM -0700, Stefano Stabellini wrote: > > > On Wed, 3 May 2017, Boris Ostrovsky wrote: > > > > On 05/03/2017 02:19 PM, David Woodhouse wrote: > > > > > On Wed, 2017-02-22 at 10:14 -0500, Boris Ostrovsky wrote: > > > > >> On 02/22/2017 09:28 AM, Bjorn Helgaas wrote: > > > > >>> On Tue, Feb 21, 2017 at 10:58:39AM -0500, Boris Ostrovsky wrote: > > > > >>>> On 02/21/2017 10:45 AM, Juergen Gross wrote: > > > > >>>>> On 21/02/17 16:31, Dan Streetman wrote: > > > > >>>>>> On Fri, Jan 13, 2017 at 5:30 PM, Konrad Rzeszutek Wilk > > > > >>>>>> <konrad.wilk@oracle.com> wrote: > > > > >>>>>>> On Fri, Jan 13, 2017 at 03:07:51PM -0500, Dan Streetman wrote: > > > > >>>>>>>> Revert the main part of commit: > > > > >>>>>>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > > > > >>>>>>>> > > > > >>>>>>>> That commit introduced reading the pci device's msi message data to see > > > > >>>>>>>> if a pirq was previously configured for the device's msi/msix, and re-use > > > > >>>>>>>> that pirq. At the time, that was the correct behavior. 
However, a > > > > >>>>>>>> later change to Qemu caused it to call into the Xen hypervisor to unmap > > > > >>>>>>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX > > > > >>>>>>>> vectors; specifically the Qemu commit: > > > > >>>>>>>> c976437c7dba9c7444fb41df45468968aaa326ad > > > > >>>>>>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload") > > > > >>>>>>>> > > > > >>>>>>>> Once Qemu added this pirq unmapping, it was no longer correct for the > > > > >>>>>>>> kernel to re-use the pirq number cached in the pci device msi message > > > > >>>>>>>> data. All Qemu releases since 2.1.0 contain the patch that unmaps the > > > > >>>>>>>> pirqs when the pci device disables its MSI/MSIX vectors. > > > > >>>>>>>> > > > > >>>>>>>> This bug is causing failures to initialize multiple NVMe controllers > > > > >>>>>>>> under Xen, because the NVMe driver sets up a single MSIX vector for > > > > >>>>>>>> each controller (concurrently), and then after using that to talk to > > > > >>>>>>>> the controller for some configuration data, it disables the single MSIX > > > > >>>>>>>> vector and re-configures all the MSIX vectors it needs. So the MSIX > > > > >>>>>>>> setup code tries to re-use the cached pirq from the first vector > > > > >>>>>>>> for each controller, but the hypervisor has already given away that > > > > >>>>>>>> pirq to another controller, and its initialization fails. > > > > >>>>>>>> > > > > >>>>>>>> This is discussed in more detail at: > > > > >>>>>>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html > > > > >>>>>>>> > > > > >>>>>>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") > > > > >>>>>>>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> > > > > >>>>>>> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > > > >>>>>> This doesn't seem to be applied yet, is it still waiting on another > > > > >>>>>> ack? Or maybe I'm looking at the wrong git tree... 
> > > > >>>>> Am I wrong or shouldn't this go through the PCI tree? Konrad? > > > > >>>> Konrad is away this week but since pull request for Xen tree just went > > > > >>>> out we should probably wait until rc1 anyway (unless something big comes > > > > >>>> up before that). > > > > >>> I assume this should go via the Xen or x86 tree, since that's how most > > > > >>> arch/x86/pci/xen.c patches have been handled, including af42b8d12f8a. > > > > >>> If you think otherwise, let me know. > > > > >> OK, I applied it to Xen tree's for-linus-4.11. > > > > > Hm, we want this (c74fd80f2f4) in stable too, don't we? > > > > > > > > > > > > Maybe. > > > > > > > > Per https://lists.xen.org/archives/html/xen-devel/2017-01/msg00987.html > > > > it may break things on older (4.4-) hypervisors. They are out of > > > > support, which is why this patch went in now but I am not sure this > > > > automatically applies to stable kernels. > > > > > > > > Stefano? > > > > > > This is a difficult call. We could just say that all the broken Xen > > > versions are out of support, so let's fix all the Linux kernel stable > > > trees that we can. > > > > > > Or we could give a look at the release dates. Linux 3.18 is still > > > maintained and was tagged on Dec 7 2014. > > > > Don't do anything "special" for 3.18 if you have to. I'm only > > semi-maintaining it because some SoC vendors never upstreamed their > > trees and lots of devices rely on it. None of them use Xen on their > > platforms, so no need for me to backport any change there. > > Thanks Greg, that is good info. Is 4.4 the oldest Linux tree fully > maintained? Well, the one that _I_ fully maintain :) There are some older ones, but you are free to say how far back any patch should go, you're the maintainers of this code, not me... thanks, greg k-h ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data 2017-01-13 20:00 ` Boris Ostrovsky 2017-01-13 20:07 ` Dan Streetman @ 2017-01-13 20:13 ` Dan Streetman 1 sibling, 0 replies; 34+ messages in thread From: Dan Streetman @ 2017-01-13 20:13 UTC (permalink / raw) To: Boris Ostrovsky Cc: Stefano Stabellini, jgross, Konrad Rzeszutek Wilk, Dan Streetman, Bjorn Helgaas, xen-devel, linux-kernel, linux-pci On Fri, Jan 13, 2017 at 3:00 PM, Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote: > On 01/13/2017 01:44 PM, Stefano Stabellini wrote: >> On Fri, 13 Jan 2017, Dan Streetman wrote: >>> On Wed, Jan 11, 2017 at 6:25 PM, Dan Streetman <ddstreet@ieee.org> wrote: >>>> On Wed, Jan 11, 2017 at 1:46 PM, Stefano Stabellini >>>> <sstabellini@kernel.org> wrote: >>>>> On Wed, 11 Jan 2017, Dan Streetman wrote: >>>>>> On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini >>>>>> <sstabellini@kernel.org> wrote: >>>>>>> On Tue, 10 Jan 2017, Dan Streetman wrote: >>>>>>>> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini >>>>>>>> <sstabellini@kernel.org> wrote: >>>>>>>>> On Tue, 10 Jan 2017, Dan Streetman wrote: >>>>>>>>>> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: >>>>>>>>>>> On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini >>>>>>>>>>> <sstabellini@kernel.org> wrote: >>>>>>>>>>>> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: >>>>>>>>>>>>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: >>>>>>>>>>>>>> On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky >>>>>>>>>>>>>> <boris.ostrovsky@oracle.com> wrote: >>>>>>>>>>>>>>> On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: >>>>>>>>>>>>>>>> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: >>>>>>>>>>>>>>>>> Do not read a pci device's msi message data to see if a pirq was >>>>>>>>>>>>>>>>> previously configured for the device's msi/msix, as the old pirq was >>>>>>>>>>>>>>>>> unmapped and may now be in use by another pci device. 
The previous >>>>>>>>>>>>>>>>> pirq should never be re-used; instead a new pirq should always be >>>>>>>>>>>>>>>>> allocated from the hypervisor. >>>>>>>>>>>>>>>> Won't this cause a starvation problem? That is we will run out of PIRQs >>>>>>>>>>>>>>>> as we are not reusing them? >>>>>>>>>>>>>>> Don't we free the pirq when we unmap it? >>>>>>>>>>>>>> I think this is actually a bit worse than I initially thought. After >>>>>>>>>>>>>> looking a bit closer, and I think there's an asymmetry with pirq >>>>>>>>>>>>>> allocation: >>>>>>>>>>>>> Lets include Stefano, >>>>>>>>>>>>> >>>>>>>>>>>>> Thank you for digging in this! This has quite the deja-vu >>>>>>>>>>>>> feeling as I believe I hit this at some point in the past and >>>>>>>>>>>>> posted some possible ways of fixing this. But sadly I can't >>>>>>>>>>>>> find the thread. >>>>>>>>>>>> This issue seems to be caused by: >>>>>>>>>>>> >>>>>>>>>>>> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f >>>>>>>>>>>> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> >>>>>>>>>>>> Date: Wed Dec 1 14:51:44 2010 +0000 >>>>>>>>>>>> >>>>>>>>>>>> xen: fix MSI setup and teardown for PV on HVM guests >>>>>>>>>>>> >>>>>>>>>>>> which was a fix to a bug: >>>>>>>>>>>> >>>>>>>>>>>> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when >>>>>>>>>>>> trying to enable the same MSI for the second time: the old MSI to pirq >>>>>>>>>>>> mapping is still valid at this point but xen_hvm_setup_msi_irqs would >>>>>>>>>>>> try to assign a new pirq anyway. >>>>>>>>>>>> A simple way to reproduce this bug is to assign an MSI capable network >>>>>>>>>>>> card to a PV on HVM guest, if the user brings down the corresponding >>>>>>>>>>>> ethernet interface and up again, Linux would fail to enable MSIs on the >>>>>>>>>>>> device. >>>>>>>>>>>> >>>>>>>>>>>> I don't remember any of the details. 
From the description of this bug, >>>>>>>>>>>> it seems that Xen changed behavior in the past few years: before it used >>>>>>>>>>>> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the >>>>>>>>>>>> old MSI to pirq mapping is still valid at this point", the pirq couldn't >>>>>>>>>>>> have been completely freed, then reassigned to somebody else the way it >>>>>>>>>>>> is described in this email. >>>>>>>>>>>> >>>>>>>>>>>> I think we should indentify the changeset or Xen version that introduced >>>>>>>>>>>> the new behavior. If it is old enough, we might be able to just revert >>>>>>>>>>>> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the >>>>>>>>>>>> behavior conditional to the Xen version. >>>>>>>>>>> Are PT devices the only MSI-capable devices available in a Xen guest? >>>>>>>>>>> That's where I'm seeing this problem, with PT NVMe devices. >>>>>>>>> They are the main ones. It is possible to have emulated virtio devices >>>>>>>>> with emulated MSIs, for example virtio-net. Althought they are not in >>>>>>>>> the Xen Project CI-loop, so I wouldn't be surprised if they are broken >>>>>>>>> too. >>>>>>>>> >>>>>>>>> >>>>>>>>>>> I can say that on the Xen guest with NVMe PT devices I'm testing on, >>>>>>>>>>> with the patch from this thread (which essentially reverts your commit >>>>>>>>>>> above) as well as some added debug to see the pirq numbers, cycles of >>>>>>>>>>> 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the >>>>>>>>>>> hypervisor provides essentially the same pirqs for each modprobe, >>>>>>>>>>> since they were freed by the rmmod. >>>>>>>>> I am fine with reverting the old patch, but we need to understand what >>>>>>>>> caused the change in behavior first. Maybe somebody else with a Xen PCI >>>>>>>>> passthrough setup at hand can help with that. 
>>>>>>>>> >>>>>>>>> In the Xen code I can still see: >>>>>>>>> >>>>>>>>> case ECS_PIRQ: { >>>>>>>>> struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq); >>>>>>>>> >>>>>>>>> if ( !pirq ) >>>>>>>>> break; >>>>>>>>> if ( !is_hvm_domain(d1) ) >>>>>>>>> pirq_guest_unbind(d1, pirq); >>>>>>>>> >>>>>>>>> which means that pirq_guest_unbind should only be called on evtchn_close >>>>>>>>> if the guest is not an HVM guest. >>>>>>>> I tried an experiment to call get_free_pirq on both sides of a >>>>>>>> evtchn_close hcall, using two SRIOV nics. When I rmmod/modprobe, I >>>>>>>> see something interesting; each nic uses 3 MSIs, and it looks like >>>>>>>> when they shut down, each nic's 3 pirqs are not available until after >>>>>>>> the nic is done shutting down, so it seems like closing the evtchn >>>>>>>> isn't what makes the pirq free. >>>>>>>> >>>>>>>> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290 >>>>>>>> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291 >>>>>>>> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292 >>>>>>>> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps >>>>>>>> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293 >>>>>>>> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294 >>>>>>>> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295 >>>>>>>> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps >>>>>>>> >>>>>>>> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98. 
>>>>>>>> >>>>>>>> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1, >>>>>>>> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() == >>>>>>>> 0, xen_pvh_domain() == 0 >>>>>>>> >>>>>>>> just to be sure, a bit of dbg so I know what domain this is :-) >>>>>>>> >>>>>>>> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93 >>>>>>>> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295 >>>>>>>> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92 >>>>>>>> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91 >>>>>>>> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294 >>>>>>>> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90 >>>>>>>> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89 >>>>>>>> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293 >>>>>>>> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88 >>>>>>>> >>>>>>>> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but >>>>>>>> none of those become available for get_free_pirq. >>>>>>>> >>>>>>>> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98 >>>>>>>> >>>>>>>> aha, now nic 4 has fully finished shutting down, and nic 3 has started >>>>>>>> shutdown; we see those pirqs from nic 4 are now available. So it must >>>>>>>> not be evtchn closing that frees the pirqs. 
>>>>>>>> >>>>>>>> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292 >>>>>>>> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97 >>>>>>>> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96 >>>>>>>> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291 >>>>>>>> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87 >>>>>>>> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86 >>>>>>>> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290 >>>>>>>> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85 >>>>>>>> >>>>>>>> >>>>>>>> so, since pci_disable_msix eventually calls xen_teardown_msi_irq() >>>>>>>> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq() >>>>>>>> (and recompiled/rebooted) and found the pirqs have already been freed >>>>>>>> before that is called: >>>>>>>> >>>>>>>> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295 >>>>>>>> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294 >>>>>>>> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293 >>>>>>>> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100 >>>>>>>> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293 >>>>>>>> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99 >>>>>>>> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98 >>>>>>>> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294 >>>>>>>> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97 >>>>>>>> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96 >>>>>>>> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295 >>>>>>>> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95 >>>>>>>> >>>>>>>> 
>>>>>>>> I'm still following thru the kernel code, but it's not immediately >>>>>>>> obvious what exactly is telling the hypervisor to free the pirqs; any >>>>>>>> idea? >>>>>>>> >>>>>>>> >From the hypervisor code, it seems that the pirq is "available" via >>>>>>>> is_free_pirq(): >>>>>>>> return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) || >>>>>>>> pirq->arch.hvm.emuirq == IRQ_UNBOUND)); >>>>>>>> >>>>>>>> when the evtchn is closed, it does: >>>>>>>> if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 ) >>>>>>>> unmap_domain_pirq_emuirq(d1, pirq->pirq); >>>>>>>> >>>>>>>> and that call to unmap_domain_pirq_emuirq does: >>>>>>>> info->arch.hvm.emuirq = IRQ_UNBOUND; >>>>>>>> >>>>>>>> so, the only thing left is to clear pirq->arch.irq,but the only place >>>>>>>> I can find that does that is clear_domain_irq_pirq(), which is only >>>>>>>> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not >>>>>>>> seeing where either of those would be called when all the kernel is >>>>>>>> doing is disabling a pci device. >>>>>>> Thanks for the info. I think I know what causes the pirq to be unmapped: >>>>>>> when Linux disables msi or msix on the device, using the regular pci >>>>>>> config space based method, QEMU (which emulates the PCI config space) >>>>>>> tells Xen to unmap the pirq. >>>>>> aha, via a XEN_DOMCTL_unbind_pt_irq, maybe? Well that makes more sense then. >>>>>> >>>>>>> If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs >>>>>>> to be called a second time without msis being disabled first, then I >>>>>>> think we can revert the patch. >>>>>> It doesn't seem possible to call it twice from a correctly-behaved >>>>>> driver, but in case of a driver bug that does try to enable msi/msix >>>>>> multiple times without disabling, __pci_enable_msix() only does >>>>>> WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does >>>>>> WARN_ON(!!dev->msi_enabled); they both will continue. 
Maybe that >>>>>> should be changed to warn and also return error, to prevent >>>>>> re-configuring msi/msix if already configured? Or, maybe the warning >>>>>> is enough - the worst thing that reverting the patch does is use extra >>>>>> pirqs, right? >>>>> I think the warning is enough. Can you confirm that with >>>>> af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also >>>>> >>>>> ifconfig eth0 down; ifconfig eth0 up >>>>> >>>>> still work as expected, no warnings? >>>> yes, with the patch that started this thread - which essentially >>>> reverts af42b8d12f8adec6711cb824549a0edac6a4ae8f - there are no >>>> warnings and ifconfig down ; ifconfig up work as expected. >>>> >>>>> >>>>> It looks like the patch that changed hypervisor (QEMU actually) behavior >>>>> is: >>>>> >>>>> commit c976437c7dba9c7444fb41df45468968aaa326ad >>>>> Author: Zhenzhong Duan <zhenzhong.duan@oracle.com> >>>>> Date: Wed May 7 13:41:48 2014 +0000 >>>>> >>>>> qemu-xen: free all the pirqs for msi/msix when driver unload >>>>> >>>>> From this commit onward, QEMU started unmapping pirqs when MSIs are >>>>> disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8, >>>>> 4.7, 4.6, 4.5. The newest release without the commit is Xen 4.4. >>>>> >>>>> If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on >>>>> all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4 >>>>> and older. Given that Xen 4.4 is out of support, I think we should go >>>>> ahead with it. Opinions? >>> Looks like there's no complaints; is my patch from the start of this >>> thread ok to use, or can you craft a patch to use? My patch's >>> description could use updating to add some of the info/background from >>> this discussion... >> Hi Dan, I would like an explicit Ack from the other maintainers, Boris >> and Juergen. Let me place them in To: to make it more obvious. > > > Where is the patch? I don't think 'git revert' will work. 
I just re-sent the same patch that originally started this long thread,
except with the description updated to include details on the previous
commits and a Fixes: line. It reverts the main part of commit
af42b8d12f8adec6711cb824549a0edac6a4ae8f, but there are other changes in
that commit that either don't apply anymore and/or were already
removed/changed (like changes to the xen_allocate_pirq_msi function), or
changes that don't need reverting (like adding the XEN_PIRQ_MSI_DATA
#define, which should stay). The only part that really needed reverting
was the code in xen_hvm_setup_msi_irqs() that checked the pirq number
that was cached in the pci msi message data.

Thanks!

>
> And Konrad will need to ack it too as he is Xen-PCI maintainer.
>
> -boris
end of thread, other threads: [~2017-05-03 23:19 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
2017-01-05 19:28 [PATCH] xen: do not re-use pirq number cached in pci device msi msg data Dan Streetman
2017-01-07  1:06 ` Konrad Rzeszutek Wilk
2017-01-09 14:59 ` Boris Ostrovsky
2017-01-09 15:42 ` [Xen-devel] " Dan Streetman
2017-01-09 15:59 ` Konrad Rzeszutek Wilk
2017-01-09 19:30 ` Stefano Stabellini
2017-01-10 15:57 ` Dan Streetman
2017-01-10 18:41 ` Dan Streetman
2017-01-10 19:03 ` Stefano Stabellini
2017-01-10 21:32 ` Dan Streetman
2017-01-10 23:28 ` Boris Ostrovsky
2017-01-11  1:25 ` Stefano Stabellini
2017-01-11 15:26 ` Dan Streetman
2017-01-11 18:46 ` Stefano Stabellini
2017-01-11 23:25 ` Dan Streetman
2017-01-13 18:31 ` Dan Streetman
2017-01-13 18:44 ` Stefano Stabellini
2017-01-13 20:00 ` Boris Ostrovsky
2017-01-13 20:07 ` Dan Streetman
2017-01-13 20:54 ` Stefano Stabellini
2017-01-13 21:49 ` Boris Ostrovsky
2017-01-13 22:30 ` Konrad Rzeszutek Wilk
2017-02-21 15:31 ` Dan Streetman
2017-02-21 15:45 ` Juergen Gross
2017-02-21 15:58 ` Boris Ostrovsky
2017-02-22 14:28 ` Bjorn Helgaas
2017-02-22 15:14 ` Boris Ostrovsky
2017-05-03 18:19 ` [Xen-devel] " David Woodhouse
2017-05-03 18:43 ` Boris Ostrovsky
2017-05-03 22:59 ` Stefano Stabellini
2017-05-03 23:06 ` Greg KH
2017-05-03 23:12 ` Stefano Stabellini
2017-05-03 23:19 ` Greg KH
2017-01-13 20:13 ` Dan Streetman