* RE: [PATCH v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI
2022-11-04 22:29 [PATCH v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI Dexuan Cui
@ 2022-11-04 23:38 ` Michael Kelley (LINUX)
2022-11-10 20:49 ` Dexuan Cui
2022-11-10 21:43 ` Bjorn Helgaas
2022-11-11 9:56 ` Lorenzo Pieralisi
2 siblings, 1 reply; 9+ messages in thread
From: Michael Kelley (LINUX) @ 2022-11-04 23:38 UTC (permalink / raw)
To: Dexuan Cui, quic_jhugo, quic_carlv, wei.liu, KY Srinivasan,
Haiyang Zhang, sthemmin, lpieralisi, bhelgaas, linux-hyperv,
linux-pci, linux-kernel, robh, kw, helgaas, alex.williamson,
boqun.feng, Boqun Feng
From: Dexuan Cui <decui@microsoft.com> Sent: Friday, November 4, 2022 3:30 PM
>
> Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
> 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
> 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
> b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")
>
> It turns out that the third patch (b4b77778ecc5) causes a performance
> regression because all the interrupts now happen on 1 physical CPU (or two
> pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
> devices, it may suffer from soft lockups if the workload is heavy, e.g.,
> see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
>
> Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
> hv_irq_unmask() -> hv_arch_irq_unmask() ->
> hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
> virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is
> determined only once in hv_compose_msi_msg() where only vCPU0 is specified;
> consequently the hypervisor only uses 1 target pCPU for all the interrupts.
>
> Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
> is determinted the second time, the vCPU in the effective affinity mask is
> used (i.e., it isn't always vCPU0), so the hypervisor chooses different
> pCPU for each interrupt.
>
> The hypercall will be fixed in future to update the pCPU as well, but
> that will take quite a while, so let's restore the old behavior in
> hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
> manner for each PCI device, so the interrupts of different devices can
> happen on different pCPUs, though the interrupts of each device happen on
> some single pCPU.
>
> The hypercall fix may not be backported to all old versions of Hyper-V, so
> we want to have this guest side change for ever (or at least till we're sure
> the old affected versions of Hyper-V are no longer supported).
>
> Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
>
> ---
>
> v1 is here:
>
> https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
>
> Changes in v2:
> round-robin the vCPU for multi-MSI.
> The commit message is re-worked.
> Added Jeff and Carl's Co-developed-by and Signed-off-by.
>
> Changes in v3:
> Michael Kelley kindly helped to make a great comment, and I added the
> comment before hv_compose_msi_req_get_cpu(). Thank you, Michael!
>
> Rebased to Hyper-V tree's "hyperv-fixes" branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes
>
> Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go
> through the Hyper-V tree because it's rebased to another hv_pci patch (which
> only exists in the Hyper-V tree for now):
> e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()")
>
> BTW, Michael has some other hv_pci patches, which would also need to go through
> the Hyper-V tree:
>
> https://lwn.net/ml/linux-kernel/1666288635-72591-1-git-send-email-mikelley%40microsoft.com/
>
>
> drivers/pci/controller/pci-hyperv.c | 90 ++++++++++++++++++++++++-----
> 1 file changed, 75 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index ba64284eaf9f..fa5a1ba35a82 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct pci_response *resp,
> }
>
> static u32 hv_compose_msi_req_v1(
> - struct pci_create_interrupt *int_pkt, const struct cpumask *affinity,
> + struct pci_create_interrupt *int_pkt,
> u32 slot, u8 vector, u16 vector_count)
> {
> int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE;
> @@ -1631,6 +1631,35 @@ static u32 hv_compose_msi_req_v1(
> return sizeof(*int_pkt);
> }
>
> +/*
> + * The vCPU selected by hv_compose_multi_msi_req_get_cpu() and
> + * hv_compose_msi_req_get_cpu() is a "dummy" vCPU because the final vCPU to be
> + * interrupted is specified later in hv_irq_unmask() and communicated to Hyper-V
> + * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is
> + * not irrelevant because Hyper-V chooses the physical CPU to handle the
> + * interrupts based on the vCPU specified in message sent to the vPCI VSP in
> + * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest,
> + * but assigning too many vPCI device interrupts to the same pCPU can cause a
> + * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-V
> + * to spread out the pCPUs that it selects.
> + *
> + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
> + * to always return the same dummy vCPU, because a second call to
> + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
> + * new pCPU for the interrupt. But for the multi-MSI case, the second call to
> + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the
> + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
> + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
> + * the same pCPU, even though the vCPUs will be spread out by later calls
> + * to hv_irq_unmask(), but that is the best we can do now.
> + *
> + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not*
> + * cause Hyper-V to reselect the pCPU based on the specified vCPU. Such an
> + * enhancement is planned for a future version. With that enhancement, the
> + * dummy vCPU selection won't matter, and interrupts for the same multi-MSI
> + * device will be spread across multiple pCPUs.
> + */
> +
> /*
> * Create MSI w/ dummy vCPU set targeting just one vCPU, overwritten
> * by subsequent retarget in hv_irq_unmask().
> @@ -1640,18 +1669,39 @@ static int hv_compose_msi_req_get_cpu(const struct cpumask *affinity)
> return cpumask_first_and(affinity, cpu_online_mask);
> }
>
> -static u32 hv_compose_msi_req_v2(
> - struct pci_create_interrupt2 *int_pkt, const struct cpumask *affinity,
> - u32 slot, u8 vector, u16 vector_count)
> +/*
> + * Make sure the dummy vCPU values for multi-MSI don't all point to vCPU0.
> + */
> +static int hv_compose_multi_msi_req_get_cpu(void)
> {
> + static DEFINE_SPINLOCK(multi_msi_cpu_lock);
> +
> + /* -1 means starting with CPU 0 */
> + static int cpu_next = -1;
> +
> + unsigned long flags;
> int cpu;
>
> + spin_lock_irqsave(&multi_msi_cpu_lock, flags);
> +
> + cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids,
> + false);
> + cpu = cpu_next;
> +
> + spin_unlock_irqrestore(&multi_msi_cpu_lock, flags);
> +
> + return cpu;
> +}
> +
> +static u32 hv_compose_msi_req_v2(
> + struct pci_create_interrupt2 *int_pkt, int cpu,
> + u32 slot, u8 vector, u16 vector_count)
> +{
> int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE2;
> int_pkt->wslot.slot = slot;
> int_pkt->int_desc.vector = vector;
> int_pkt->int_desc.vector_count = vector_count;
> int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
> - cpu = hv_compose_msi_req_get_cpu(affinity);
> int_pkt->int_desc.processor_array[0] =
> hv_cpu_number_to_vp_number(cpu);
> int_pkt->int_desc.processor_count = 1;
> @@ -1660,18 +1710,15 @@ static u32 hv_compose_msi_req_v2(
> }
>
> static u32 hv_compose_msi_req_v3(
> - struct pci_create_interrupt3 *int_pkt, const struct cpumask *affinity,
> + struct pci_create_interrupt3 *int_pkt, int cpu,
> u32 slot, u32 vector, u16 vector_count)
> {
> - int cpu;
> -
> int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE3;
> int_pkt->wslot.slot = slot;
> int_pkt->int_desc.vector = vector;
> int_pkt->int_desc.reserved = 0;
> int_pkt->int_desc.vector_count = vector_count;
> int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
> - cpu = hv_compose_msi_req_get_cpu(affinity);
> int_pkt->int_desc.processor_array[0] =
> hv_cpu_number_to_vp_number(cpu);
> int_pkt->int_desc.processor_count = 1;
> @@ -1715,12 +1762,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> struct pci_create_interrupt3 v3;
> } int_pkts;
> } __packed ctxt;
> + bool multi_msi;
> u64 trans_id;
> u32 size;
> int ret;
> + int cpu;
> +
> + msi_desc = irq_data_get_msi_desc(data);
> + multi_msi = !msi_desc->pci.msi_attrib.is_msix &&
> + msi_desc->nvec_used > 1;
>
> /* Reuse the previous allocation */
> - if (data->chip_data) {
> + if (data->chip_data && multi_msi) {
> int_desc = data->chip_data;
> msg->address_hi = int_desc->address >> 32;
> msg->address_lo = int_desc->address & 0xffffffff;
> @@ -1728,7 +1781,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> return;
> }
>
> - msi_desc = irq_data_get_msi_desc(data);
> pdev = msi_desc_to_pci_dev(msi_desc);
> dest = irq_data_get_effective_affinity_mask(data);
> pbus = pdev->bus;
> @@ -1738,11 +1790,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> if (!hpdev)
> goto return_null_message;
>
> + /* Free any previous message that might have already been composed. */
> + if (data->chip_data && !multi_msi) {
> + int_desc = data->chip_data;
> + data->chip_data = NULL;
> + hv_int_desc_free(hpdev, int_desc);
> + }
> +
> int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC);
> if (!int_desc)
> goto drop_reference;
>
> - if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) {
> + if (multi_msi) {
> /*
> * If this is not the first MSI of Multi MSI, we already have
> * a mapping. Can exit early.
> @@ -1767,9 +1826,11 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> */
> vector = 32;
> vector_count = msi_desc->nvec_used;
> + cpu = hv_compose_multi_msi_req_get_cpu();
> } else {
> vector = hv_msi_get_int_vector(data);
> vector_count = 1;
> + cpu = hv_compose_msi_req_get_cpu(dest);
> }
>
> /*
> @@ -1785,7 +1846,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> switch (hbus->protocol_version) {
> case PCI_PROTOCOL_VERSION_1_1:
> size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1,
> - dest,
> hpdev->desc.win_slot.slot,
> (u8)vector,
> vector_count);
> @@ -1794,7 +1854,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> case PCI_PROTOCOL_VERSION_1_2:
> case PCI_PROTOCOL_VERSION_1_3:
> size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2,
> - dest,
> + cpu,
> hpdev->desc.win_slot.slot,
> (u8)vector,
> vector_count);
> @@ -1802,7 +1862,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>
> case PCI_PROTOCOL_VERSION_1_4:
> size = hv_compose_msi_req_v3(&ctxt.int_pkts.v3,
> - dest,
> + cpu,
> hpdev->desc.win_slot.slot,
> vector,
> vector_count);
> --
> 2.25.1
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
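
The heart of the fix quoted above is the round-robin dummy-vCPU selection in
hv_compose_multi_msi_req_get_cpu(). As a minimal user-space sketch of the same
pattern - with a plain array and a pthread mutex standing in for
cpu_online_mask and the kernel spinlock, and all names purely illustrative
rather than the driver's - the selection behaves like this:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

/* Stand-in for cpu_online_mask: CPUs 2 and 6 are offline in this example. */
static const bool cpu_online[NR_CPUS] = {
	true, true, false, true, true, true, false, true
};

static pthread_mutex_t multi_msi_cpu_lock = PTHREAD_MUTEX_INITIALIZER;
static int cpu_next = -1;	/* -1 means the first search starts at CPU 0 */

/* Return the next online CPU after @prev, wrapping around at NR_CPUS. */
static int next_online_cpu_wrap(int prev)
{
	for (int i = 1; i <= NR_CPUS; i++) {
		int cpu = (prev + i) % NR_CPUS;

		if (cpu_online[cpu])
			return cpu;
	}
	return 0;	/* unreachable while at least one CPU is online */
}

static int compose_multi_msi_req_get_cpu(void)
{
	int cpu;

	pthread_mutex_lock(&multi_msi_cpu_lock);
	cpu_next = next_online_cpu_wrap(cpu_next);
	cpu = cpu_next;
	pthread_mutex_unlock(&multi_msi_cpu_lock);

	return cpu;
}

int main(void)
{
	/* The first message of each multi-MSI device gets a new dummy vCPU. */
	for (int dev = 0; dev < 5; dev++)
		printf("device %d -> dummy vCPU %d\n",
		       dev, compose_multi_msi_req_get_cpu());
	return 0;
}

Consecutive multi-MSI devices start on different dummy vCPUs, which is what
nudges Hyper-V to spread its pCPU choices across devices, even though each
device's interrupts still land on a single pCPU until the hypercall is
enhanced.
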
* RE: [PATCH v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI
2022-11-04 23:38 ` Michael Kelley (LINUX)
@ 2022-11-10 20:49 ` Dexuan Cui
0 siblings, 0 replies; 9+ messages in thread
From: Dexuan Cui @ 2022-11-10 20:49 UTC (permalink / raw)
To: Michael Kelley (LINUX),
quic_jhugo, quic_carlv, wei.liu, KY Srinivasan, Haiyang Zhang,
sthemmin, lpieralisi, bhelgaas, linux-hyperv, linux-pci,
linux-kernel, robh, kw, helgaas, alex.williamson, boqun.feng,
Boqun Feng
> Sent: Friday, November 4, 2022 4:38 PM
> > ...
> > Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go
> > through the Hyper-V tree because it's rebased to another hv_pci patch
> > (which only exists in the Hyper-V tree for now):
> > e70af8d040d2 ("PCI: hv: Fix the definition of vector in
> > hv_compose_msi_msg()")
> >
> > BTW, Michael has some other hv_pci patches, which would also need to go
> > through the Hyper-V tree:
> >
> Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Hi Bjorn, Lorenzo, if you have no objection to the patch, I suggest Wei merge
it through the Hyper-V tree early next week.
* Re: [PATCH v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI
2022-11-04 22:29 [PATCH v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI Dexuan Cui
2022-11-04 23:38 ` Michael Kelley (LINUX)
@ 2022-11-10 21:43 ` Bjorn Helgaas
2022-11-11 7:17 ` Dexuan Cui
2022-11-11 9:56 ` Lorenzo Pieralisi
2 siblings, 1 reply; 9+ messages in thread
From: Bjorn Helgaas @ 2022-11-10 21:43 UTC (permalink / raw)
To: Dexuan Cui
Cc: quic_jhugo, quic_carlv, wei.liu, kys, haiyangz, sthemmin,
lpieralisi, bhelgaas, linux-hyperv, linux-pci, linux-kernel,
mikelley, robh, kw, alex.williamson, boqun.feng, Boqun.Feng
On Fri, Nov 04, 2022 at 03:29:53PM -0700, Dexuan Cui wrote:
> Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
> 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
> 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
> b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")
>
> It turns out that the third patch (b4b77778ecc5) causes a performance
> regression because all the interrupts now happen on 1 physical CPU (or two
> pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
> devices, it may suffer from soft lockups if the workload is heavy, e.g.,
> see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
>
> Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
> hv_irq_unmask() -> hv_arch_irq_unmask() ->
> hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
> virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is
> determined only once in hv_compose_msi_msg() where only vCPU0 is specified;
> consequently the hypervisor only uses 1 target pCPU for all the interrupts.
>
> Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
> is determinted the second time, the vCPU in the effective affinity mask is
s/determinted/determined/
> used (i.e., it isn't always vCPU0), so the hypervisor chooses different
> pCPU for each interrupt.
>
> The hypercall will be fixed in future to update the pCPU as well, but
> that will take quite a while, so let's restore the old behavior in
> hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
> manner for each PCI device, so the interrupts of different devices can
> happen on different pCPUs, though the interrupts of each device happen on
> some single pCPU.
>
> The hypercall fix may not be backported to all old versions of Hyper-V, so
> we want to have this guest side change for ever (or at least till we're sure
s/for ever/forever/
> the old affected versions of Hyper-V are no longer supported).
>
> Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
>
> ---
>
> v1 is here:
> https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
>
> Changes in v2:
> round-robin the vCPU for multi-MSI.
> The commit message is re-worked.
> Added Jeff and Carl's Co-developed-by and Signed-off-by.
>
> Changes in v3:
> Michael Kelley kindly helped to make a great comment, and I added the
> comment before hv_compose_msi_req_get_cpu(). Thank you, Michael!
>
> Rebased to Hyper-V tree's "hyperv-fixes" branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes
>
> Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go
> through the Hyper-V tree because it's rebased to another hv_pci patch (which
> only exists in the Hyper-V tree for now):
> e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()")
Fine with me, but it's Lorenzo's area so I don't want to preemptively
ack it.
* RE: [PATCH v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI
2022-11-10 21:43 ` Bjorn Helgaas
@ 2022-11-11 7:17 ` Dexuan Cui
0 siblings, 0 replies; 9+ messages in thread
From: Dexuan Cui @ 2022-11-11 7:17 UTC (permalink / raw)
To: Bjorn Helgaas, lpieralisi, wei.liu
Cc: quic_jhugo, quic_carlv, KY Srinivasan, Haiyang Zhang, sthemmin,
bhelgaas, linux-hyperv, linux-pci, linux-kernel,
Michael Kelley (LINUX),
robh, kw, alex.williamson, boqun.feng, Boqun Feng
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Thursday, November 10, 2022 1:44 PM
> > ...
> > Note: before b4b77778ecc5, the pCPU is determined twice, and when the
> > pCPU is determinted the second time, the vCPU in the effective affinity
> > mask is
>
> s/determinted/determined/
Thanks, Bjorn! I suppose Wei can help fix this :-)
> > The hypercall fix may not be backported to all old versions of Hyper-V, so
> > we want to have this guest side change for ever (or at least till we're sure
>
> s/for ever/forever/
ditto.
> > Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go
> > through the Hyper-V tree because it's rebased to another hv_pci patch
> > (which only exists in the Hyper-V tree for now):
> > e70af8d040d2 ("PCI: hv: Fix the definition of vector in
> > hv_compose_msi_msg()")
>
> Fine with me, but it's Lorenzo's area so I don't want to preemptively
> ack it.
Thanks. Hopefully Lorenzo can take a look soon.
* Re: [PATCH v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI
2022-11-04 22:29 [PATCH v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI Dexuan Cui
2022-11-04 23:38 ` Michael Kelley (LINUX)
2022-11-10 21:43 ` Bjorn Helgaas
@ 2022-11-11 9:56 ` Lorenzo Pieralisi
2022-11-11 23:40 ` Wei Liu
2022-11-12 0:58 ` Dexuan Cui
2 siblings, 2 replies; 9+ messages in thread
From: Lorenzo Pieralisi @ 2022-11-11 9:56 UTC (permalink / raw)
To: Dexuan Cui
Cc: quic_jhugo, quic_carlv, wei.liu, kys, haiyangz, sthemmin,
bhelgaas, linux-hyperv, linux-pci, linux-kernel, mikelley, robh,
kw, helgaas, alex.williamson, boqun.feng, Boqun.Feng
On Fri, Nov 04, 2022 at 03:29:53PM -0700, Dexuan Cui wrote:
> Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
> 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
> 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
> b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")
>
> It turns out that the third patch (b4b77778ecc5) causes a performance
> regression because all the interrupts now happen on 1 physical CPU (or two
> pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
> devices, it may suffer from soft lockups if the workload is heavy, e.g.,
> see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
>
> Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
> hv_irq_unmask() -> hv_arch_irq_unmask() ->
> hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
> virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is
> determined only once in hv_compose_msi_msg() where only vCPU0 is specified;
> consequently the hypervisor only uses 1 target pCPU for all the interrupts.
>
> Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
> is determinted the second time, the vCPU in the effective affinity mask is
> used (i.e., it isn't always vCPU0), so the hypervisor chooses different
> pCPU for each interrupt.
>
> The hypercall will be fixed in future to update the pCPU as well, but
> that will take quite a while, so let's restore the old behavior in
> hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
> manner for each PCI device, so the interrupts of different devices can
> happen on different pCPUs, though the interrupts of each device happen on
> some single pCPU.
>
> The hypercall fix may not be backported to all old versions of Hyper-V, so
> we want to have this guest side change for ever (or at least till we're sure
> the old affected versions of Hyper-V are no longer supported).
>
> Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
>
> ---
>
> v1 is here:
> https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
>
> Changes in v2:
> round-robin the vCPU for multi-MSI.
> The commit message is re-worked.
> Added Jeff and Carl's Co-developed-by and Signed-off-by.
>
> Changes in v3:
> Michael Kelley kindly helped to make a great comment, and I added the
> comment before hv_compose_msi_req_get_cpu(). Thank you, Michael!
>
> Rebased to Hyper-V tree's "hyperv-fixes" branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes
>
> Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go
> through the Hyper-V tree because it's rebased to another hv_pci patch (which
> only exists in the Hyper-V tree for now):
> e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()")
>
> BTW, Michael has some other hv_pci patches, which would also need to go through
> the Hyper-V tree:
> https://lwn.net/ml/linux-kernel/1666288635-72591-1-git-send-email-mikelley%40microsoft.com/
>
>
> drivers/pci/controller/pci-hyperv.c | 90 ++++++++++++++++++++++++-----
> 1 file changed, 75 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index ba64284eaf9f..fa5a1ba35a82 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct pci_response *resp,
> }
>
> static u32 hv_compose_msi_req_v1(
> - struct pci_create_interrupt *int_pkt, const struct cpumask *affinity,
> + struct pci_create_interrupt *int_pkt,
> u32 slot, u8 vector, u16 vector_count)
> {
> int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE;
> @@ -1631,6 +1631,35 @@ static u32 hv_compose_msi_req_v1(
> return sizeof(*int_pkt);
> }
>
> +/*
> + * The vCPU selected by hv_compose_multi_msi_req_get_cpu() and
> + * hv_compose_msi_req_get_cpu() is a "dummy" vCPU because the final vCPU to be
> + * interrupted is specified later in hv_irq_unmask() and communicated to Hyper-V
> + * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is
> + * not irrelevant because Hyper-V chooses the physical CPU to handle the
> + * interrupts based on the vCPU specified in message sent to the vPCI VSP in
> + * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest,
> + * but assigning too many vPCI device interrupts to the same pCPU can cause a
> + * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-V
> + * to spread out the pCPUs that it selects.
> + *
> + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
> + * to always return the same dummy vCPU, because a second call to
> + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
> + * new pCPU for the interrupt. But for the multi-MSI case, the second call to
> + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the
Why ? Can't you fix _that_ ? Why can't the initial call to
hv_compose_msi_msg() determine the _real_ target vCPU ?
> + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
> + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
> + * the same pCPU, even though the vCPUs will be spread out by later calls
> + * to hv_irq_unmask(), but that is the best we can do now.
> + *
> + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not*
"current" Hyper-V means nothing, remove it or provide versioning
information. Imagine yourself reading this comment some time
in the future.
I can't claim to understand how this MSI vCPU to pCPU mapping is made to
work in the current code, but I can't ack this patch, sorry. If you feel
it is good to merge, it is your and the Hyper-V maintainers' call, feel
free to go ahead - I can review PCI Hyper-V changes that affect the PCI
and IRQ core APIs, but I don't know Hyper-V internals.
Lorenzo
> + * cause Hyper-V to reselect the pCPU based on the specified vCPU. Such an
> + * enhancement is planned for a future version. With that enhancement, the
> + * dummy vCPU selection won't matter, and interrupts for the same multi-MSI
> + * device will be spread across multiple pCPUs.
> + */
> +
> /*
> * Create MSI w/ dummy vCPU set targeting just one vCPU, overwritten
> * by subsequent retarget in hv_irq_unmask().
> @@ -1640,18 +1669,39 @@ static int hv_compose_msi_req_get_cpu(const struct cpumask *affinity)
> return cpumask_first_and(affinity, cpu_online_mask);
> }
>
> -static u32 hv_compose_msi_req_v2(
> - struct pci_create_interrupt2 *int_pkt, const struct cpumask *affinity,
> - u32 slot, u8 vector, u16 vector_count)
> +/*
> + * Make sure the dummy vCPU values for multi-MSI don't all point to vCPU0.
> + */
> +static int hv_compose_multi_msi_req_get_cpu(void)
> {
> + static DEFINE_SPINLOCK(multi_msi_cpu_lock);
> +
> + /* -1 means starting with CPU 0 */
> + static int cpu_next = -1;
> +
> + unsigned long flags;
> int cpu;
>
> + spin_lock_irqsave(&multi_msi_cpu_lock, flags);
> +
> + cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids,
> + false);
> + cpu = cpu_next;
> +
> + spin_unlock_irqrestore(&multi_msi_cpu_lock, flags);
> +
> + return cpu;
> +}
> +
> +static u32 hv_compose_msi_req_v2(
> + struct pci_create_interrupt2 *int_pkt, int cpu,
> + u32 slot, u8 vector, u16 vector_count)
> +{
> int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE2;
> int_pkt->wslot.slot = slot;
> int_pkt->int_desc.vector = vector;
> int_pkt->int_desc.vector_count = vector_count;
> int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
> - cpu = hv_compose_msi_req_get_cpu(affinity);
> int_pkt->int_desc.processor_array[0] =
> hv_cpu_number_to_vp_number(cpu);
> int_pkt->int_desc.processor_count = 1;
> @@ -1660,18 +1710,15 @@ static u32 hv_compose_msi_req_v2(
> }
>
> static u32 hv_compose_msi_req_v3(
> - struct pci_create_interrupt3 *int_pkt, const struct cpumask *affinity,
> + struct pci_create_interrupt3 *int_pkt, int cpu,
> u32 slot, u32 vector, u16 vector_count)
> {
> - int cpu;
> -
> int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE3;
> int_pkt->wslot.slot = slot;
> int_pkt->int_desc.vector = vector;
> int_pkt->int_desc.reserved = 0;
> int_pkt->int_desc.vector_count = vector_count;
> int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
> - cpu = hv_compose_msi_req_get_cpu(affinity);
> int_pkt->int_desc.processor_array[0] =
> hv_cpu_number_to_vp_number(cpu);
> int_pkt->int_desc.processor_count = 1;
> @@ -1715,12 +1762,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> struct pci_create_interrupt3 v3;
> } int_pkts;
> } __packed ctxt;
> + bool multi_msi;
> u64 trans_id;
> u32 size;
> int ret;
> + int cpu;
> +
> + msi_desc = irq_data_get_msi_desc(data);
> + multi_msi = !msi_desc->pci.msi_attrib.is_msix &&
> + msi_desc->nvec_used > 1;
>
> /* Reuse the previous allocation */
> - if (data->chip_data) {
> + if (data->chip_data && multi_msi) {
> int_desc = data->chip_data;
> msg->address_hi = int_desc->address >> 32;
> msg->address_lo = int_desc->address & 0xffffffff;
> @@ -1728,7 +1781,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> return;
> }
>
> - msi_desc = irq_data_get_msi_desc(data);
> pdev = msi_desc_to_pci_dev(msi_desc);
> dest = irq_data_get_effective_affinity_mask(data);
> pbus = pdev->bus;
> @@ -1738,11 +1790,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> if (!hpdev)
> goto return_null_message;
>
> + /* Free any previous message that might have already been composed. */
> + if (data->chip_data && !multi_msi) {
> + int_desc = data->chip_data;
> + data->chip_data = NULL;
> + hv_int_desc_free(hpdev, int_desc);
> + }
> +
> int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC);
> if (!int_desc)
> goto drop_reference;
>
> - if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) {
> + if (multi_msi) {
> /*
> * If this is not the first MSI of Multi MSI, we already have
> * a mapping. Can exit early.
> @@ -1767,9 +1826,11 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> */
> vector = 32;
> vector_count = msi_desc->nvec_used;
> + cpu = hv_compose_multi_msi_req_get_cpu();
> } else {
> vector = hv_msi_get_int_vector(data);
> vector_count = 1;
> + cpu = hv_compose_msi_req_get_cpu(dest);
> }
>
> /*
> @@ -1785,7 +1846,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> switch (hbus->protocol_version) {
> case PCI_PROTOCOL_VERSION_1_1:
> size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1,
> - dest,
> hpdev->desc.win_slot.slot,
> (u8)vector,
> vector_count);
> @@ -1794,7 +1854,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> case PCI_PROTOCOL_VERSION_1_2:
> case PCI_PROTOCOL_VERSION_1_3:
> size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2,
> - dest,
> + cpu,
> hpdev->desc.win_slot.slot,
> (u8)vector,
> vector_count);
> @@ -1802,7 +1862,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>
> case PCI_PROTOCOL_VERSION_1_4:
> size = hv_compose_msi_req_v3(&ctxt.int_pkts.v3,
> - dest,
> + cpu,
> hpdev->desc.win_slot.slot,
> vector,
> vector_count);
> --
> 2.25.1
>
* Re: [PATCH v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI
2022-11-11 9:56 ` Lorenzo Pieralisi
@ 2022-11-11 23:40 ` Wei Liu
2022-11-12 0:58 ` Dexuan Cui
1 sibling, 0 replies; 9+ messages in thread
From: Wei Liu @ 2022-11-11 23:40 UTC (permalink / raw)
To: Lorenzo Pieralisi
Cc: Dexuan Cui, quic_jhugo, quic_carlv, wei.liu, kys, haiyangz,
sthemmin, bhelgaas, linux-hyperv, linux-pci, linux-kernel,
mikelley, robh, kw, helgaas, alex.williamson, boqun.feng,
Boqun.Feng
On Fri, Nov 11, 2022 at 10:56:28AM +0100, Lorenzo Pieralisi wrote:
> On Fri, Nov 04, 2022 at 03:29:53PM -0700, Dexuan Cui wrote:
> > Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
> > 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
> > 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
> > b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> > a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")
> >
> > It turns out that the third patch (b4b77778ecc5) causes a performance
> > regression because all the interrupts now happen on 1 physical CPU (or two
> > pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
> > devices, it may suffer from soft lockups if the workload is heavy, e.g.,
> > see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
> >
> > Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
> > hv_irq_unmask() -> hv_arch_irq_unmask() ->
> > hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
> > virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is
> > determined only once in hv_compose_msi_msg() where only vCPU0 is specified;
> > consequently the hypervisor only uses 1 target pCPU for all the interrupts.
> >
> > Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
> > is determinted the second time, the vCPU in the effective affinity mask is
> > used (i.e., it isn't always vCPU0), so the hypervisor chooses different
> > pCPU for each interrupt.
> >
> > The hypercall will be fixed in future to update the pCPU as well, but
> > that will take quite a while, so let's restore the old behavior in
> > hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> > single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
> > manner for each PCI device, so the interrupts of different devices can
> > happen on different pCPUs, though the interrupts of each device happen on
> > some single pCPU.
> >
> > The hypercall fix may not be backported to all old versions of Hyper-V, so
> > we want to have this guest side change for ever (or at least till we're sure
> > the old affected versions of Hyper-V are no longer supported).
> >
> > Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> > Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> > Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> > Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com>
> > Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com>
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> >
> > ---
> >
> > v1 is here:
> > https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
> >
> > Changes in v2:
> > round-robin the vCPU for multi-MSI.
> > The commit message is re-worked.
> > Added Jeff and Carl's Co-developed-by and Signed-off-by.
> >
> > Changes in v3:
> > Michael Kelley kindly helped to make a great comment, and I added the
> > comment before hv_compose_msi_req_get_cpu(). Thank you, Michael!
> >
> > Rebased to Hyper-V tree's "hyperv-fixes" branch:
> > https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes
> >
> > Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go
> > through the Hyper-V tree because it's rebased to another hv_pci patch (which
> > only exists in the Hyper-V tree for now):
> > e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()")
> >
> > BTW, Michael has some other hv_pci patches, which would also need to go through
> > the Hyper-V tree:
> > https://lwn.net/ml/linux-kernel/1666288635-72591-1-git-send-email-mikelley%40microsoft.com/
> >
> >
> > drivers/pci/controller/pci-hyperv.c | 90 ++++++++++++++++++++++++-----
> > 1 file changed, 75 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > index ba64284eaf9f..fa5a1ba35a82 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct pci_response *resp,
> > }
> >
> > static u32 hv_compose_msi_req_v1(
> > - struct pci_create_interrupt *int_pkt, const struct cpumask *affinity,
> > + struct pci_create_interrupt *int_pkt,
> > u32 slot, u8 vector, u16 vector_count)
> > {
> > int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE;
> > @@ -1631,6 +1631,35 @@ static u32 hv_compose_msi_req_v1(
> > return sizeof(*int_pkt);
> > }
> >
> > +/*
> > + * The vCPU selected by hv_compose_multi_msi_req_get_cpu() and
> > + * hv_compose_msi_req_get_cpu() is a "dummy" vCPU because the final vCPU to be
> > + * interrupted is specified later in hv_irq_unmask() and communicated to Hyper-V
> > + * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is
> > + * not irrelevant because Hyper-V chooses the physical CPU to handle the
> > + * interrupts based on the vCPU specified in message sent to the vPCI VSP in
> > + * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest,
> > + * but assigning too many vPCI device interrupts to the same pCPU can cause a
> > + * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-V
> > + * to spread out the pCPUs that it selects.
> > + *
> > + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
> > + * to always return the same dummy vCPU, because a second call to
> > + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
> > + * new pCPU for the interrupt. But for the multi-MSI case, the second call to
> > + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the
>
> Why ? Can't you fix _that_ ? Why can't the initial call to
> hv_compose_msi_msg() determine the _real_ target vCPU ?
Dexuan, any comment on this?
>
> > + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
> > + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
> > + * the same pCPU, even though the vCPUs will be spread out by later calls
> > + * to hv_irq_unmask(), but that is the best we can do now.
> > + *
> > + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not*
>
> "current" Hyper-V means nothing, remove it or provide versioning
> information. Imagine yourself reading this comment some time
> in the future.
>
And this?
Wei.
* RE: [PATCH v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI
2022-11-11 9:56 ` Lorenzo Pieralisi
2022-11-11 23:40 ` Wei Liu
@ 2022-11-12 0:58 ` Dexuan Cui
2022-11-12 12:43 ` Wei Liu
1 sibling, 1 reply; 9+ messages in thread
From: Dexuan Cui @ 2022-11-12 0:58 UTC (permalink / raw)
To: Lorenzo Pieralisi, wei.liu
Cc: quic_jhugo, quic_carlv, KY Srinivasan, Haiyang Zhang, sthemmin,
bhelgaas, linux-hyperv, linux-pci, linux-kernel,
Michael Kelley (LINUX),
robh, kw, helgaas, alex.williamson, boqun.feng, Boqun Feng
> From: Lorenzo Pieralisi <lpieralisi@kernel.org>
> Sent: Friday, November 11, 2022 1:56 AM
> > ...
> > + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
> > + * to always return the same dummy vCPU, because a second call to
> > + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
> > + * new pCPU for the interrupt. But for the multi-MSI case, the second call to
> > + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the
>
> Why ? Can't you fix _that_ ? Why can't the initial call to
> hv_compose_msi_msg() determine the _real_ target vCPU ?
Unfortunately I can't fix this because of the way it works in x86's irq management
code. This is out of the control of the pci-hyperv driver.
hv_compose_msi_msg() uses irq_data_get_effective_affinity_mask() to get
the "effective"mask.
On x86, when the irq is initialized, irq_data_update_effective_affinity() is
called from apic_update_irq_cfg() -- please refer to the below debug code.
When the initial call to hv_compose_msi_msg() is invoked, it's from
pci_alloc_irq_vectors(), and the x86 irq code always passes CPU0 to pci-hyperv.
Please see the below "cpumask_first(cpu_online_mask)" in
vector_assign_managed_shutdown().
When hv_compose_msi_msg() is invoked the second time, it's from
request_irq(), and the x86 irq code passes the real effective CPU to pci-hyperv.
I added the below debug code and pasted the trimmed output below.
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -179,6 +179,7 @@ static void vector_assign_managed_shutdown(struct irq_data *irqd)
unsigned int cpu = cpumask_first(cpu_online_mask);
apic_update_irq_cfg(irqd, MANAGED_IRQ_SHUTDOWN_VECTOR, cpu);
+ WARN(irqd->irq >= 24, "cdx: vector_assign_managed_shutdown: irq=%d, cpu=%d\n", irqd->irq, cpu);
}
static int reserve_managed_vector(struct irq_data *irqd)
@@ -251,6 +252,7 @@ assign_vector_locked(struct irq_data *irqd, const struct cpumask *dest)
return vector;
apic_update_vector(irqd, vector, cpu);
apic_update_irq_cfg(irqd, vector, cpu);
+ WARN(irqd->irq >= 24, "cdx: assign_vector_locked: irq=%d, cpu=%d\n", irqd->irq, cpu);
return 0;
}
@@ -328,6 +330,7 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
return vector;
apic_update_vector(irqd, vector, cpu);
apic_update_irq_cfg(irqd, vector, cpu);
+ WARN(irqd->irq >= 24, "cdx: assign_managed_vector: irq=%d, cpu=%d\n", irqd->irq, cpu);
return 0;
}
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1803,6 +1803,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
u32 size;
int ret;
+ WARN(1, "cdx: hv_compose_msi_msg: irq=%d\n", data->irq);
/* Reuse the previous allocation */
if (data->chip_data) {
int_desc = data->chip_data;
1 cdx: vector_assign_managed_shutdown: irq=24, cpu=0
2 WARNING: CPU: 3 PID: 87 at arch/x86/kernel/apic/vector.c:182 vector_assign_managed_shutdown+0x53/0x60
3 RIP: 0010:vector_assign_managed_shutdown+0x53/0x60
4 reserve_irq_vector_locked+0x41/0xa0
5 x86_vector_alloc_irqs+0x298/0x460
6 irq_domain_alloc_irqs_hierarchy+0x1b/0x50
7 irq_domain_alloc_irqs_parent+0x17/0x30
8 msi_domain_alloc+0x83/0x150
9 irq_domain_alloc_irqs_hierarchy+0x1b/0x50
10 __irq_domain_alloc_irqs+0xdf/0x320
11 __msi_domain_alloc_irqs+0x103/0x3e0
12 msi_domain_alloc_irqs_descs_locked+0x3e/0x90
13 pci_msi_setup_msi_irqs+0x2d/0x40
14 __pci_enable_msix_range+0x2fd/0x420
15 pci_alloc_irq_vectors_affinity+0xb0/0x110
16 nvme_reset_work+0x1cf/0x1170
17 process_one_work+0x21f/0x3f0
18 worker_thread+0x4a/0x3c0
19 kthread+0xff/0x130
20 ret_from_fork+0x22/0x30
21
22 cdx: vector_assign_managed_shutdown: irq=24, cpu=0
23 WARNING: CPU: 3 PID: 87 at arch/x86/kernel/apic/vector.c:182 vector_assign_managed_shutdown+0x53/0x60
24 RIP: 0010:vector_assign_managed_shutdown+0x53/0x60
25 x86_vector_activate+0x149/0x1e0
26 __irq_domain_activate_irq+0x58/0x90
27 __irq_domain_activate_irq+0x38/0x90
28 irq_domain_activate_irq+0x2d/0x50
29 __msi_domain_alloc_irqs+0x1bb/0x3e0
30 msi_domain_alloc_irqs_descs_locked+0x3e/0x90
31 pci_msi_setup_msi_irqs+0x2d/0x40
32 __pci_enable_msix_range+0x2fd/0x420
33 pci_alloc_irq_vectors_affinity+0xb0/0x110
34 nvme_reset_work+0x1cf/0x1170
35 process_one_work+0x21f/0x3f0
36 worker_thread+0x4a/0x3c0
37 kthread+0xff/0x130
38 ret_from_fork+0x22/0x30
39
40
41 cdx: hv_compose_msi_msg: irq=24
42 WARNING: CPU: 3 PID: 87 at drivers/pci/controller/pci-hyperv.c:1806 hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
43 RIP: 0010:hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
44 irq_chip_compose_msi_msg+0x32/0x50
45 msi_domain_activate+0x45/0xa0
46 __irq_domain_activate_irq+0x58/0x90
47 irq_domain_activate_irq+0x2d/0x50
48 __msi_domain_alloc_irqs+0x1bb/0x3e0
49 msi_domain_alloc_irqs_descs_locked+0x3e/0x90
50 pci_msi_setup_msi_irqs+0x2d/0x40
51 __pci_enable_msix_range+0x2fd/0x420
52 pci_alloc_irq_vectors_affinity+0xb0/0x110
53 nvme_reset_work+0x1cf/0x1170
54 process_one_work+0x21f/0x3f0
55 worker_thread+0x4a/0x3c0
56 kthread+0xff/0x130
57 ret_from_fork+0x22/0x30
58
59
60
61 cdx: assign_vector_locked: irq=24, cpu=3
62 WARNING: CPU: 0 PID: 87 at arch/x86/kernel/apic/vector.c:255 assign_vector_locked+0x160/0x170
63 RIP: 0010:assign_vector_locked+0x160/0x170
64 assign_irq_vector_any_locked+0x6a/0x150
65 x86_vector_activate+0x93/0x1e0
66 __irq_domain_activate_irq+0x58/0x90
67 __irq_domain_activate_irq+0x38/0x90
68 irq_domain_activate_irq+0x2d/0x50
69 irq_activate+0x29/0x30
70 __setup_irq+0x2e5/0x780
71 request_threaded_irq+0x112/0x180
72 pci_request_irq+0xa3/0xf0
73 queue_request_irq+0x74/0x80
74 nvme_reset_work+0x408/0x1170
75 process_one_work+0x21f/0x3f0
76 worker_thread+0x4a/0x3c0
77 kthread+0xff/0x130
78 ret_from_fork+0x22/0x30
79
80 cdx: hv_compose_msi_msg: irq=24
81 WARNING: CPU: 0 PID: 87 at drivers/pci/controller/pci-hyperv.c:1806 hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
82 RIP: 0010:hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
83 irq_chip_compose_msi_msg+0x32/0x50
84 msi_domain_activate+0x45/0xa0
85 __irq_domain_activate_irq+0x58/0x90
86 irq_domain_activate_irq+0x2d/0x50
87 irq_activate+0x29/0x30
88 __setup_irq+0x2e5/0x780
89 request_threaded_irq+0x112/0x180
90 pci_request_irq+0xa3/0xf0
91 queue_request_irq+0x74/0x80
92 nvme_reset_work+0x408/0x1170
93 process_one_work+0x21f/0x3f0
94 worker_thread+0x4a/0x3c0
95 kthread+0xff/0x130
96 ret_from_fork+0x22/0x30
> > + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
> > + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
> > + * the same pCPU, even though the vCPUs will be spread out by later calls
> > + * to hv_irq_unmask(), but that is the best we can do now.
> > + *
> > + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not*
>
> "current" Hyper-V means nothing, remove it or provide versioning
> information. Imagine yourself reading this comment some time
> in the future.
Good point. @Wei, can you please help fix this? I think we can change
"With current Hyper-V"
To
"With Hyper-V in Nov 2022".
BTW, it's hard to provide the exact versioning info, because technically
there are many versions of Hyper-V...
> I can't claim to understand how this MSI vCPU to pCPU mapping is made to
> work in current code but I can't ack this patch sorry, if you feel like
> it is good to merge it it is your and Hyper-V maintainers call, feel
> free to go ahead - I can review PCI hyper-V changes that affect PCI
> and IRQs core APIs, I don't know Hyper-V internals.
>
> Lorenzo
I understand. Thanks!
I discussed the issue with the Hyper-V team. I believe I understood it and
this patch is the right thing to have.
@Wei, Bjorn spotted two typos in the commit message, and Lorenzo
suggested a change to the above "current". Can you please help
fix these and merge the patch? Or, let me know if it'd be easier for me
to send v4.
Thanks,
Dexuan
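
The call sequence traced above can be condensed into a skeletal driver sketch:
pci_alloc_irq_vectors() activates the IRQs, triggering the first
hv_compose_msi_msg() call with the dummy vCPU, and pci_request_irq() triggers
the second composition with the real effective affinity. The setup function
and handler below are hypothetical, shown only to mark the two composition
points:

#include <linux/pci.h>
#include <linux/interrupt.h>

static irqreturn_t demo_irq_handler(int irq, void *data)
{
	return IRQ_HANDLED;
}

static int demo_setup_irqs(struct pci_dev *pdev)
{
	int nvec, i, ret;

	/*
	 * Activation path: hv_compose_msi_msg() runs here with the dummy
	 * vCPU (CPU 0 in the trace above).
	 */
	nvec = pci_alloc_irq_vectors(pdev, 1, 8, PCI_IRQ_MSIX);
	if (nvec < 0)
		return nvec;

	for (i = 0; i < nvec; i++) {
		/*
		 * request_irq() path: hv_compose_msi_msg() is called again,
		 * this time with the real effective affinity, so Hyper-V can
		 * pick a new pCPU (the multi-MSI case exits early instead).
		 */
		ret = pci_request_irq(pdev, i, demo_irq_handler, NULL,
				      pdev, "demo-vec%d", i);
		if (ret)
			return ret;
	}
	return 0;
}
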
* Re: [PATCH v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI
2022-11-12 0:58 ` Dexuan Cui
@ 2022-11-12 12:43 ` Wei Liu
0 siblings, 0 replies; 9+ messages in thread
From: Wei Liu @ 2022-11-12 12:43 UTC (permalink / raw)
To: Dexuan Cui
Cc: Lorenzo Pieralisi, wei.liu, quic_jhugo, quic_carlv,
KY Srinivasan, Haiyang Zhang, sthemmin, bhelgaas, linux-hyperv,
linux-pci, linux-kernel, Michael Kelley (LINUX),
robh, kw, helgaas, alex.williamson, boqun.feng, Boqun Feng
On Sat, Nov 12, 2022 at 12:58:33AM +0000, Dexuan Cui wrote:
[...]
>
> I discussed the issue with Hyper-V team. I believe I understood it and
> this patch is the right thing to have.
>
> @Wei, Bjorn spotted two typos in the commit message, and Lorenzo
> suggested a change to the above "current". Can you please help
> fix these and merge the patch? Or, let me know if it'd be easier if
> I should send v4.
All fixed and patch applied to hyperv-fixes.
Thanks,
Wei.
>
> Thanks,
> Dexuan
>