ntb.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: Logan Gunthorpe <logang@deltatee.com>,
	LKML <linux-kernel@vger.kernel.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>, Marc Zygnier <maz@kernel.org>,
	Alex Williamson <alex.williamson@redhat.com>,
	Kevin Tian <kevin.tian@intel.com>,
	Jason Gunthorpe <jgg@nvidia.com>, Megha Dey <megha.dey@intel.com>,
	Ashok Raj <ashok.raj@intel.com>,
	linux-pci@vger.kernel.org,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Jon Mason <jdmason@kudzu.us>, Dave Jiang <dave.jiang@intel.com>,
	Allen Hubbe <allenbh@gmail.com>,
	linux-ntb@googlegroups.com, linux-s390@vger.kernel.org,
	Heiko Carstens <hca@linux.ibm.com>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	x86@kernel.org
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
Date: Tue, 30 Nov 2021 01:29:39 +0100	[thread overview]
Message-ID: <87ilwacwp8.ffs@tglx> (raw)
In-Reply-To: <6ba084d6-2b26-7c86-4526-8fcd3d921dfd@deltatee.com>

Logan,

On Mon, Nov 29 2021 at 15:27, Logan Gunthorpe wrote:
> On 2021-11-29 1:51 p.m., Thomas Gleixner wrote:
>> The switchtec driver is the only one which uses PCI_IRQ_VIRTUAL in order
>> to allocate non-hardware backed MSI-X descriptors.
>> 
>> AFAIU these descriptors are not MSI-X descriptors in the regular sense
>> of PCI/MSI-X. They are allocated via the PCI/MSI mechanism but their
>> usage is somewhere in NTB which has nothing to do with the way how the
>> real MSI-X interrupts of a device work which explains why we have those
>> pci.msi_attrib.is_virtual checks all over the place.
>> 
>> I assume that there are other variants feeding into NTB which can handle
>> that without this PCI_IRQ_VIRTUAL quirk, but TBH, I got completely lost
>> in that code.
>> 
>> Could you please shed some light on the larger picture of this?
>
> Yes, of course. I'll try to explain:
>
> The NTB code here is trying to create an MSI interrupt that is not
> triggered by the PCI device itself but from a peer behind the
> Non-Transparent Bridge (or, more carefully: from the CPU's perspective
> the interrupt will come from the PCI device, but nothing in the PCI
> device's firmware or hardware will have triggered the interrupt).

That far I got.

> In most cases, the NTB code needs more interrupts than the hardware
> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
> for: it allows the driver to request more interrupts than the hardware
> advertises (ie. pci_msix_vec_count()). These extra interrupts are
> created, but get flagged with msi_attrib.is_virtual which ensures
> functions that program the MSI-X table don't try to write past the end
> of the hardware's table.
>
> The NTB code in drivers/ntb/msi.c uses these virtual MSI-X interrupts.
> (Or rather it can use any interrupt: it doesn't care whether its virtual
> or not, it would be fine if it is just a spare interrupt in hardware,
> but in practice, it will usually be a virtual one). The code uses these
> interrupts by setting up a memory window in the bridge to cover the MMIO
> addresses of MSI-X interrupts. It communicates the offsets of the
> interrupts (and the MSI message data) to the peer so that the peer can
> trigger the interrupt simply by writing the message data to its side of
> the memory window. (In the code: ntbm_msi_request_threaded_irq() is
> called to request and interrupt which fills in the ntb_msi_desc with the
> offset and data, which is transferred to the peer which would then use
> ntb_msi_peer_trigger() to trigger the interrupt.)

So the whole thing looks like this:

    PCIe
     |
     |   ________________________
     |   |                       |
     |   | NTB                   |
     |   |                       |
     |   | PCI config space      |
     |   |     MSI-X space       |
     |   |_______________________|
     |---|                       |
         | Memory window A       |
         | Memory window ..      |
         | Memory window N       |
         |_______________________|

The peers can inject an interrupt through the associated memory window
like the NTB device itself does via the real MSI-X interrupts by writing
the associated MSI message data to the associated address (either
directly to the targeted APIC or to the IOMMU table).

As this message goes through the NTB device it's tagged as originating
from the NTB PCIe device as seen by the IOMMU/Interrupt remapping unit.

So far so good.

I completely understand why you went down the road to add this
PCI_IRQ_VIRTUAL "feature", but TBH this should have never happened.

Why?

These interrupts have absolutely nothing to do with PCI/MSI-X as defined
by the spec and as handled by the PCI/MSI core.

The fact that the MSI message is transported by PCIe does not change
that at all, neither does it matter that from an IOMMU/Interrupt
remapping POV these messages are tagged to come from that particular
PCIe device.

At the conceptual level these interrupts are in separate irq domains:

     |   _______________________
     |   |                      |
     |   | NTB                  |
     |   |                      |
     |   | PCI config space     |
     |   |     MSI-X space      | <- #1 Global or per IOMMU zone PCI/MSI domain
     |   |_____________________ |
     |---|                      |
         | Memory window A      |
         | Memory window ..     | <- #2 Per device NTB domain
         | Memory window N      |
         |______________________|

You surely can see the disctinction between #1 and #2, right?

And because they are in different domains, they simply cannot share the
interrupt chip implementation taking care of the particular oddities of
the "hardware". I know, you made it 'shared' by adding these
'is_virtual' conditionals all over the place, but that pretty much
defeats the purpose of having separate domains.

This is also reflected in drivers/ntb/msi.c::ntbm_msi_request_threaded_irq():

	for_each_pci_msi_entry(entry, ntb->pdev) {
		if (irq_has_action(entry->irq))
			continue;

What on earth guarantees that this check has any relevance? Just because
an entry does not have an interrupt requested on it does not tell
anything.

That might be an actual entry which belongs to the physical PCI NTB
device MSI-X space which is not requested yet. Just because that
swichtec device does not fall into that category does not make it any
more correct.

Seriously, infrastructure code which relies on undocumented assumptions
based on a particular hardware device is broken by definition.

I'm way too tired to come up with a proper solution for that, but that
PCI_IRQ_VIRTUAL has to die ASAP.

Thanks,

        tglx


  parent reply	other threads:[~2021-11-30  0:29 UTC|newest]

Thread overview: 141+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
2021-11-27  1:23 ` [patch 01/32] genirq/msi: Move descriptor list to struct msi_device_data Thomas Gleixner
2021-11-27 12:19   ` Greg Kroah-Hartman
2021-11-27  1:23 ` [patch 02/32] genirq/msi: Add mutex for MSI list protection Thomas Gleixner
2021-11-27  1:23 ` [patch 03/32] genirq/msi: Provide msi_domain_alloc/free_irqs_descs_locked() Thomas Gleixner
2021-11-27  1:23 ` [patch 04/32] genirq/msi: Provide a set of advanced MSI accessors and iterators Thomas Gleixner
2021-11-28  1:00   ` Jason Gunthorpe
2021-11-28 19:22     ` Thomas Gleixner
2021-11-29  9:26       ` Thomas Gleixner
2021-11-29 14:01         ` Jason Gunthorpe
2021-11-29 14:46           ` Thomas Gleixner
2021-11-27  1:23 ` [patch 05/32] genirq/msi: Provide msi_alloc_msi_desc() and a simple allocator Thomas Gleixner
2021-11-27  1:23 ` [patch 06/32] genirq/msi: Provide domain flags to allocate/free MSI descriptors automatically Thomas Gleixner
2021-11-27  1:23 ` [patch 07/32] genirq/msi: Count the allocated MSI descriptors Thomas Gleixner
2021-11-27 12:19   ` Greg Kroah-Hartman
2021-11-27 19:22     ` Thomas Gleixner
2021-11-27 19:45       ` Thomas Gleixner
2021-11-28 11:07         ` Greg Kroah-Hartman
2021-11-28 19:23           ` Thomas Gleixner
2021-11-27  1:23 ` [patch 08/32] PCI/MSI: Protect MSI operations Thomas Gleixner
2021-11-27  1:23 ` [patch 09/32] PCI/MSI: Use msi_add_msi_desc() Thomas Gleixner
2021-11-27  1:23 ` [patch 10/32] PCI/MSI: Let core code free MSI descriptors Thomas Gleixner
2021-11-27  1:23 ` [patch 11/32] PCI/MSI: Use msi_on_each_desc() Thomas Gleixner
2021-11-27  1:23 ` [patch 12/32] x86/pci/xen: Use msi_for_each_desc() Thomas Gleixner
2021-11-27  1:23 ` [patch 13/32] xen/pcifront: Rework MSI handling Thomas Gleixner
2021-11-27  1:23 ` [patch 14/32] s390/pci: Rework MSI descriptor walk Thomas Gleixner
2021-11-29 10:31   ` Niklas Schnelle
2021-11-29 13:04     ` Thomas Gleixner
2021-11-27  1:23 ` [patch 15/32] powerpc/4xx/hsta: Rework MSI handling Thomas Gleixner
2021-11-27  1:23 ` [patch 16/32] powerpc/cell/axon_msi: Convert to msi_on_each_desc() Thomas Gleixner
2021-11-27  1:23 ` [patch 17/32] powerpc/pasemi/msi: Convert to msi_on_each_dec() Thomas Gleixner
2021-11-27  1:23 ` [patch 18/32] powerpc/fsl_msi: Use msi_for_each_desc() Thomas Gleixner
2021-11-27  1:23 ` [patch 19/32] powerpc/mpic_u3msi: Use msi_for_each-desc() Thomas Gleixner
2021-11-27  1:24 ` [patch 20/32] PCI: hv: Rework MSI handling Thomas Gleixner
2021-11-27  1:24 ` [patch 21/32] NTB/msi: Convert to msi_on_each_desc() Thomas Gleixner
2021-11-29 18:21   ` Logan Gunthorpe
2021-11-29 20:51     ` Thomas Gleixner
2021-11-29 22:27       ` Logan Gunthorpe
2021-11-29 22:50         ` Dave Jiang
2021-11-29 23:31         ` Jason Gunthorpe
2021-11-29 23:52           ` Logan Gunthorpe
2021-11-30  0:01             ` Jason Gunthorpe
2021-11-30  0:29         ` Thomas Gleixner [this message]
2021-11-30 19:21           ` Logan Gunthorpe
2021-11-30 19:48             ` Thomas Gleixner
2021-11-30 20:14               ` Logan Gunthorpe
2021-11-30 20:28               ` Jason Gunthorpe
2021-11-30 21:23                 ` Thomas Gleixner
2021-12-01  0:17                   ` Jason Gunthorpe
2021-12-01 10:16                     ` Thomas Gleixner
2021-12-01 13:00                       ` Jason Gunthorpe
2021-12-01 17:35                         ` Thomas Gleixner
2021-12-01 18:14                           ` Jason Gunthorpe
2021-12-01 18:46                             ` Logan Gunthorpe
2021-12-01 20:21                             ` Thomas Gleixner
2021-12-02  0:01                               ` Thomas Gleixner
2021-12-02 13:55                                 ` Jason Gunthorpe
2021-12-02 14:23                                   ` Greg Kroah-Hartman
2021-12-02 14:45                                     ` Jason Gunthorpe
2021-12-02 19:25                                   ` Thomas Gleixner
2021-12-02 20:00                                     ` Jason Gunthorpe
2021-12-02 22:31                                       ` Thomas Gleixner
2021-12-03  0:37                                         ` Jason Gunthorpe
2021-12-03 15:07                                           ` Thomas Gleixner
2021-12-03 16:41                                             ` Jason Gunthorpe
2021-12-04 14:20                                               ` Thomas Gleixner
2021-12-05 14:16                                                 ` Thomas Gleixner
2021-12-06 14:43                                                   ` Jason Gunthorpe
2021-12-06 15:47                                                     ` Thomas Gleixner
2021-12-06 17:00                                                       ` Jason Gunthorpe
2021-12-06 20:28                                                         ` Thomas Gleixner
2021-12-06 21:06                                                           ` Jason Gunthorpe
2021-12-06 22:21                                                             ` Thomas Gleixner
2021-12-06 14:19                                                 ` Jason Gunthorpe
2021-12-06 15:06                                                   ` Thomas Gleixner
2021-12-09  6:26                                               ` Tian, Kevin
2021-12-09  9:03                                                 ` Thomas Gleixner
2021-12-09 12:17                                                   ` Tian, Kevin
2021-12-09 15:57                                                     ` Thomas Gleixner
2021-12-10  7:37                                                       ` Tian, Kevin
2021-12-09  5:41                                   ` Tian, Kevin
2021-12-09  5:47                                     ` Jason Wang
2021-12-01 16:28                       ` Dave Jiang
2021-12-01 18:41                         ` Thomas Gleixner
2021-12-01 18:47                           ` Dave Jiang
2021-12-01 20:25                             ` Thomas Gleixner
2021-12-01 21:21                               ` Dave Jiang
2021-12-01 21:44                                 ` Thomas Gleixner
2021-12-01 21:49                                   ` Dave Jiang
2021-12-01 22:03                                     ` Thomas Gleixner
2021-12-01 22:53                                       ` Dave Jiang
2021-12-01 23:57                                         ` Thomas Gleixner
2021-12-09  5:23                                   ` Tian, Kevin
2021-12-09  8:37                                     ` Thomas Gleixner
2021-12-09 12:31                                       ` Tian, Kevin
2021-12-09 16:21                                       ` Jason Gunthorpe
2021-12-09 20:32                                         ` Thomas Gleixner
2021-12-09 20:58                                           ` Jason Gunthorpe
2021-12-09 22:09                                             ` Thomas Gleixner
2021-12-10  0:26                                               ` Thomas Gleixner
2021-12-10  7:29                                                 ` Tian, Kevin
2021-12-10 12:13                                                   ` Thomas Gleixner
2021-12-11  8:06                                                     ` Tian, Kevin
2021-12-10 12:39                                                   ` Jason Gunthorpe
2021-12-10 19:00                                                     ` Thomas Gleixner
2021-12-11  7:44                                                       ` Tian, Kevin
2021-12-11 13:04                                                         ` Thomas Gleixner
2021-12-12  1:56                                                           ` Tian, Kevin
2021-12-12 20:55                                                             ` Thomas Gleixner
2021-12-12 23:37                                                               ` Jason Gunthorpe
2021-12-13  7:50                                                                 ` Tian, Kevin
2021-12-11  7:52                                                     ` Tian, Kevin
2021-12-12  0:12                                                       ` Thomas Gleixner
2021-12-12  2:14                                                         ` Tian, Kevin
2021-12-12 20:50                                                           ` Thomas Gleixner
2021-12-12 23:42                                                         ` Jason Gunthorpe
2021-12-10  7:36                                             ` Tian, Kevin
2021-12-10 12:30                                               ` Jason Gunthorpe
2021-12-12  6:44                                               ` Mika Penttilä
2021-12-12 23:27                                                 ` Jason Gunthorpe
2021-12-01 14:52                   ` Thomas Gleixner
2021-12-01 15:11                     ` Jason Gunthorpe
2021-12-01 18:37                       ` Thomas Gleixner
2021-12-01 18:47                         ` Jason Gunthorpe
2021-12-01 20:26                           ` Thomas Gleixner
2021-11-27  1:24 ` [patch 22/32] soc: ti: ti_sci_inta_msi: Rework MSI descriptor allocation Thomas Gleixner
2021-11-27  1:24 ` [patch 23/32] soc: ti: ti_sci_inta_msi: Remove ti_sci_inta_msi_domain_free_irqs() Thomas Gleixner
2021-11-27  1:24 ` [patch 24/32] bus: fsl-mc-msi: Simplify MSI descriptor handling Thomas Gleixner
2021-11-27  1:24 ` [patch 25/32] platform-msi: Let core code handle MSI descriptors Thomas Gleixner
2021-11-27  1:24 ` [patch 26/32] platform-msi: Simplify platform device MSI code Thomas Gleixner
2021-11-27  1:24 ` [patch 27/32] genirq/msi: Make interrupt allocation less convoluted Thomas Gleixner
2021-11-27  1:24 ` [patch 28/32] genirq/msi: Convert to new functions Thomas Gleixner
2021-11-27  1:24 ` [patch 29/32] genirq/msi: Mop up old interfaces Thomas Gleixner
2021-11-27  1:24 ` [patch 30/32] genirq/msi: Add abuse prevention comment to msi header Thomas Gleixner
2021-11-27  1:24 ` [patch 31/32] genirq/msi: Simplify sysfs handling Thomas Gleixner
2021-11-27 12:32   ` Greg Kroah-Hartman
2021-11-27 19:31     ` Thomas Gleixner
2021-11-28 11:07       ` Greg Kroah-Hartman
2021-11-28 19:33         ` Thomas Gleixner
2021-11-27  1:24 ` [patch 32/32] genirq/msi: Convert storage to xarray Thomas Gleixner
2021-11-27 12:33   ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ilwacwp8.ffs@tglx \
    --to=tglx@linutronix.de \
    --cc=alex.williamson@redhat.com \
    --cc=allenbh@gmail.com \
    --cc=ashok.raj@intel.com \
    --cc=borntraeger@de.ibm.com \
    --cc=dave.jiang@intel.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hca@linux.ibm.com \
    --cc=helgaas@kernel.org \
    --cc=jdmason@kudzu.us \
    --cc=jgg@nvidia.com \
    --cc=kevin.tian@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-ntb@googlegroups.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=logang@deltatee.com \
    --cc=maz@kernel.org \
    --cc=megha.dey@intel.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).