linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* x86 multi-msi w/o iommu
@ 2022-04-08 17:34 Jeffrey Hugo
  2022-04-09 20:08 ` Thomas Gleixner
  0 siblings, 1 reply; 3+ messages in thread
From: Jeffrey Hugo @ 2022-04-08 17:34 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Marc Zyngier, linux-kernel

Hi Thomas,

I'd like to get multi-MSI without IOMMU working on x86.  I'm hoping you 
could help me understand the current state of things, and if there is a 
path toward enabling this.

This is an overly simplistic assessment, but this reportedly works in 
other OSes, so it looks like Linux is broken in comparison.

In my investigation so far, the failure stems from 
x86_vector_alloc_irqs() in arch/x86/kernel/apic/vector.c [1].  If we 
need a contiguous allocation of more than one irq, the allocation 
immediately fails:

	/* Currently vector allocator can't guarantee contiguous allocations */
	if ((info->flags & X86_IRQ_ALLOC_CONTIGUOUS_VECTORS) && nr_irqs > 1)
		return -ENOSYS;


As I'm sure you are aware, this only impacts MSI without IOMMU as both 
MSI-X and MSI with IOMMU can handle a discontinuous allocation (the flag 
is not set for either of those cases).

That check was added back in 2015 with [2].  In 2017, it looks like you 
refactored the allocator to the irq_matrix component [3].  However, the 
limitation remains to today.

Digging a bit further, it looks like the internal function 
matrix_alloc_area() [4] is capable of doing a contiguous allocation, but 
all the callers of that via the public api hardcode num to 1.  I 
wouldn't say it would be trivial to refactor the irq_matrix public api 
to do a contigious range allocation, and refactor 
x86_vector_alloc_irqs() to do that based on 
X86_IRQ_ALLOC_CONTIGUOUS_VECTORS, but since it seems like that hasn't 
been tackled in 5-7 years (depending on how you look at the history), I 
suspect I'm missing something.

Since I'm poking my nose in an area of the kernel that I haven't gone 
poking into before, I'm guessing there is a lot I don't know.  Would you 
kindly educate me on what the concerns might be?

I look forward to your expertise and opinions.

Thanks

-Jeff



[1]: 
https://elixir.bootlin.com/linux/v5.18-rc1/source/arch/x86/kernel/apic/vector.c#L542
[2]: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kernel/apic/vector.c?h=v5.18-rc1&id=b5dc8e6c21e7ffba0246bf39cea97805c142bf85
[3]: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kernel/apic/vector.c?h=v5.18-rc1&id=69cde0004a4b5cfc7d1cec4ef9ce4cf4e26142f0
[4]: 
https://elixir.bootlin.com/linux/v5.18-rc1/source/kernel/irq/matrix.c#L110

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: x86 multi-msi w/o iommu
  2022-04-08 17:34 x86 multi-msi w/o iommu Jeffrey Hugo
@ 2022-04-09 20:08 ` Thomas Gleixner
  2022-04-11 16:35   ` Jeffrey Hugo
  0 siblings, 1 reply; 3+ messages in thread
From: Thomas Gleixner @ 2022-04-09 20:08 UTC (permalink / raw)
  To: Jeffrey Hugo; +Cc: Marc Zyngier, linux-kernel

Jeffrey!

On Fri, Apr 08 2022 at 11:34, Jeffrey Hugo wrote:
> I'd like to get multi-MSI without IOMMU working on x86. I'm hoping you
> could help me understand the current state of things, and if there is
> a path toward enabling this.

Hope is definitely required here, but you did not explain yet WHY you
want that to work. Making it work just because is not really a good
starting point.

> This is an overly simplistic assessment, but this reportedly works in 
> other OSes, so it looks like Linux is broken in comparison.
>
> In my investigation so far, the failure stems from
> x86_vector_alloc_irqs() in arch/x86/kernel/apic/vector.c [1].

The failure? That's not a failure, that's a deliberate decision.

Side note: Please spare us the links to random code sites and git web
interfaces. A function name and a commit id is good enough.

> If we need a contiguous allocation of more than one irq, the
> allocation immediately fails:
>
> 	/* Currently vector allocator can't guarantee contiguous allocations */
> 	if ((info->flags & X86_IRQ_ALLOC_CONTIGUOUS_VECTORS) && nr_irqs > 1)
> 		return -ENOSYS;
>
> As I'm sure you are aware, this only impacts MSI without IOMMU as both 
> MSI-X and MSI with IOMMU can handle a discontinuous allocation (the flag 
> is not set for either of those cases).
>
> That check was added back in 2015 with [2].

X86 never supported multi-MSI in Linux. See:

    commit 1c8d7b0a562d ("PCI MSI: Add support for multiple MSI")

> In 2017, it looks like you refactored the allocator to the irq_matrix
> component [3].  However, the limitation remains to today.

For very good reasons.

> Digging a bit further, it looks like the internal function 
> matrix_alloc_area() [4] is capable of doing a contiguous allocation, but 
> all the callers of that via the public api hardcode num to 1.  I 
> wouldn't say it would be trivial to refactor the irq_matrix public api 
> to do a contigious range allocation, and refactor 
> x86_vector_alloc_irqs() to do that based on 
> X86_IRQ_ALLOC_CONTIGUOUS_VECTORS, but since it seems like that hasn't 
> been tackled in 5-7 years (depending on how you look at the history), I 
> suspect I'm missing something.

There are two fundamental issues with multi-MSI:

   1) General

      a) Multi-MSI is a single MSI message. The device sets the vector
         offset in the lower bits of message data.

      b) As a consequence the Multi-MSI interrupts of a given device are
         all affine to a single target and you cannot set affinity for
         them separately, which limits the usefulness very much.

   2) x86

      a) Due to #1a this requires N consecutive vectors on one CPU and
         these vectors have to be aligned so that the device can set the
         vector offset into the lower bits of message data. The
         alignment requirement depends on the number of vectors and is
         always power of 2 in the range (2, 4, 8, 16, 32).

         Easy to do in theory. But in practice there is the limited
         vector space on x86 (~200) for device interrupts per CPU.

         That creates a problem for allocations in general and for CPU
         hotplug. On hotplug all active interrupts are moved away from
         the outgoing CPU. Due to the alignment requirement this is
         pretty much a guarantee for vector exhaustion.

      b) Changing interrupt affinity for MSI w/o IOMMU is an interesting
         exercise on X86 when the interrupt is not maskable, which is
         unfortunately the case in the majority of MSI hardware
         implementations.

         In that case while the new vectors are installed interrupts can
         be issued by the device. So you need to be very careful _not_
         to lose an interrupt in the case that both the message address
         (the target APIC) and the message data (the vector) are
         changed. You can find the gory details in:

                  arch/x86/kerne/apic/msi.c::msi_set_affinity()

         and the related change logs espescially:
         6f1a4891a592 ("x86/apic/msi: Plug non-maskable MSI affinity race")

         Due to #1b changing interrupt affinity has to move _all_
         vectors associated to the device at once, which makes this
         excercise even more interesting.

         This code is already horrible as hell and the thought alone to
         expand it for multi MSI makes me shudder.

Now with interrupt remapping all of the above goes away:

    - The alignment problem and consecutive space issue moves into
      the remap tables which have plenty of space.

    - Each interrupts affinity can be individually controlled because
      the affinity setting happens in the remap table and does not
      require the horrors of the non remapped case.

That means in Multi-MSI can be implemented on x86 w/o remapping, but is
it worth the trouble? From looking at the problem space and under
consideration that the advantage of multi-MSI is very limited the
decision was made to not support it. That's not broken as you claim,
that's a very reasonable technical decision.

From experimentation I know that multi-MSI w/o remapping has a very
limited benefit because it cannot provide scalability through
parallelism obviously. The only benefit is a clear separation of
interrupt functionality which spares a MMIO read to get the pending bits
and a few conditionals. Here is a trivial experiment to demonstrate the
benefit or the lack of it:

    Use a system with interrupt remapping and add a module parameter to
    your driver which allows you to select ONE or MANY MSIs. Then run
    performance tests with both setups.
    
The result might not be what you expect depending on the device and the
resulting interrupt patterns.

The problems of MSI are known since two decades, but hardware folks
still insist on it. MSI-X exists for a reason, but sure it makes the
chips $0.01 more expensive and lots of hardware people still work under
the assumption that whatever they get wrong can be fixed in software,
which is unfortunately wishful thinking.

I hope this clarifies it for you.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: x86 multi-msi w/o iommu
  2022-04-09 20:08 ` Thomas Gleixner
@ 2022-04-11 16:35   ` Jeffrey Hugo
  0 siblings, 0 replies; 3+ messages in thread
From: Jeffrey Hugo @ 2022-04-11 16:35 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Marc Zyngier, linux-kernel

On 4/9/2022 2:08 PM, Thomas Gleixner wrote:
> Jeffrey!
> 
> On Fri, Apr 08 2022 at 11:34, Jeffrey Hugo wrote:
>> I'd like to get multi-MSI without IOMMU working on x86. I'm hoping you
>> could help me understand the current state of things, and if there is
>> a path toward enabling this.
> 
> Hope is definitely required here, but you did not explain yet WHY you
> want that to work. Making it work just because is not really a good
> starting point.

Arguably Linux exists "just because", but its a fair question.

Someone made a request for this.  Trying to figure out the logistics of 
the request before responding to the requestor.

There is a particular device that due to hardware limitations (arguably 
mistakes were made) it only supports MSI and it requires more than 1 MSI 
to work.

While the device is documented as requiring an IOMMU on x86, the 
requesting entity wants to use the device without an IOMMU in a unusual 
setup and pointed it that it works (for some definition of working) on 
other Operating Systems.  So from their viewpoint is not a system issue, 
but a Linux issue.

Granted, I suspect you disagree with this being a Linux issue per the 
below.  I'm trying to understand all of the aspects.

I suspect this is not the level of detail you would like, but from my 
perspective parts of this situation are sensitive so I'm trying to 
proceed as best I can currently (arguably with one hand tied behind my 
back).  To be clear, this level of secrecy is unacceptable when working 
with the community and if I were to move forward to propose changes 
based on the information you've provided, I would be more transparent.

I greatly appreciate your response given the limitations in what I've 
provided.

>> This is an overly simplistic assessment, but this reportedly works in
>> other OSes, so it looks like Linux is broken in comparison.
>>
>> In my investigation so far, the failure stems from
>> x86_vector_alloc_irqs() in arch/x86/kernel/apic/vector.c [1].
> 
> The failure? That's not a failure, that's a deliberate decision.

I'm sorry, I choose my words poorly.  I was viewing things from the 
perspective of reproducing the reported "issue" and seeing why that is 
the case.  The device requests multiple MSIs and doesn't get them, which 
is a "failure" from the perspective that the expectation is things work.

I do however see that this is deliberate per the code, and thus would 
like to understand why.  Fundamentally, that is why I asked you for help.

 From one perspective, its a "failure", from another, its "working as 
intended".  I did not intend any offense, and I'm more interested with 
understanding the "why" than applying a specific label.

> Side note: Please spare us the links to random code sites and git web
> interfaces. A function name and a commit id is good enough.
> 
>> If we need a contiguous allocation of more than one irq, the
>> allocation immediately fails:
>>
>> 	/* Currently vector allocator can't guarantee contiguous allocations */
>> 	if ((info->flags & X86_IRQ_ALLOC_CONTIGUOUS_VECTORS) && nr_irqs > 1)
>> 		return -ENOSYS;
>>
>> As I'm sure you are aware, this only impacts MSI without IOMMU as both
>> MSI-X and MSI with IOMMU can handle a discontinuous allocation (the flag
>> is not set for either of those cases).
>>
>> That check was added back in 2015 with [2].
> 
> X86 never supported multi-MSI in Linux. See:
> 
>      commit 1c8d7b0a562d ("PCI MSI: Add support for multiple MSI")
> 
>> In 2017, it looks like you refactored the allocator to the irq_matrix
>> component [3].  However, the limitation remains to today.
> 
> For very good reasons.
> 
>> Digging a bit further, it looks like the internal function
>> matrix_alloc_area() [4] is capable of doing a contiguous allocation, but
>> all the callers of that via the public api hardcode num to 1.  I
>> wouldn't say it would be trivial to refactor the irq_matrix public api
>> to do a contigious range allocation, and refactor
>> x86_vector_alloc_irqs() to do that based on
>> X86_IRQ_ALLOC_CONTIGUOUS_VECTORS, but since it seems like that hasn't
>> been tackled in 5-7 years (depending on how you look at the history), I
>> suspect I'm missing something.
> 
> There are two fundamental issues with multi-MSI:
> 
>     1) General
> 
>        a) Multi-MSI is a single MSI message. The device sets the vector
>           offset in the lower bits of message data.
> 
>        b) As a consequence the Multi-MSI interrupts of a given device are
>           all affine to a single target and you cannot set affinity for
>           them separately, which limits the usefulness very much.
> 
>     2) x86
> 
>        a) Due to #1a this requires N consecutive vectors on one CPU and
>           these vectors have to be aligned so that the device can set the
>           vector offset into the lower bits of message data. The
>           alignment requirement depends on the number of vectors and is
>           always power of 2 in the range (2, 4, 8, 16, 32).
> 
>           Easy to do in theory. But in practice there is the limited
>           vector space on x86 (~200) for device interrupts per CPU.
> 
>           That creates a problem for allocations in general and for CPU
>           hotplug. On hotplug all active interrupts are moved away from
>           the outgoing CPU. Due to the alignment requirement this is
>           pretty much a guarantee for vector exhaustion.
> 
>        b) Changing interrupt affinity for MSI w/o IOMMU is an interesting
>           exercise on X86 when the interrupt is not maskable, which is
>           unfortunately the case in the majority of MSI hardware
>           implementations.
> 
>           In that case while the new vectors are installed interrupts can
>           be issued by the device. So you need to be very careful _not_
>           to lose an interrupt in the case that both the message address
>           (the target APIC) and the message data (the vector) are
>           changed. You can find the gory details in:
> 
>                    arch/x86/kerne/apic/msi.c::msi_set_affinity()
> 
>           and the related change logs espescially:
>           6f1a4891a592 ("x86/apic/msi: Plug non-maskable MSI affinity race")
> 
>           Due to #1b changing interrupt affinity has to move _all_
>           vectors associated to the device at once, which makes this
>           excercise even more interesting.
> 
>           This code is already horrible as hell and the thought alone to
>           expand it for multi MSI makes me shudder.
> 
> Now with interrupt remapping all of the above goes away:
> 
>      - The alignment problem and consecutive space issue moves into
>        the remap tables which have plenty of space.
> 
>      - Each interrupts affinity can be individually controlled because
>        the affinity setting happens in the remap table and does not
>        require the horrors of the non remapped case.
> 
> That means in Multi-MSI can be implemented on x86 w/o remapping, but is
> it worth the trouble? From looking at the problem space and under
> consideration that the advantage of multi-MSI is very limited the
> decision was made to not support it. That's not broken as you claim,
> that's a very reasonable technical decision.

Again, sorry I choose my words poorly.  I understand the summary of the 
issues which lead to the decision, which I did not previously understand 
before.  I think I need to dig into the references you provided to fully 
comprehend the situation and come to the same conclusion.

I think the above is what I was looking for.  I thank you for 
elaborating on the issues and pointing me toward where I can learn more.

>  From experimentation I know that multi-MSI w/o remapping has a very
> limited benefit because it cannot provide scalability through
> parallelism obviously. The only benefit is a clear separation of
> interrupt functionality which spares a MMIO read to get the pending bits
> and a few conditionals. Here is a trivial experiment to demonstrate the
> benefit or the lack of it:
> 
>      Use a system with interrupt remapping and add a module parameter to
>      your driver which allows you to select ONE or MANY MSIs. Then run
>      performance tests with both setups.
>      
> The result might not be what you expect depending on the device and the
> resulting interrupt patterns.

This is a fine suggestion and if it were possible, I would go forward 
with it rather than investigate understanding the issues and developing 
a solution to address them (under the assumption its possible, theory 
and reality are often different).

Sadly, as mentioned, device limitations prevent operation with just one 
MSI which is why I'm seeking to understand if multi-MSI with no IOMMU is 
possible in Linux.  As you point out below sometimes software is put in 
an unfortunate position ("between a rock and a hard place").  All I can 
do is evaluate the various options and pick the least bad one.  Right 
now, I don't know that $SUBJECT is the best option, but I think I can 
evaluate it and compare based on your assistance.

> The problems of MSI are known since two decades, but hardware folks
> still insist on it. MSI-X exists for a reason, but sure it makes the
> chips $0.01 more expensive and lots of hardware people still work under
> the assumption that whatever they get wrong can be fixed in software,
> which is unfortunately wishful thinking >
> I hope this clarifies it for you.

Yes, very much so.  Thank you!

-Jeff

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2022-04-11 16:35 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-08 17:34 x86 multi-msi w/o iommu Jeffrey Hugo
2022-04-09 20:08 ` Thomas Gleixner
2022-04-11 16:35   ` Jeffrey Hugo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).