* Re: [PATCH] x86/irq: Lower unhandled irq error severity
       [not found] <20201126074734.12664-1-lnicola@dend.ro>
@ 2020-11-27  0:12 ` Thomas Gleixner
  2020-11-27  8:03   ` Laurențiu Nicola
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2020-11-27  0:12 UTC
  To: Laurențiu Nicola; +Cc: mingo, bp, x86, trivial, LKML

On Thu, Nov 26 2020 at 09:47, Laurențiu Nicola wrote:
> These messages are described as warnings in the MSI code.

Where and what has MSI to do with these messages?

> Spotted because they break quiet boot on a Ryzen 5000 CPU.

They don't break the boot.

The machine boots fine, but having interrupts raised on a vector which
is unused is really bad.

Can you please provide the actual message from dmesg?

Thanks,

        tglx


* Re: [PATCH] x86/irq: Lower unhandled irq error severity
  2020-11-27  0:12 ` [PATCH] x86/irq: Lower unhandled irq error severity Thomas Gleixner
@ 2020-11-27  8:03   ` Laurențiu Nicola
  2020-11-30 16:56     ` Thomas Gleixner
  0 siblings, 1 reply; 11+ messages in thread
From: Laurențiu Nicola @ 2020-11-27  8:03 UTC
  To: Thomas Gleixner; +Cc: mingo, bp, x86, trivial, LKML

Hello,

On Fri, Nov 27, 2020, at 02:12, Thomas Gleixner wrote:
> On Thu, Nov 26 2020 at 09:47, Laurențiu Nicola wrote:
> > These messages are described as warnings in the MSI code.
> 
> Where and what has MSI to do with these messages?

There's a comment referring to it as a warning, but an error seemed a more appropriate severity:

     * If the vector is unused, then it is marked so it won't
     * trigger the 'No irq handler for vector' warning in
     * common_interrupt().

> > Spotted because they break quiet boot on a Ryzen 5000 CPU.
> 
> They don't break the boot.
> 
> The machine boots fine, but having interrupts raised on a vector which
> is unused is really bad.

That's right, sorry. It still boots, but it's no longer "quiet", that's what I meant.

> Can you please provide the actual message from dmesg?

Sure:

[    0.316902] __common_interrupt: 1.55 No irq handler for vector
[    0.316902] __common_interrupt: 2.55 No irq handler for vector
[    0.316902] __common_interrupt: 3.55 No irq handler for vector
[    0.316902] __common_interrupt: 4.55 No irq handler for vector
[    0.316902] __common_interrupt: 5.55 No irq handler for vector
[    0.316902] __common_interrupt: 6.55 No irq handler for vector
[    0.316902] __common_interrupt: 7.55 No irq handler for vector
[    0.316902] __common_interrupt: 8.55 No irq handler for vector
[    0.316902] __common_interrupt: 9.55 No irq handler for vector
[    0.316902] __common_interrupt: 10.55 No irq handler for vector

These only show up during boot (and not, e.g., when disabling and re-enabling a CPU).
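
For anyone collecting these lines from several machines, the prefix can be split into its two numbers. The following is a minimal Python sketch, assuming the prefix reads as CPU number, then vector number (which the message text suggests but the thread does not spell out); `parse_unhandled` and `summarize` are hypothetical helper names.

```python
import re

# Assumed layout: "[time] <function>: <cpu>.<vector> No irq handler for vector".
# Reading the pair as CPU-then-vector is an assumption, not stated in the thread.
PATTERN = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)\]\s+(?P<func>\w+):\s+"
    r"(?P<cpu>\d+)\.(?P<vector>\d+) No irq handler for vector$"
)


def parse_unhandled(line):
    """Return (cpu, vector) for a matching dmesg line, or None."""
    match = PATTERN.match(line.strip())
    if match is None:
        return None
    return int(match.group("cpu")), int(match.group("vector"))


def summarize(lines):
    """Group the CPUs that reported each vector."""
    by_vector = {}
    for line in lines:
        parsed = parse_unhandled(line)
        if parsed is not None:
            cpu, vector = parsed
            by_vector.setdefault(vector, []).append(cpu)
    return by_vector


sample = [
    "[    0.316902] __common_interrupt: 1.55 No irq handler for vector",
    "[    0.316902] __common_interrupt: 10.55 No irq handler for vector",
]
print(summarize(sample))
```

Under that reading, every line in the log above names the same vector (55) on a different CPU, and the number is a vector rather than a row label in /proc/interrupts.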

Laurențiu


* Re: [PATCH] x86/irq: Lower unhandled irq error severity
  2020-11-27  8:03   ` Laurențiu Nicola
@ 2020-11-30 16:56     ` Thomas Gleixner
  2020-11-30 17:22       ` Laurențiu Nicola
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2020-11-30 16:56 UTC
  To: Laurențiu Nicola; +Cc: mingo, bp, x86, trivial, LKML, Tom Lendacky

Laurentiu,

On Fri, Nov 27 2020 at 10:03, Laurențiu Nicola wrote:
> On Fri, Nov 27, 2020, at 02:12, Thomas Gleixner wrote:
>> On Thu, Nov 26 2020 at 09:47, Laurențiu Nicola wrote:
>> > These messages are described as warnings in the MSI code.
>> 
>> Where and what has MSI to do with these messages?
>
> There's a comment referring to it as a warning, but an error seemed a more appropriate severity:
>
>      * If the vector is unused, then it is marked so it won't
>      * trigger the 'No irq handler for vector' warning in
>      * common_interrupt().

That's a description for the logic in the MSI code which is required to
_NOT_ trigger the 'No irq handler' message. If that message appears then
something _is_ badly wrong. Either the kernel screwed up or something in
the BIOS/firmware/hardware is bonkers.

>> > Spotted because they break quiet boot on a Ryzen 5000 CPU.
>> 
>> They don't break the boot.
>> 
>> The machine boots fine, but having interrupts raised on a vector which
>> is unused is really bad.
>
> That's right, sorry. It still boots, but it's no longer "quiet",
> that's what I meant.

Right, but suppressing that is not a solution.

>> Can you please provide the actual message from dmesg?
>
> Sure:
>
> [    0.316902] __common_interrupt: 1.55 No irq handler for vector
> [    0.316902] __common_interrupt: 2.55 No irq handler for vector
> [    0.316902] __common_interrupt: 3.55 No irq handler for vector
> [    0.316902] __common_interrupt: 4.55 No irq handler for vector
> [    0.316902] __common_interrupt: 5.55 No irq handler for vector
> [    0.316902] __common_interrupt: 6.55 No irq handler for vector
> [    0.316902] __common_interrupt: 7.55 No irq handler for vector
> [    0.316902] __common_interrupt: 8.55 No irq handler for vector
> [    0.316902] __common_interrupt: 9.55 No irq handler for vector
> [    0.316902] __common_interrupt: 10.55 No irq handler for vector
>
> These only show up during boot (and not, e.g., when disabling and re-enabling a CPU).

That's the AMD plague which has been known for quite some time and it's pretty
much confirmed that it is a BIOS/firmware issue.

I don't know whether AMD has figured it out and told their OEMs what to
do about that or whether the OEMs just ignore it because windows ignores
it or is not affected for whatever reason.

Thanks,

        tglx




* Re: [PATCH] x86/irq: Lower unhandled irq error severity
  2020-11-30 16:56     ` Thomas Gleixner
@ 2020-11-30 17:22       ` Laurențiu Nicola
  2020-11-30 23:34         ` Thomas Gleixner
  0 siblings, 1 reply; 11+ messages in thread
From: Laurențiu Nicola @ 2020-11-30 17:22 UTC
  To: Thomas Gleixner; +Cc: mingo, bp, x86, trivial, LKML, Tom Lendacky



On Mon, Nov 30, 2020, at 18:56, Thomas Gleixner wrote:
> Laurentiu,
> 
> On Fri, Nov 27 2020 at 10:03, Laurențiu Nicola wrote:
> > On Fri, Nov 27, 2020, at 02:12, Thomas Gleixner wrote:
> >> On Thu, Nov 26 2020 at 09:47, Laurențiu Nicola wrote:
> >> > These messages are described as warnings in the MSI code.
> >> 
> >> Where and what has MSI to do with these messages?
> >
> > There's a comment referring to it as a warning, but an error seemed a more appropriate severity:
> >
> >      * If the vector is unused, then it is marked so it won't
> >      * trigger the 'No irq handler for vector' warning in
> >      * common_interrupt().
> 
> That's a description for the logic in the MSI code which is required to
> _NOT_ trigger the 'No irq handler' message. If that message appears then
> something _is_ badly wrong. Either the kernel screwed up or something in
> the BIOS/firmware/hardware is bonkers.

Agreed, just pointing out that the MSI code refers to it as a warning (as opposed to a critical error).

> 
> >> > Spotted because they break quiet boot on a Ryzen 5000 CPU.
> >> 
> >> They don't break the boot.
> >> 
> >> The machine boots fine, but having interrupts raised on a vector which
> >> is unused is really bad.
> >
> > That's right, sorry. It still boots, but it's no longer "quiet",
> > that's what I meant.
> 
> Right, but suppressing that is not a solution.

I'm just downgrading it from "emergency" to "error". It will still be displayed for most users and anyone looking in dmesg. But I'm unlikely to convince my motherboard manufacturer to fix this in the BIOS, and the errors are basically unactionable and uninformative (unlike, say, "can't set up page mappings" or "your CPU might be on fire", which would really imply a crash soon).

The messages themselves are only a cosmetic issue -- they replace the BIOS logo that would otherwise stay up until the display manager started.

But if you think this should really be an "emerg" message, I'm not going to insist anymore. I'm sure you have more important patches to review :-).

Thanks,
Laurențiu



* Re: [PATCH] x86/irq: Lower unhandled irq error severity
  2020-11-30 17:22       ` Laurențiu Nicola
@ 2020-11-30 23:34         ` Thomas Gleixner
  2020-12-01  8:18           ` Laurențiu Nicola
  2020-12-01 14:36           ` Tom Lendacky
  0 siblings, 2 replies; 11+ messages in thread
From: Thomas Gleixner @ 2020-11-30 23:34 UTC
  To: Laurențiu Nicola; +Cc: mingo, bp, x86, trivial, LKML, Tom Lendacky

On Mon, Nov 30 2020 at 19:22, Laurențiu Nicola wrote:
> On Mon, Nov 30, 2020, at 18:56, Thomas Gleixner wrote:
>> > That's right, sorry. It still boots, but it's no longer "quiet",
>> > that's what I meant.
>> 
>> Right, but suppressing that is not a solution.
>
> I'm just downgrading it from "emergency" to "error". It will still be
> displayed for most users and anyone looking in dmesg. But I'm unlikely
> to convince my motherboard manufacturer to fix this in the BIOS, and
> the errors are basically unactionable and uninformative (unlike say
> "can't set up page mappings" or "your CPU might be on fire" which
> would really imply a crash soon).

The point is that for some cases this can result in a non working
machine which just hangs and if it's below the usual loglevel cutoff,
then it's not visible, which is more annoying than a non-quiet boot if
you're affected.

We are looking into a way to mitigate that AMD wreckage, but so far we
don't even know where exactly this comes from. The reason why we are
pretty sure that it is a BIOS/Firmware issue is that some people
reported it to be gone after a BIOS update and quite some machines do
not have this issue at all.

Just for completeness' sake. Can you provide the line in /proc/interrupts
for irq 7 on that machine?

Thanks,

        tglx


* Re: [PATCH] x86/irq: Lower unhandled irq error severity
  2020-11-30 23:34         ` Thomas Gleixner
@ 2020-12-01  8:18           ` Laurențiu Nicola
  2020-12-01 10:38             ` Thomas Gleixner
  2020-12-01 14:36           ` Tom Lendacky
  1 sibling, 1 reply; 11+ messages in thread
From: Laurențiu Nicola @ 2020-12-01  8:18 UTC
  To: Thomas Gleixner; +Cc: mingo, bp, x86, trivial, LKML, Tom Lendacky

On Tue, Dec 1, 2020, at 01:34, Thomas Gleixner wrote:
> The point is that for some cases this can result in a non working
> machine which just hangs and if it's below the usual loglevel cutoff,
> then it's not visible, which is more annoying than a non-quiet boot if
> you're affected.

For most (desktop) users "errors" will be shown by default, and if anyone is having trouble, they can temporarily remove "quiet" from the kernel command line while debugging it, so it's easy. On the other hand, I don't think it's possible to silence the emergency messages (and I'd still like to see them for any "something is on fire").

The only other use of `pr_emerg_ratelimited` seems to be an informational message shown on non-AMD MCEs ("run mcelog --ascii"). `pr_emerg` is used in more places, but they do sound like critical situations that will bring the system down anyway.
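
The visibility argument comes down to one comparison: a message reaches the console when its level is numerically lower than the console loglevel. Below is a minimal Python sketch of that rule; the `console_loglevel` values are illustrative only, since what "quiet" maps to depends on the kernel configuration and the command line, and `shown_on_console` is a hypothetical helper name.

```python
# Kernel log levels: the lower the number, the more severe the message.
LEVELS = {
    "emerg": 0, "alert": 1, "crit": 2, "err": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}


def shown_on_console(level_name, console_loglevel):
    """A message is printed when its level is below console_loglevel."""
    return LEVELS[level_name] < console_loglevel


# Illustrative console loglevels; the value a given "quiet" boot ends up
# with is an assumption that varies between configurations.
for console_loglevel in (1, 3, 4, 7):
    visible = [name for name in LEVELS
               if shown_on_console(name, console_loglevel)]
    print(console_loglevel, visible)
```

With a console loglevel of 3, an "err" message is hidden while "emerg" still gets through; with 4, both are shown. Assuming 1 is the lowest console loglevel that can be set, "emerg" is the one level that always passes the comparison.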

> 
> We are looking into a way to mitigate that AMD wreckage, but so far we
> don't even know where exactly this comes from. The reason why we are
> pretty sure that it is a BIOS/Firmware issue is that some people
> reported it to be gone after a BIOS update and quite some machines do
> not have this issue at all.

In my case, it's the latest BIOS version available. Could be AGESA-related, maybe we could install a no-op handler for that IRQ?

> 
> Just for completeness' sake. Can you provide the line in /proc/interrupts
> for irq 7 on that machine?


  55:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 2625543-edge      xhci_hcd

PS: I now see that this was reported a lot of times, including e.g. https://lkml.org/lkml/2019/3/6/97.

Thanks,
Laurențiu


* Re: [PATCH] x86/irq: Lower unhandled irq error severity
  2020-12-01  8:18           ` Laurențiu Nicola
@ 2020-12-01 10:38             ` Thomas Gleixner
  2020-12-01 10:41               ` Laurențiu Nicola
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2020-12-01 10:38 UTC
  To: Laurențiu Nicola; +Cc: mingo, bp, x86, trivial, LKML, Tom Lendacky

On Tue, Dec 01 2020 at 10:18, Laurențiu Nicola wrote:
> On Tue, Dec 1, 2020, at 01:34, Thomas Gleixner wrote:
>> Just for completeness' sake. Can you provide the line in /proc/interrupts
>> for irq 7 on that machine?
>
>
>   55:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 2625543-edge      xhci_hcd
>

IRQ 55 != IRQ 7. I really meant IRQ 7.

Thanks,

        tglx



* Re: [PATCH] x86/irq: Lower unhandled irq error severity
  2020-12-01 10:38             ` Thomas Gleixner
@ 2020-12-01 10:41               ` Laurențiu Nicola
  0 siblings, 0 replies; 11+ messages in thread
From: Laurențiu Nicola @ 2020-12-01 10:41 UTC
  To: Thomas Gleixner; +Cc: mingo, bp, x86, trivial, LKML, Tom Lendacky

On Tue, Dec 1, 2020, at 12:38, Thomas Gleixner wrote:
> On Tue, Dec 01 2020 at 10:18, Laurențiu Nicola wrote:
> > On Tue, Dec 1, 2020, at 01:34, Thomas Gleixner wrote:
> >> Just for completeness' sake. Can you provide the line in /proc/interrupts
> >> for irq 7 on that machine?
> >
> >
> >   55:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 2625543-edge      xhci_hcd
> >
> 
> IRQ 55 != IRQ 7. I really meant IRQ 7.

Sorry, I thought they were numbered differently. I guess this is 7?

   7:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-IO-APIC    7-fasteoi   pinctrl_amd
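
Rows in /proc/interrupts are labelled by IRQ number in the first field, so the requested row can be picked out by matching that label. A minimal Python sketch follows, using a shortened two-CPU sample modeled on the rows quoted in this thread (hypothetical data; the real file here has 32 count columns); `find_irq_row` and `describe` are hypothetical helper names.

```python
def find_irq_row(text, irq):
    """Return the row whose first field is '<irq>:', or None if absent."""
    label = f"{irq}:"
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == label:
            return line
    return None


def describe(row, cpu_count):
    """Split a row into its per-CPU counts and the trailing description."""
    fields = row.split()
    counts = [int(value) for value in fields[1:1 + cpu_count]]
    return counts, " ".join(fields[1 + cpu_count:])


# Shortened, hypothetical two-CPU sample modeled on the rows in this thread.
sample = (
    "            CPU0       CPU1\n"
    "   7:          0          0  IR-IO-APIC    7-fasteoi   pinctrl_amd\n"
    "  55:          0          0  IR-PCI-MSI 2625543-edge      xhci_hcd\n"
)
row = find_irq_row(sample, 7)
print(describe(row, cpu_count=2))
```

On a live system the same helpers could be fed the contents of /proc/interrupts, with `cpu_count` taken from the number of `CPUn` columns in the header line.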

Laurențiu


* Re: [PATCH] x86/irq: Lower unhandled irq error severity
  2020-11-30 23:34         ` Thomas Gleixner
  2020-12-01  8:18           ` Laurențiu Nicola
@ 2020-12-01 14:36           ` Tom Lendacky
  2020-12-01 14:44             ` Laurențiu Nicola
  1 sibling, 1 reply; 11+ messages in thread
From: Tom Lendacky @ 2020-12-01 14:36 UTC
  To: Thomas Gleixner, Laurențiu Nicola; +Cc: mingo, bp, x86, trivial, LKML

On 11/30/20 5:34 PM, Thomas Gleixner wrote:
> On Mon, Nov 30 2020 at 19:22, Laurențiu Nicola wrote:
>> On Mon, Nov 30, 2020, at 18:56, Thomas Gleixner wrote:
>>>> That's right, sorry. It still boots, but it's no longer "quiet",
>>>> that's what I meant.
>>>
>>> Right, but suppressing that is not a solution.
>>
>> I'm just downgrading it from "emergency" to "error". It will still be
>> displayed for most users and anyone looking in dmesg. But I'm unlikely
>> to convince my motherboard manufacturer to fix this in the BIOS, and
>> the errors are basically unactionable and uninformative (unlike say
>> "can't set up page mappings" or "your CPU might be on fire" which
>> would really imply a crash soon).
> 
> The point is that for some cases this can result in a non working
> machine which just hangs and if it's below the usual loglevel cutoff,
> then it's not visible, which is more annoying than a non-quiet boot if
> you're affected.
> 
> We are looking into a way to mitigate that AMD wreckage, but so far we
> don't even know where exactly this comes from. The reason why we are
> pretty sure that it is a BIOS/Firmware issue is that some people
> reported it to be gone after a BIOS update and quite some machines do
> not have this issue at all.

Thomas has reported this to me previously and I have reported it to our
BIOS team. That previously reported problem has been fixed in BIOS, but
I'm not sure at what AGESA level the fix will be rolled out.

@Laurențiu, what is the exact model of the processor you are running?

Thanks,
Tom

> 
> Just for completeness' sake. Can you provide the line in /proc/interrupts
> for irq 7 on that machine?
> 
> Thanks,
> 
>         tglx
> 


* Re: [PATCH] x86/irq: Lower unhandled irq error severity
  2020-12-01 14:36           ` Tom Lendacky
@ 2020-12-01 14:44             ` Laurențiu Nicola
  2020-12-01 17:05               ` Tom Lendacky
  0 siblings, 1 reply; 11+ messages in thread
From: Laurențiu Nicola @ 2020-12-01 14:44 UTC
  To: Tom Lendacky, Thomas Gleixner; +Cc: mingo, bp, x86, trivial, LKML

On Tue, Dec 1, 2020, at 16:36, Tom Lendacky wrote:
> 
> Thomas has reported this to me previously and I have reported it to our
> BIOS team. That previously reported problem has been fixed in BIOS, but
> I'm not sure at what AGESA level the fix will be rolled out.
> 
> @Laurențiu, what is the exact model of the processor you are running?

It's a Ryzen 5950X with a B550 motherboard (AGESA V2 PI 1.1.0.0 Patch C).

Laurențiu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86/irq: Lower unhandled irq error severity
  2020-12-01 14:44             ` Laurențiu Nicola
@ 2020-12-01 17:05               ` Tom Lendacky
  0 siblings, 0 replies; 11+ messages in thread
From: Tom Lendacky @ 2020-12-01 17:05 UTC
  To: Laurențiu Nicola, Thomas Gleixner; +Cc: mingo, bp, x86, trivial, LKML

On 12/1/20 8:44 AM, Laurențiu Nicola wrote:
> On Tue, Dec 1, 2020, at 16:36, Tom Lendacky wrote:
>>
>> Thomas has reported this to me previously and I have reported it to our
>> BIOS team. That previously reported problem has been fixed in BIOS, but
>> I'm not sure at what AGESA level the fix will be rolled out.
>>
>> @Laurențiu, what is the exact model of the processor you are running?
> 
> It's a Ryzen 5950X with a B550 motherboard (AGESA V2 PI 1.1.0.0 Patch C).

Ok, I will forward on the information.

Thanks,
Tom

> 
> Laurențiu
> 

