All of lore.kernel.org
 help / color / mirror / Atom feed
* [BUG] x2apic broken with current AMD hardware
@ 2023-03-08 15:23 Elliott Mitchell
  2023-03-08 15:37 ` Andrew Cooper
  2023-03-08 15:50 ` Jan Beulich
  0 siblings, 2 replies; 24+ messages in thread
From: Elliott Mitchell @ 2023-03-08 15:23 UTC (permalink / raw)
  To: xen-devel

Mostly SSIA.  As originally identified by "Neowutran", appears Xen's
x2apic wrapper implementation fails with current generation AMD hardware
(Ryzen 7xxx/Zen 4).  This can be worked around by passing "x2apic=false"
on Xen's command-line, though I'm wondering about the performance impact.

There hasn't been much activity on xen-devel WRT x2apic, so a patch which
fixed this may have flown under the radar.  Most testing has also been
somewhat removed from HEAD.

Thanks to "Neowutran" for falling on this grenade and making it easier
for the followers.  Pointer to first report:
https://forum.qubes-os.org/t/ryzen-7000-serie/14538


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-08 15:23 [BUG] x2apic broken with current AMD hardware Elliott Mitchell
@ 2023-03-08 15:37 ` Andrew Cooper
  2023-03-08 15:50 ` Jan Beulich
  1 sibling, 0 replies; 24+ messages in thread
From: Andrew Cooper @ 2023-03-08 15:37 UTC (permalink / raw)
  To: Elliott Mitchell, xen-devel

On 08/03/2023 3:23 pm, Elliott Mitchell wrote:
> Mostly SSIA.  As originally identified by "Neowutran", appears Xen's
> x2apic wrapper implementation fails with current generation AMD hardware
> (Ryzen 7xxx/Zen 4).  This can be worked around by passing "x2apic=false"
> on Xen's command-line, though I'm wondering about the performance impact.
>
> There hasn't been much activity on xen-devel WRT x2apic, so a patch which
> fixed this may have flown under the radar.  Most testing has also been
> somewhat removed from HEAD.
>
> Thanks to "Neowutran" for falling on this grenade and making it easier
> for the followers.  Pointer to first report:
> https://forum.qubes-os.org/t/ryzen-7000-serie/14538

I'm sorry, but this is unactionable.

That thread is a mess of many different bugs, most entirely unrelated to
Xen in the slightest, and there is conflicting information on whether
the x2apic=false heuristic is even relevant.

There may be a Xen issue hiding in there, but my divination skill level
is unequal to the task...

~Andrew


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-08 15:23 [BUG] x2apic broken with current AMD hardware Elliott Mitchell
  2023-03-08 15:37 ` Andrew Cooper
@ 2023-03-08 15:50 ` Jan Beulich
  2023-03-08 23:08   ` Elliott Mitchell
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-03-08 15:50 UTC (permalink / raw)
  To: Elliott Mitchell; +Cc: xen-devel

On 08.03.2023 16:23, Elliott Mitchell wrote:
> Mostly SSIA.  As originally identified by "Neowutran", appears Xen's
> x2apic wrapper implementation fails with current generation AMD hardware
> (Ryzen 7xxx/Zen 4).  This can be worked around by passing "x2apic=false"
> on Xen's command-line, though I'm wondering about the performance impact.
> 
> There hasn't been much activity on xen-devel WRT x2apic, so a patch which
> fixed this may have flown under the radar.  Most testing has also been
> somewhat removed from HEAD.
> 
> Thanks to "Neowutran" for falling on this grenade and making it easier
> for the followers.  Pointer to first report:
> https://forum.qubes-os.org/t/ryzen-7000-serie/14538

I'm sorry, but when you point at this long a report, would you please be a
little more specific as to where the problem in question is actually
mentioned? Searching the page for "x2apic" didn't give any hits at all
until I first scrolled to the bottom of the (at present) 95 comments. And
then while there are five mentions, there's nothing I could spot that
would actually help understanding what is actually wrong. A statement like
"... is because the implementation of x2apic is incorrect" isn't helpful
on its own. And a later statement by another person puts under question
whether "x2apic=false" actually helps in all cases.

Please can we get a proper bug report here with suitable technical detail?
Or alternatively a patch to discuss?

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-08 15:50 ` Jan Beulich
@ 2023-03-08 23:08   ` Elliott Mitchell
  2023-03-09  9:03     ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Elliott Mitchell @ 2023-03-08 23:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, Mar 08, 2023 at 04:50:51PM +0100, Jan Beulich wrote:
> On 08.03.2023 16:23, Elliott Mitchell wrote:
> > Mostly SSIA.  As originally identified by "Neowutran", appears Xen's
> > x2apic wrapper implementation fails with current generation AMD hardware
> > (Ryzen 7xxx/Zen 4).  This can be worked around by passing "x2apic=false"
> > on Xen's command-line, though I'm wondering about the performance impact.
> > 
> > There hasn't been much activity on xen-devel WRT x2apic, so a patch which
> > fixed this may have flown under the radar.  Most testing has also been
> > somewhat removed from HEAD.
> > 
> > Thanks to "Neowutran" for falling on this grenade and making it easier
> > for the followers.  Pointer to first report:
> > https://forum.qubes-os.org/t/ryzen-7000-serie/14538
> 
> I'm sorry, but when you point at this long a report, would you please be a
> little more specific as to where the problem in question is actually
> mentioned? Searching the page for "x2apic" didn't give any hits at all
> until I first scrolled to the bottom of the (at present) 95 comments. And
> then while there are five mentions, there's nothing I could spot that
> would actually help understanding what is actually wrong. A statement like
> "... is because the implementation of x2apic is incorrect" isn't helpful
> on its own. And a later statement by another person puts under question
> whether "x2apic=false" actually helps in all cases.
> 
> Please can we get a proper bug report here with suitable technical detail?
> Or alternatively a patch to discuss?

Mostly I was pointing to the thread to credit Neowutran and company with
originally finding the workaround.  I'm concerned about how
representative my reproduction is since the computer in question is
presently using Debian's build of Xen, 4.14.

As such I'm less than certain the problem is still in HEAD, though
Neowutran and Co working with 4.16 and the commit log being quiet
suggests there is a good chance.

More detail, pretty well most things are broken for Domain 0 without
"x2apic=false".  Trying to boot with a 6.1.12 a USB keyboard was
completely unresponsive, on screen the initial ramdisk script output was
indicating problems interacting with storage devices.  Those two together
suggested an interrupt issue and adding "x2apic=false" caused domain 0 to
successfully boot.
A 5.10 kernel similarly requires "x2apic=false" to successfully boot.

So could be a commit after 4.16 fixed x2apic for current AMD hardware,
but may still be broken.

I sent the message out of concern Neowutran got attention to the TSC
overflow issue, but I haven't seen any mention of the x2apic issue.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-08 23:08   ` Elliott Mitchell
@ 2023-03-09  9:03     ` Jan Beulich
  2023-03-10  3:25       ` Elliott Mitchell
  2023-03-11  0:09       ` Elliott Mitchell
  0 siblings, 2 replies; 24+ messages in thread
From: Jan Beulich @ 2023-03-09  9:03 UTC (permalink / raw)
  To: Elliott Mitchell; +Cc: xen-devel

On 09.03.2023 00:08, Elliott Mitchell wrote:
> On Wed, Mar 08, 2023 at 04:50:51PM +0100, Jan Beulich wrote:
>> On 08.03.2023 16:23, Elliott Mitchell wrote:
>>> Mostly SSIA.  As originally identified by "Neowutran", appears Xen's
>>> x2apic wrapper implementation fails with current generation AMD hardware
>>> (Ryzen 7xxx/Zen 4).  This can be worked around by passing "x2apic=false"
>>> on Xen's command-line, though I'm wondering about the performance impact.
>>>
>>> There hasn't been much activity on xen-devel WRT x2apic, so a patch which
>>> fixed this may have flown under the radar.  Most testing has also been
>>> somewhat removed from HEAD.
>>>
>>> Thanks to "Neowutran" for falling on this grenade and making it easier
>>> for the followers.  Pointer to first report:
>>> https://forum.qubes-os.org/t/ryzen-7000-serie/14538
>>
>> I'm sorry, but when you point at this long a report, would you please be a
>> little more specific as to where the problem in question is actually
>> mentioned? Searching the page for "x2apic" didn't give any hits at all
>> until I first scrolled to the bottom of the (at present) 95 comments. And
>> then while there are five mentions, there's nothing I could spot that
>> would actually help understanding what is actually wrong. A statement like
>> "... is because the implementation of x2apic is incorrect" isn't helpful
>> on its own. And a later statement by another person puts under question
>> whether "x2apic=false" actually helps in all cases.
>>
>> Please can we get a proper bug report here with suitable technical detail?
>> Or alternatively a patch to discuss?
> 
> Mostly I was pointing to the thread to credit Neowutran and company with
> originally finding the workaround.  I'm concerned about how
> representative my reproduction is since the computer in question is
> presently using Debian's build of Xen, 4.14.
> 
> As such I'm less than certain the problem is still in HEAD, though
> Neowutran and Co working with 4.16 and the commit log being quiet
> suggests there is a good chance.
> 
> More detail, pretty well most things are broken for Domain 0 without
> "x2apic=false".  Trying to boot with a 6.1.12 a USB keyboard was
> completely unresponsive, on screen the initial ramdisk script output was
> indicating problems interacting with storage devices.  Those two together
> suggested an interrupt issue and adding "x2apic=false" caused domain 0 to
> successfully boot.
> A 5.10 kernel similarly requires "x2apic=false" to successfully boot.
> 
> So could be a commit after 4.16 fixed x2apic for current AMD hardware,
> but may still be broken.

If Dom0 boot is affected, trying a newer hypervisor shouldn't be a problem.
You won't need any of the toolstack to match just to see whether Dom0 boots.

In any event you will want to collect a serial log at maximum verbosity.
It would also be of interest to know whether turning off the IOMMU avoids
the issue as well (on the assumption that your system has less than 255
CPUs).

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-09  9:03     ` Jan Beulich
@ 2023-03-10  3:25       ` Elliott Mitchell
  2023-03-10  8:22         ` Jan Beulich
  2023-03-11  0:09       ` Elliott Mitchell
  1 sibling, 1 reply; 24+ messages in thread
From: Elliott Mitchell @ 2023-03-10  3:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
> On 09.03.2023 00:08, Elliott Mitchell wrote:
> > 
> > As such I'm less than certain the problem is still in HEAD, though
> > Neowutran and Co working with 4.16 and the commit log being quiet
> > suggests there is a good chance.
> > 
> > More detail, pretty well most things are broken for Domain 0 without
> > "x2apic=false".  Trying to boot with a 6.1.12 a USB keyboard was
> > completely unresponsive, on screen the initial ramdisk script output was
> > indicating problems interacting with storage devices.  Those two together
> > suggested an interrupt issue and adding "x2apic=false" caused domain 0 to
> > successfully boot.
> > A 5.10 kernel similarly requires "x2apic=false" to successfully boot.
> > 
> > So could be a commit after 4.16 fixed x2apic for current AMD hardware,
> > but may still be broken.
> 
> If Dom0 boot is affected, trying a newer hypervisor shouldn't be a problem.
> You won't need any of the toolstack to match just to see whether Dom0 boots.
> 
> In any event you will want to collect a serial log at maximum verbosity.
> It would also be of interest to know whether turning off the IOMMU avoids
> the issue as well (on the assumption that your system has less than 255
> CPUs).

Well, I can now state "x2apic=false" IS required for Xen 4.17.  Since the
last x2apic commit was about a year ago, I believe this matches HEAD.  I
missed the logs since the USB-serial adapter decided to bugger when the
machine rebooted.

Is it just me or is https://wiki.xenproject.org/wiki/Xen_Serial_Console
out of date?

Last time I used serial debugging I recall the options being different.
Due to stress level this time I'm not so favorable to looking things up
in Git (this system is crucial in my development workflow, so being in a
problematic state is trouble).


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-10  3:25       ` Elliott Mitchell
@ 2023-03-10  8:22         ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2023-03-10  8:22 UTC (permalink / raw)
  To: Elliott Mitchell; +Cc: xen-devel

On 10.03.2023 04:25, Elliott Mitchell wrote:
> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
>> On 09.03.2023 00:08, Elliott Mitchell wrote:
>>>
>>> As such I'm less than certain the problem is still in HEAD, though
>>> Neowutran and Co working with 4.16 and the commit log being quiet
>>> suggests there is a good chance.
>>>
>>> More detail, pretty well most things are broken for Domain 0 without
>>> "x2apic=false".  Trying to boot with a 6.1.12 a USB keyboard was
>>> completely unresponsive, on screen the initial ramdisk script output was
>>> indicating problems interacting with storage devices.  Those two together
>>> suggested an interrupt issue and adding "x2apic=false" caused domain 0 to
>>> successfully boot.
>>> A 5.10 kernel similarly requires "x2apic=false" to successfully boot.
>>>
>>> So could be a commit after 4.16 fixed x2apic for current AMD hardware,
>>> but may still be broken.
>>
>> If Dom0 boot is affected, trying a newer hypervisor shouldn't be a problem.
>> You won't need any of the toolstack to match just to see whether Dom0 boots.
>>
>> In any event you will want to collect a serial log at maximum verbosity.
>> It would also be of interest to know whether turning off the IOMMU avoids
>> the issue as well (on the assumption that your system has less than 255
>> CPUs).
> 
> Well, I can now state "x2apic=false" IS required for Xen 4.17.  Since the
> last x2apic commit was about a year ago, I believe this matches HEAD.  I
> missed the logs since the USB-serial adapter decided to bugger when the
> machine rebooted.
> 
> Is it just me or is https://wiki.xenproject.org/wiki/Xen_Serial_Console
> out of date?

Without you being more specific (there are many variants there), from
quickly glancing over it the command line options mentioned look right to
me. There's an oddity like "console=tty0", but that's not affecting serial
(should be "console=vga", if at all - in the specific example the
subsequent "console=com2" overrides the bogus one anyway).

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-09  9:03     ` Jan Beulich
  2023-03-10  3:25       ` Elliott Mitchell
@ 2023-03-11  0:09       ` Elliott Mitchell
  2023-03-13  7:01         ` Jan Beulich
  1 sibling, 1 reply; 24+ messages in thread
From: Elliott Mitchell @ 2023-03-11  0:09 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
> 
> In any event you will want to collect a serial log at maximum verbosity.
> It would also be of interest to know whether turning off the IOMMU avoids
> the issue as well (on the assumption that your system has less than 255
> CPUs).

I think I might have figured out the situation in a different fashion.

I was taking a look at the BIOS manual for this motherboard and noticed
a mention of a "Local APIC Mode" setting.  Four values are listed
"Compatibility", "xAPIC", "x2APIC", and "Auto".

That is the sort of setting I likely left at "Auto" and that may well
result in x2 functionality being disabled.  Perhaps the x2APIC
functionality on AMD is detecting whether the hardware is present, and
failing to test whether it has been enabled?  (could be useful to output
a message suggesting enabling the hardware feature)


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-11  0:09       ` Elliott Mitchell
@ 2023-03-13  7:01         ` Jan Beulich
  2023-03-16 22:03           ` Elliott Mitchell
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-03-13  7:01 UTC (permalink / raw)
  To: Elliott Mitchell; +Cc: xen-devel

On 11.03.2023 01:09, Elliott Mitchell wrote:
> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
>>
>> In any event you will want to collect a serial log at maximum verbosity.
>> It would also be of interest to know whether turning off the IOMMU avoids
>> the issue as well (on the assumption that your system has less than 255
>> CPUs).
> 
> I think I might have figured out the situation in a different fashion.
> 
> I was taking a look at the BIOS manual for this motherboard and noticed
> a mention of a "Local APIC Mode" setting.  Four values are listed
> "Compatibility", "xAPIC", "x2APIC", and "Auto".
> 
> That is the sort of setting I likely left at "Auto" and that may well
> result in x2 functionality being disabled.  Perhaps the x2APIC
> functionality on AMD is detecting whether the hardware is present, and
> failing to test whether it has been enabled?  (could be useful to output
> a message suggesting enabling the hardware feature)

Can we please move to a little more technical terms here? What is "present"
and "enabled" in your view? I don't suppose you mean the CPUID bit (which
we check) and the x2APIC-mode-enable one (which we drive as needed). It's
also left unclear what the four modes of BIOS operation evaluate to. Even
if we knew that, overriding e.g. "Compatibility" (which likely means some
form of "disabled" / "hidden") isn't normally an appropriate thing to do.
In "Auto" mode Xen likely should work - the only way I could interpret the
the other modes are "xAPIC" meaning no x2APIC ACPI tables entries (and
presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode pre-
enabled by firmware, and "Auto" leaving it to the OS to select. Yet that's
speculation on my part ...

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-13  7:01         ` Jan Beulich
@ 2023-03-16 22:03           ` Elliott Mitchell
  2023-03-17  8:22             ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Elliott Mitchell @ 2023-03-16 22:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
> On 11.03.2023 01:09, Elliott Mitchell wrote:
> > On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
> >>
> >> In any event you will want to collect a serial log at maximum verbosity.
> >> It would also be of interest to know whether turning off the IOMMU avoids
> >> the issue as well (on the assumption that your system has less than 255
> >> CPUs).
> > 
> > I think I might have figured out the situation in a different fashion.
> > 
> > I was taking a look at the BIOS manual for this motherboard and noticed
> > a mention of a "Local APIC Mode" setting.  Four values are listed
> > "Compatibility", "xAPIC", "x2APIC", and "Auto".
> > 
> > That is the sort of setting I likely left at "Auto" and that may well
> > result in x2 functionality being disabled.  Perhaps the x2APIC
> > functionality on AMD is detecting whether the hardware is present, and
> > failing to test whether it has been enabled?  (could be useful to output
> > a message suggesting enabling the hardware feature)
> 
> Can we please move to a little more technical terms here? What is "present"
> and "enabled" in your view? I don't suppose you mean the CPUID bit (which
> we check) and the x2APIC-mode-enable one (which we drive as needed). It's
> also left unclear what the four modes of BIOS operation evaluate to. Even
> if we knew that, overriding e.g. "Compatibility" (which likely means some
> form of "disabled" / "hidden") isn't normally an appropriate thing to do.
> In "Auto" mode Xen likely should work - the only way I could interpret the
> the other modes are "xAPIC" meaning no x2APIC ACPI tables entries (and
> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode pre-
> enabled by firmware, and "Auto" leaving it to the OS to select. Yet that's
> speculation on my part ...

I provided the information I had discovered.  There is a setting for this
motherboard (likely present on some similar motherboards) which /may/
effect the issue.  I doubt I've tried "compatibility", but none of the
values I've tried have gotten the system to boot without "x2apic=false"
on Xen's command-line.

When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended Features:"
I see the line "(XEN) - x2APIC".  Later is the line
"(XEN) x2APIC mode is already enabled by BIOS."  I'll guess "Auto"
leaves the x2APIC turned off since neither line is present.

Both cases the line "(XEN) Switched to APIC driver x2apic_cluster" is
present (so perhaps "Auto" merely doesn't activate it).

Appears error_interrupt() needs locking or some concurrency handling
mechanism since the last error is jumbled.  With the setting "x2APIC"
I get a bunch of:
"(XEN) APIC error on CPU#: 00(08)(XEN) APIC error on CPU#: 00(08)"
(apparently one for each core)
Followed by "Receive accept error, Receive accept error," (again,
apparently one for each core).  Then a bunch of newlines (same pattern).

With the setting "auto" the last message is a series of
"(XEN) CPU#: No irq handler for vector ## (IRQ -2147483648, LAPIC)" on
2 different cores.  Rather more of the lines were from one core, the
vector value varied some.

Notable both sets of final error messages appeared after the Domain 0
kernel thought it had been operating for >30 seconds.  Lack of
response to interrupt generating events (pressing keys on USB keyboard)
suggests interrupts weren't getting through.


With "x2apic=false" error messages similar to the "Local APIC Mode"
of "x2APIC" appear >45 seconds after Domain 0 kernel start.  Of note
first "(XEN) APIC error on CPU#: 00(08)(XEN) APIC error on CPU#: 00(08)"
appears for all cores with "Receive accept error,".

Yet later a variation on this message starts appearing:
"(XEN) APIC error on CPU#: 08(08)(XEN) APIC error on CPU#: 08(08)"
this one appears multiple times.


If one was to want full logs, the lack of secure communications channel
would be an issue (since filtering out identifying data is difficult).
DSA-3072 with SHA2-256 is now less than wonderful, but DSA-1024 and
ElGamal 2048 are right out.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-16 22:03           ` Elliott Mitchell
@ 2023-03-17  8:22             ` Jan Beulich
  2023-03-17 17:26               ` Elliott Mitchell
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-03-17  8:22 UTC (permalink / raw)
  To: Elliott Mitchell; +Cc: xen-devel

On 16.03.2023 23:03, Elliott Mitchell wrote:
> On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
>> On 11.03.2023 01:09, Elliott Mitchell wrote:
>>> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
>>>>
>>>> In any event you will want to collect a serial log at maximum verbosity.
>>>> It would also be of interest to know whether turning off the IOMMU avoids
>>>> the issue as well (on the assumption that your system has less than 255
>>>> CPUs).
>>>
>>> I think I might have figured out the situation in a different fashion.
>>>
>>> I was taking a look at the BIOS manual for this motherboard and noticed
>>> a mention of a "Local APIC Mode" setting.  Four values are listed
>>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
>>>
>>> That is the sort of setting I likely left at "Auto" and that may well
>>> result in x2 functionality being disabled.  Perhaps the x2APIC
>>> functionality on AMD is detecting whether the hardware is present, and
>>> failing to test whether it has been enabled?  (could be useful to output
>>> a message suggesting enabling the hardware feature)
>>
>> Can we please move to a little more technical terms here? What is "present"
>> and "enabled" in your view? I don't suppose you mean the CPUID bit (which
>> we check) and the x2APIC-mode-enable one (which we drive as needed). It's
>> also left unclear what the four modes of BIOS operation evaluate to. Even
>> if we knew that, overriding e.g. "Compatibility" (which likely means some
>> form of "disabled" / "hidden") isn't normally an appropriate thing to do.
>> In "Auto" mode Xen likely should work - the only way I could interpret the
>> the other modes are "xAPIC" meaning no x2APIC ACPI tables entries (and
>> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode pre-
>> enabled by firmware, and "Auto" leaving it to the OS to select. Yet that's
>> speculation on my part ...
> 
> I provided the information I had discovered.  There is a setting for this
> motherboard (likely present on some similar motherboards) which /may/
> effect the issue.  I doubt I've tried "compatibility", but none of the
> values I've tried have gotten the system to boot without "x2apic=false"
> on Xen's command-line.
> 
> When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended Features:"
> I see the line "(XEN) - x2APIC".  Later is the line
> "(XEN) x2APIC mode is already enabled by BIOS."  I'll guess "Auto"
> leaves the x2APIC turned off since neither line is present.

When "(XEN) - x2APIC" is absent the IOMMU can't be switched into x2APIC
mode. Are you sure that's the case when using "Auto"?

> Both cases the line "(XEN) Switched to APIC driver x2apic_cluster" is
> present (so perhaps "Auto" merely doesn't activate it).

Did you also try "x2apic_phys" on the Xen command line (just to be sure
this isn't a clustered-mode only issue)?

> Appears error_interrupt() needs locking or some concurrency handling
> mechanism since the last error is jumbled.  With the setting "x2APIC"
> I get a bunch of:
> "(XEN) APIC error on CPU#: 00(08)(XEN) APIC error on CPU#: 00(08)"
> (apparently one for each core)
> Followed by "Receive accept error, Receive accept error," (again,
> apparently one for each core).  Then a bunch of newlines (same pattern).

This is a known issue, but since the messages shouldn't appear in the
first place so far no-one has bothered to address this.

> With the setting "auto" the last message is a series of
> "(XEN) CPU#: No irq handler for vector ## (IRQ -2147483648, LAPIC)" on
> 2 different cores.  Rather more of the lines were from one core, the
> vector value varied some.
> 
> Notable both sets of final error messages appeared after the Domain 0
> kernel thought it had been operating for >30 seconds.  Lack of
> response to interrupt generating events (pressing keys on USB keyboard)
> suggests interrupts weren't getting through.
> 
> 
> With "x2apic=false" error messages similar to the "Local APIC Mode"
> of "x2APIC" appear >45 seconds after Domain 0 kernel start.  Of note
> first "(XEN) APIC error on CPU#: 00(08)(XEN) APIC error on CPU#: 00(08)"
> appears for all cores with "Receive accept error,".
> 
> Yet later a variation on this message starts appearing:
> "(XEN) APIC error on CPU#: 08(08)(XEN) APIC error on CPU#: 08(08)"
> this one appears multiple times.

That's certainly odd, as it suggests that things were okay for a short
while.

> If one was to want full logs, the lack of secure communications channel
> would be an issue (since filtering out identifying data is difficult).
> DSA-3072 with SHA2-256 is now less than wonderful, but DSA-1024 and
> ElGamal 2048 are right out.

I think how good or bad my publicly available key is doesn't matter here
at all. You're not asked to provide the logs to me or any other
individual; you're asked to provide the logs to the community, such that
anyone interested may take a stab at addressing the issue. Eventual
patches may also want to refer to (parts of) such logs, which wouldn't
be possible if they weren't to be public in the first place.

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-17  8:22             ` Jan Beulich
@ 2023-03-17 17:26               ` Elliott Mitchell
  2023-03-20  8:14                 ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Elliott Mitchell @ 2023-03-17 17:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Fri, Mar 17, 2023 at 09:22:09AM +0100, Jan Beulich wrote:
> On 16.03.2023 23:03, Elliott Mitchell wrote:
> > On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
> >> On 11.03.2023 01:09, Elliott Mitchell wrote:
> >>> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
> >>>>
> >>>> In any event you will want to collect a serial log at maximum verbosity.
> >>>> It would also be of interest to know whether turning off the IOMMU avoids
> >>>> the issue as well (on the assumption that your system has less than 255
> >>>> CPUs).
> >>>
> >>> I think I might have figured out the situation in a different fashion.
> >>>
> >>> I was taking a look at the BIOS manual for this motherboard and noticed
> >>> a mention of a "Local APIC Mode" setting.  Four values are listed
> >>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
> >>>
> >>> That is the sort of setting I likely left at "Auto" and that may well
> >>> result in x2 functionality being disabled.  Perhaps the x2APIC
> >>> functionality on AMD is detecting whether the hardware is present, and
> >>> failing to test whether it has been enabled?  (could be useful to output
> >>> a message suggesting enabling the hardware feature)
> >>
> >> Can we please move to a little more technical terms here? What is "present"
> >> and "enabled" in your view? I don't suppose you mean the CPUID bit (which
> >> we check) and the x2APIC-mode-enable one (which we drive as needed). It's
> >> also left unclear what the four modes of BIOS operation evaluate to. Even
> >> if we knew that, overriding e.g. "Compatibility" (which likely means some
> >> form of "disabled" / "hidden") isn't normally an appropriate thing to do.
> >> In "Auto" mode Xen likely should work - the only way I could interpret the
> >> the other modes are "xAPIC" meaning no x2APIC ACPI tables entries (and
> >> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode pre-
> >> enabled by firmware, and "Auto" leaving it to the OS to select. Yet that's
> >> speculation on my part ...
> > 
> > I provided the information I had discovered.  There is a setting for this
> > motherboard (likely present on some similar motherboards) which /may/
> > effect the issue.  I doubt I've tried "compatibility", but none of the
> > values I've tried have gotten the system to boot without "x2apic=false"
> > on Xen's command-line.
> > 
> > When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended Features:"
> > I see the line "(XEN) - x2APIC".  Later is the line
> > "(XEN) x2APIC mode is already enabled by BIOS."  I'll guess "Auto"
> > leaves the x2APIC turned off since neither line is present.
> 
> When "(XEN) - x2APIC" is absent the IOMMU can't be switched into x2APIC
> mode. Are you sure that's the case when using "Auto"?

grep -eAPIC\ driver -e-\ x2APIC:

"Auto":
(XEN) Using APIC driver default
(XEN) Overriding APIC driver with bigsmp
(XEN) Switched to APIC driver x2apic_cluster

"x2APIC":
(XEN) Using APIC driver x2apic_cluster
(XEN) - x2APIC

Yes, I'm sure.

> > Both cases the line "(XEN) Switched to APIC driver x2apic_cluster" is
> > present (so perhaps "Auto" merely doesn't activate it).
> 
> Did you also try "x2apic_phys" on the Xen command line (just to be sure
> this isn't a clustered-mode only issue)?

No.  In fact x2apic_cluster is mentioned in all failure cases.

> > Appears error_interrupt() needs locking or some concurrency handling
> > mechanism since the last error is jumbled.  With the setting "x2APIC"
> > I get a bunch of:
> > "(XEN) APIC error on CPU#: 00(08)(XEN) APIC error on CPU#: 00(08)"
> > (apparently one for each core)
> > Followed by "Receive accept error, Receive accept error," (again,
> > apparently one for each core).  Then a bunch of newlines (same pattern).
> 
> This is a known issue, but since the messages shouldn't appear in the
> first place so far no-one has bothered to address this.

I won't claim it is the best solution, but see other message.

I'm tempted to propose allowing _Static_assert() since it is valuable
functionality for preventing issues.

> > With the setting "auto" the last message is a series of
> > "(XEN) CPU#: No irq handler for vector ## (IRQ -2147483648, LAPIC)" on
> > 2 different cores.  Rather more of the lines were from one core, the
> > vector value varied some.
> > 
> > Notable both sets of final error messages appeared after the Domain 0
> > kernel thought it had been operating for >30 seconds.  Lack of
> > response to interrupt generating events (pressing keys on USB keyboard)
> > suggests interrupts weren't getting through.
> > 
> > 
> > With "x2apic=false" error messages similar to the "Local APIC Mode"
> > of "x2APIC" appear >45 seconds after Domain 0 kernel start.  Of note
> > first "(XEN) APIC error on CPU#: 00(08)(XEN) APIC error on CPU#: 00(08)"
> > appears for all cores with "Receive accept error,".
> > 
> > Yet later a variation on this message starts appearing:
> > "(XEN) APIC error on CPU#: 08(08)(XEN) APIC error on CPU#: 08(08)"
> > this one appears multiple times.
> 
> That's certainly odd, as it suggests that things were okay for a short
> while.

Note, "later" means further down the log.  Upon checking this could mean
almost immediately after.  There are a total of 6 "APIC error" lines
(first with "00(08)", rest with "08(08)") and the lines with timestamps
indicate no more than 2 seconds between them.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-17 17:26               ` Elliott Mitchell
@ 2023-03-20  8:14                 ` Jan Beulich
  2023-03-20  8:28                   ` Jan Beulich
  2023-03-20 15:37                   ` Elliott Mitchell
  0 siblings, 2 replies; 24+ messages in thread
From: Jan Beulich @ 2023-03-20  8:14 UTC (permalink / raw)
  To: Elliott Mitchell; +Cc: xen-devel

On 17.03.2023 18:26, Elliott Mitchell wrote:
> On Fri, Mar 17, 2023 at 09:22:09AM +0100, Jan Beulich wrote:
>> On 16.03.2023 23:03, Elliott Mitchell wrote:
>>> On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
>>>> On 11.03.2023 01:09, Elliott Mitchell wrote:
>>>>> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
>>>>>>
>>>>>> In any event you will want to collect a serial log at maximum verbosity.
>>>>>> It would also be of interest to know whether turning off the IOMMU avoids
>>>>>> the issue as well (on the assumption that your system has less than 255
>>>>>> CPUs).
>>>>>
>>>>> I think I might have figured out the situation in a different fashion.
>>>>>
>>>>> I was taking a look at the BIOS manual for this motherboard and noticed
>>>>> a mention of a "Local APIC Mode" setting.  Four values are listed
>>>>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
>>>>>
>>>>> That is the sort of setting I likely left at "Auto" and that may well
>>>>> result in x2 functionality being disabled.  Perhaps the x2APIC
>>>>> functionality on AMD is detecting whether the hardware is present, and
>>>>> failing to test whether it has been enabled?  (could be useful to output
>>>>> a message suggesting enabling the hardware feature)
>>>>
>>>> Can we please move to a little more technical terms here? What is "present"
>>>> and "enabled" in your view? I don't suppose you mean the CPUID bit (which
>>>> we check) and the x2APIC-mode-enable one (which we drive as needed). It's
>>>> also left unclear what the four modes of BIOS operation evaluate to. Even
>>>> if we knew that, overriding e.g. "Compatibility" (which likely means some
>>>> form of "disabled" / "hidden") isn't normally an appropriate thing to do.
>>>> In "Auto" mode Xen likely should work - the only way I could interpret the
>>>> the other modes are "xAPIC" meaning no x2APIC ACPI tables entries (and
>>>> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode pre-
>>>> enabled by firmware, and "Auto" leaving it to the OS to select. Yet that's
>>>> speculation on my part ...
>>>
>>> I provided the information I had discovered.  There is a setting for this
>>> motherboard (likely present on some similar motherboards) which /may/
>>> effect the issue.  I doubt I've tried "compatibility", but none of the
>>> values I've tried have gotten the system to boot without "x2apic=false"
>>> on Xen's command-line.
>>>
>>> When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended Features:"
>>> I see the line "(XEN) - x2APIC".  Later is the line
>>> "(XEN) x2APIC mode is already enabled by BIOS."  I'll guess "Auto"
>>> leaves the x2APIC turned off since neither line is present.
>>
>> When "(XEN) - x2APIC" is absent the IOMMU can't be switched into x2APIC
>> mode. Are you sure that's the case when using "Auto"?
> 
> grep -eAPIC\ driver -e-\ x2APIC:
> 
> "Auto":
> (XEN) Using APIC driver default
> (XEN) Overriding APIC driver with bigsmp
> (XEN) Switched to APIC driver x2apic_cluster
> 
> "x2APIC":
> (XEN) Using APIC driver x2apic_cluster
> (XEN) - x2APIC
> 
> Yes, I'm sure.

Okay, this then means we're running in a mode we don't mean to run
in: When the IOMMU claims to not support x2APIC mode (which is odd in
the first place when at the same time the CPU reports x2APIC mode as
supported), amd_iommu_prepare() is intended to switch interrupt
remapping mode to "restricted" (which in turn would force x2APIC mode
to "physical", not "clustered"). I notice though that there are a
number of error paths in the function which bypass this setting. Could
you add a couple of printk()s to understand which path is taken (each
time; the function can be called more than once)?

>>> Both cases the line "(XEN) Switched to APIC driver x2apic_cluster" is
>>> present (so perhaps "Auto" merely doesn't activate it).
>>
>> Did you also try "x2apic_phys" on the Xen command line (just to be sure
>> this isn't a clustered-mode only issue)?
> 
> No.  In fact x2apic_cluster is mentioned in all failure cases.

Could you give physical mode a try, please?

>>> Appears error_interrupt() needs locking or some concurrency handling
>>> mechanism since the last error is jumbled.  With the setting "x2APIC"
>>> I get a bunch of:
>>> "(XEN) APIC error on CPU#: 00(08)(XEN) APIC error on CPU#: 00(08)"
>>> (apparently one for each core)
>>> Followed by "Receive accept error, Receive accept error," (again,
>>> apparently one for each core).  Then a bunch of newlines (same pattern).
>>
>> This is a known issue, but since the messages shouldn't appear in the
>> first place so far no-one has bothered to address this.
> 
> I won't claim it is the best solution, but see other message.
> 
> I'm tempted to propose allowing _Static_assert() since it is valuable
> functionality for preventing issues.

How does _Static_assert() come into play here? Also note that we already
use it when available ...

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-20  8:14                 ` Jan Beulich
@ 2023-03-20  8:28                   ` Jan Beulich
  2023-03-20 17:50                     ` Andrew Cooper
                                       ` (3 more replies)
  2023-03-20 15:37                   ` Elliott Mitchell
  1 sibling, 4 replies; 24+ messages in thread
From: Jan Beulich @ 2023-03-20  8:28 UTC (permalink / raw)
  To: Elliott Mitchell; +Cc: xen-devel

On 20.03.2023 09:14, Jan Beulich wrote:
> On 17.03.2023 18:26, Elliott Mitchell wrote:
>> On Fri, Mar 17, 2023 at 09:22:09AM +0100, Jan Beulich wrote:
>>> On 16.03.2023 23:03, Elliott Mitchell wrote:
>>>> On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
>>>>> On 11.03.2023 01:09, Elliott Mitchell wrote:
>>>>>> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
>>>>>>>
>>>>>>> In any event you will want to collect a serial log at maximum verbosity.
>>>>>>> It would also be of interest to know whether turning off the IOMMU avoids
>>>>>>> the issue as well (on the assumption that your system has less than 255
>>>>>>> CPUs).
>>>>>>
>>>>>> I think I might have figured out the situation in a different fashion.
>>>>>>
>>>>>> I was taking a look at the BIOS manual for this motherboard and noticed
>>>>>> a mention of a "Local APIC Mode" setting.  Four values are listed
>>>>>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
>>>>>>
>>>>>> That is the sort of setting I likely left at "Auto" and that may well
>>>>>> result in x2 functionality being disabled.  Perhaps the x2APIC
>>>>>> functionality on AMD is detecting whether the hardware is present, and
>>>>>> failing to test whether it has been enabled?  (could be useful to output
>>>>>> a message suggesting enabling the hardware feature)
>>>>>
>>>>> Can we please move to a little more technical terms here? What is "present"
>>>>> and "enabled" in your view? I don't suppose you mean the CPUID bit (which
>>>>> we check) and the x2APIC-mode-enable one (which we drive as needed). It's
>>>>> also left unclear what the four modes of BIOS operation evaluate to. Even
>>>>> if we knew that, overriding e.g. "Compatibility" (which likely means some
>>>>> form of "disabled" / "hidden") isn't normally an appropriate thing to do.
>>>>> In "Auto" mode Xen likely should work - the only way I could interpret the
>>>>> the other modes are "xAPIC" meaning no x2APIC ACPI tables entries (and
>>>>> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode pre-
>>>>> enabled by firmware, and "Auto" leaving it to the OS to select. Yet that's
>>>>> speculation on my part ...
>>>>
>>>> I provided the information I had discovered.  There is a setting for this
>>>> motherboard (likely present on some similar motherboards) which /may/
>>>> effect the issue.  I doubt I've tried "compatibility", but none of the
>>>> values I've tried have gotten the system to boot without "x2apic=false"
>>>> on Xen's command-line.
>>>>
>>>> When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended Features:"
>>>> I see the line "(XEN) - x2APIC".  Later is the line
>>>> "(XEN) x2APIC mode is already enabled by BIOS."  I'll guess "Auto"
>>>> leaves the x2APIC turned off since neither line is present.
>>>
>>> When "(XEN) - x2APIC" is absent the IOMMU can't be switched into x2APIC
>>> mode. Are you sure that's the case when using "Auto"?
>>
>> grep -eAPIC\ driver -e-\ x2APIC:
>>
>> "Auto":
>> (XEN) Using APIC driver default
>> (XEN) Overriding APIC driver with bigsmp
>> (XEN) Switched to APIC driver x2apic_cluster
>>
>> "x2APIC":
>> (XEN) Using APIC driver x2apic_cluster
>> (XEN) - x2APIC
>>
>> Yes, I'm sure.
> 
> Okay, this then means we're running in a mode we don't mean to run
> in: When the IOMMU claims to not support x2APIC mode (which is odd in
> the first place when at the same time the CPU reports x2APIC mode as
> supported), amd_iommu_prepare() is intended to switch interrupt
> remapping mode to "restricted" (which in turn would force x2APIC mode
> to "physical", not "clustered"). I notice though that there are a
> number of error paths in the function which bypass this setting. Could
> you add a couple of printk()s to understand which path is taken (each
> time; the function can be called more than once)?

I think I've spotted at least one issue. Could you give the patch below
a try please? (Patch is fine for master and 4.17 but would need context
adjustment for 4.16.)

Jan

AMD/IOMMU: without XT, x2APIC needs to be forced into physical mode

An earlier change with the same title (commit 1ba66a870eba) altered only
the path where x2apic_phys was already set to false (perhaps from the
command line). The same of course needs applying when the variable
wasn't modified yet from its initial value.

Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- unstable.orig/xen/arch/x86/genapic/x2apic.c
+++ unstable/xen/arch/x86/genapic/x2apic.c
@@ -236,11 +236,11 @@ const struct genapic *__init apic_x2apic
     if ( x2apic_phys < 0 )
     {
         /*
-         * Force physical mode if there's no interrupt remapping support: The
-         * ID in clustered mode requires a 32 bit destination field due to
+         * Force physical mode if there's no (full) interrupt remapping support:
+         * The ID in clustered mode requires a 32 bit destination field due to
          * the usage of the high 16 bits to hold the cluster ID.
          */
-        x2apic_phys = !iommu_intremap ||
+        x2apic_phys = iommu_intremap != iommu_intremap_full ||
                       (acpi_gbl_FADT.flags & ACPI_FADT_APIC_PHYSICAL) ||
                       (IS_ENABLED(CONFIG_X2APIC_PHYSICAL) &&
                        !(acpi_gbl_FADT.flags & ACPI_FADT_APIC_CLUSTER));




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-20  8:14                 ` Jan Beulich
  2023-03-20  8:28                   ` Jan Beulich
@ 2023-03-20 15:37                   ` Elliott Mitchell
  2023-03-20 15:52                     ` Jan Beulich
  1 sibling, 1 reply; 24+ messages in thread
From: Elliott Mitchell @ 2023-03-20 15:37 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Neowutran

[-- Attachment #1: Type: text/plain, Size: 6585 bytes --]

On Mon, Mar 20, 2023 at 09:14:17AM +0100, Jan Beulich wrote:
> On 17.03.2023 18:26, Elliott Mitchell wrote:
> > On Fri, Mar 17, 2023 at 09:22:09AM +0100, Jan Beulich wrote:
> >> On 16.03.2023 23:03, Elliott Mitchell wrote:
> >>> On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
> >>>> On 11.03.2023 01:09, Elliott Mitchell wrote:
> >>>>> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
> >>>>>>
> >>>>>> In any event you will want to collect a serial log at maximum verbosity.
> >>>>>> It would also be of interest to know whether turning off the IOMMU avoids
> >>>>>> the issue as well (on the assumption that your system has less than 255
> >>>>>> CPUs).
> >>>>>
> >>>>> I think I might have figured out the situation in a different fashion.
> >>>>>
> >>>>> I was taking a look at the BIOS manual for this motherboard and noticed
> >>>>> a mention of a "Local APIC Mode" setting.  Four values are listed
> >>>>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
> >>>>>
> >>>>> That is the sort of setting I likely left at "Auto" and that may well
> >>>>> result in x2 functionality being disabled.  Perhaps the x2APIC
> >>>>> functionality on AMD is detecting whether the hardware is present, and
> >>>>> failing to test whether it has been enabled?  (could be useful to output
> >>>>> a message suggesting enabling the hardware feature)
> >>>>
> >>>> Can we please move to a little more technical terms here? What is "present"
> >>>> and "enabled" in your view? I don't suppose you mean the CPUID bit (which
> >>>> we check) and the x2APIC-mode-enable one (which we drive as needed). It's
> >>>> also left unclear what the four modes of BIOS operation evaluate to. Even
> >>>> if we knew that, overriding e.g. "Compatibility" (which likely means some
> >>>> form of "disabled" / "hidden") isn't normally an appropriate thing to do.
> >>>> In "Auto" mode Xen likely should work - the only way I could interpret the
> >>>> the other modes are "xAPIC" meaning no x2APIC ACPI tables entries (and
> >>>> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode pre-
> >>>> enabled by firmware, and "Auto" leaving it to the OS to select. Yet that's
> >>>> speculation on my part ...
> >>>
> >>> I provided the information I had discovered.  There is a setting for this
> >>> motherboard (likely present on some similar motherboards) which /may/
> >>> effect the issue.  I doubt I've tried "compatibility", but none of the
> >>> values I've tried have gotten the system to boot without "x2apic=false"
> >>> on Xen's command-line.
> >>>
> >>> When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended Features:"
> >>> I see the line "(XEN) - x2APIC".  Later is the line
> >>> "(XEN) x2APIC mode is already enabled by BIOS."  I'll guess "Auto"
> >>> leaves the x2APIC turned off since neither line is present.
> >>
> >> When "(XEN) - x2APIC" is absent the IOMMU can't be switched into x2APIC
> >> mode. Are you sure that's the case when using "Auto"?
> > 
> > grep -eAPIC\ driver -e-\ x2APIC:
> > 
> > "Auto":
> > (XEN) Using APIC driver default
> > (XEN) Overriding APIC driver with bigsmp
> > (XEN) Switched to APIC driver x2apic_cluster
> > 
> > "x2APIC":
> > (XEN) Using APIC driver x2apic_cluster
> > (XEN) - x2APIC
> > 
> > Yes, I'm sure.
> 
> Okay, this then means we're running in a mode we don't mean to run
> in: When the IOMMU claims to not support x2APIC mode (which is odd in
> the first place when at the same time the CPU reports x2APIC mode as
> supported), amd_iommu_prepare() is intended to switch interrupt
> remapping mode to "restricted" (which in turn would force x2APIC mode
> to "physical", not "clustered"). I notice though that there are a
> number of error paths in the function which bypass this setting. Could
> you add a couple of printk()s to understand which path is taken (each
> time; the function can be called more than once)?

If I'm reading the logs right, this could sugggest "Local APIC Mode"
setting was modifying the IOMMU side of the APIC, not the processor side
of the APIC setting.

There is also a "Compatibility" value which I haven't tried.  Perhaps
taking a look would be interesting.

> >>> Both cases the line "(XEN) Switched to APIC driver x2apic_cluster" is
> >>> present (so perhaps "Auto" merely doesn't activate it).
> >>
> >> Did you also try "x2apic_phys" on the Xen command line (just to be sure
> >> this isn't a clustered-mode only issue)?
> > 
> > No.  In fact x2apic_cluster is mentioned in all failure cases.
> 
> Could you give physical mode a try, please?

I had taken the previous message as an implicit request to do so.  I was
stating I had not previously done so.  While "x2apic=false" is functional
as a workaround, I really would like to get to the bottom of this.

> >>> Appears error_interrupt() needs locking or some concurrency handling
> >>> mechanism since the last error is jumbled.  With the setting "x2APIC"
> >>> I get a bunch of:
> >>> "(XEN) APIC error on CPU#: 00(08)(XEN) APIC error on CPU#: 00(08)"
> >>> (apparently one for each core)
> >>> Followed by "Receive accept error, Receive accept error," (again,
> >>> apparently one for each core).  Then a bunch of newlines (same pattern).
> >>
> >> This is a known issue, but since the messages shouldn't appear in the
> >> first place so far no-one has bothered to address this.
> > 
> > I won't claim it is the best solution, but see other message.
> > 
> > I'm tempted to propose allowing _Static_assert() since it is valuable
> > functionality for preventing issues.
> 
> How does _Static_assert() come into play here? Also note that we already
> use it when available ...

This is more in relation to the patch.  Appears GCC's C90 mode disables
_Static_assert(), so the _Static_assert(ARRAY_SIZE(args) == 8) had to be
dropped.


Come to think of it, I should have copied Neowutran on this thread as
everyone else has been running into this too.  Perhaps "x2apic=false" has
gotten less attention since it is a single-shot workaround.  Whereas
"tsc_mode = 1" needs to be repeated in every domain's configuration.


Neowutran, I'm unsure whether you remain subscribed to this list or not.
As such I'm attaching a copy of Jan Beulich's proposed patch, which may
(not yet tested) solve the "x2apic=false" issue.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445



[-- Attachment #2: x2apic.patch --]
[-- Type: text/x-diff, Size: 1413 bytes --]

Date: Mon, 20 Mar 2023 09:28:20 +0100
From: Jan Beulich <jbeulich@suse.com>

AMD/IOMMU: without XT, x2APIC needs to be forced into physical mode

An earlier change with the same title (commit 1ba66a870eba) altered only
the path where x2apic_phys was already set to false (perhaps from the
command line). The same of course needs applying when the variable
wasn't modified yet from its initial value.

Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- unstable.orig/xen/arch/x86/genapic/x2apic.c
+++ unstable/xen/arch/x86/genapic/x2apic.c
@@ -236,11 +236,11 @@ const struct genapic *__init apic_x2apic
     if ( x2apic_phys < 0 )
     {
         /*
-         * Force physical mode if there's no interrupt remapping support: The
-         * ID in clustered mode requires a 32 bit destination field due to
+         * Force physical mode if there's no (full) interrupt remapping support:
+         * The ID in clustered mode requires a 32 bit destination field due to
          * the usage of the high 16 bits to hold the cluster ID.
          */
-        x2apic_phys = !iommu_intremap ||
+        x2apic_phys = iommu_intremap != iommu_intremap_full ||
                       (acpi_gbl_FADT.flags & ACPI_FADT_APIC_PHYSICAL) ||
                       (IS_ENABLED(CONFIG_X2APIC_PHYSICAL) &&
                        !(acpi_gbl_FADT.flags & ACPI_FADT_APIC_CLUSTER));



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-20 15:37                   ` Elliott Mitchell
@ 2023-03-20 15:52                     ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2023-03-20 15:52 UTC (permalink / raw)
  To: Elliott Mitchell; +Cc: xen-devel, Neowutran

On 20.03.2023 16:37, Elliott Mitchell wrote:
> On Mon, Mar 20, 2023 at 09:14:17AM +0100, Jan Beulich wrote:
>> On 17.03.2023 18:26, Elliott Mitchell wrote:
>>> I'm tempted to propose allowing _Static_assert() since it is valuable
>>> functionality for preventing issues.
>>
>> How does _Static_assert() come into play here? Also note that we already
>> use it when available ...
> 
> This is more in relation to the patch.  Appears GCC's C90 mode disables
> _Static_assert(), so the _Static_assert(ARRAY_SIZE(args) == 8) had to be
> dropped.

I'm puzzled by this. It's been for a long time that we've been building
with -std=gnu99. Plus you simply open-coded BUILD_BUG_ON() - if you had
used it, it would have taken care of the necessary abstraction for you
anyway.

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-20  8:28                   ` Jan Beulich
@ 2023-03-20 17:50                     ` Andrew Cooper
  2023-03-21  4:19                     ` Elliott Mitchell
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Andrew Cooper @ 2023-03-20 17:50 UTC (permalink / raw)
  To: Jan Beulich, Elliott Mitchell; +Cc: xen-devel

On 20/03/2023 8:28 am, Jan Beulich wrote:
> On 20.03.2023 09:14, Jan Beulich wrote:
>> On 17.03.2023 18:26, Elliott Mitchell wrote:
>>> On Fri, Mar 17, 2023 at 09:22:09AM +0100, Jan Beulich wrote:
>>>> On 16.03.2023 23:03, Elliott Mitchell wrote:
>>>>> On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
>>>>>> On 11.03.2023 01:09, Elliott Mitchell wrote:
>>>>>>> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
>>>>>>>> In any event you will want to collect a serial log at maximum verbosity.
>>>>>>>> It would also be of interest to know whether turning off the IOMMU avoids
>>>>>>>> the issue as well (on the assumption that your system has less than 255
>>>>>>>> CPUs).
>>>>>>> I think I might have figured out the situation in a different fashion.
>>>>>>>
>>>>>>> I was taking a look at the BIOS manual for this motherboard and noticed
>>>>>>> a mention of a "Local APIC Mode" setting.  Four values are listed
>>>>>>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
>>>>>>>
>>>>>>> That is the sort of setting I likely left at "Auto" and that may well
>>>>>>> result in x2 functionality being disabled.  Perhaps the x2APIC
>>>>>>> functionality on AMD is detecting whether the hardware is present, and
>>>>>>> failing to test whether it has been enabled?  (could be useful to output
>>>>>>> a message suggesting enabling the hardware feature)
>>>>>> Can we please move to a little more technical terms here? What is "present"
>>>>>> and "enabled" in your view? I don't suppose you mean the CPUID bit (which
>>>>>> we check) and the x2APIC-mode-enable one (which we drive as needed). It's
>>>>>> also left unclear what the four modes of BIOS operation evaluate to. Even
>>>>>> if we knew that, overriding e.g. "Compatibility" (which likely means some
>>>>>> form of "disabled" / "hidden") isn't normally an appropriate thing to do.
>>>>>> In "Auto" mode Xen likely should work - the only way I could interpret the
>>>>>> the other modes are "xAPIC" meaning no x2APIC ACPI tables entries (and
>>>>>> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode pre-
>>>>>> enabled by firmware, and "Auto" leaving it to the OS to select. Yet that's
>>>>>> speculation on my part ...
>>>>> I provided the information I had discovered.  There is a setting for this
>>>>> motherboard (likely present on some similar motherboards) which /may/
>>>>> effect the issue.  I doubt I've tried "compatibility", but none of the
>>>>> values I've tried have gotten the system to boot without "x2apic=false"
>>>>> on Xen's command-line.
>>>>>
>>>>> When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended Features:"
>>>>> I see the line "(XEN) - x2APIC".  Later is the line
>>>>> "(XEN) x2APIC mode is already enabled by BIOS."  I'll guess "Auto"
>>>>> leaves the x2APIC turned off since neither line is present.
>>>> When "(XEN) - x2APIC" is absent the IOMMU can't be switched into x2APIC
>>>> mode. Are you sure that's the case when using "Auto"?
>>> grep -eAPIC\ driver -e-\ x2APIC:
>>>
>>> "Auto":
>>> (XEN) Using APIC driver default
>>> (XEN) Overriding APIC driver with bigsmp
>>> (XEN) Switched to APIC driver x2apic_cluster
>>>
>>> "x2APIC":
>>> (XEN) Using APIC driver x2apic_cluster
>>> (XEN) - x2APIC
>>>
>>> Yes, I'm sure.
>> Okay, this then means we're running in a mode we don't mean to run
>> in: When the IOMMU claims to not support x2APIC mode (which is odd in
>> the first place when at the same time the CPU reports x2APIC mode as
>> supported), amd_iommu_prepare() is intended to switch interrupt
>> remapping mode to "restricted" (which in turn would force x2APIC mode
>> to "physical", not "clustered"). I notice though that there are a
>> number of error paths in the function which bypass this setting. Could
>> you add a couple of printk()s to understand which path is taken (each
>> time; the function can be called more than once)?
> I think I've spotted at least one issue. Could you give the patch below
> a try please? (Patch is fine for master and 4.17 but would need context
> adjustment for 4.16.)
>
> Jan
>
> AMD/IOMMU: without XT, x2APIC needs to be forced into physical mode
>
> An earlier change with the same title (commit 1ba66a870eba) altered only
> the path where x2apic_phys was already set to false (perhaps from the
> command line). The same of course needs applying when the variable
> wasn't modified yet from its initial value.
>
> Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

I think it's worth saying that for diagnosing purposes, if
x2apic_phys=true also resolves the issue, then it is likely related to this.

~Andrew

>
> --- unstable.orig/xen/arch/x86/genapic/x2apic.c
> +++ unstable/xen/arch/x86/genapic/x2apic.c
> @@ -236,11 +236,11 @@ const struct genapic *__init apic_x2apic
>      if ( x2apic_phys < 0 )
>      {
>          /*
> -         * Force physical mode if there's no interrupt remapping support: The
> -         * ID in clustered mode requires a 32 bit destination field due to
> +         * Force physical mode if there's no (full) interrupt remapping support:
> +         * The ID in clustered mode requires a 32 bit destination field due to
>           * the usage of the high 16 bits to hold the cluster ID.
>           */
> -        x2apic_phys = !iommu_intremap ||
> +        x2apic_phys = iommu_intremap != iommu_intremap_full ||
>                        (acpi_gbl_FADT.flags & ACPI_FADT_APIC_PHYSICAL) ||
>                        (IS_ENABLED(CONFIG_X2APIC_PHYSICAL) &&
>                         !(acpi_gbl_FADT.flags & ACPI_FADT_APIC_CLUSTER));
>
>
>



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-20  8:28                   ` Jan Beulich
  2023-03-20 17:50                     ` Andrew Cooper
@ 2023-03-21  4:19                     ` Elliott Mitchell
  2023-03-21  7:13                       ` Jan Beulich
  2023-03-21 14:34                       ` Neowutran
  2023-03-21 10:46                     ` Marek Marczykowski-Górecki
  2023-04-30 17:16                     ` Elliott Mitchell
  3 siblings, 2 replies; 24+ messages in thread
From: Elliott Mitchell @ 2023-03-21  4:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Neowutran

On Mon, Mar 20, 2023 at 09:28:20AM +0100, Jan Beulich wrote:
> On 20.03.2023 09:14, Jan Beulich wrote:
> > On 17.03.2023 18:26, Elliott Mitchell wrote:
> >> On Fri, Mar 17, 2023 at 09:22:09AM +0100, Jan Beulich wrote:
> >>> On 16.03.2023 23:03, Elliott Mitchell wrote:
> >>>> On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
> >>>>> On 11.03.2023 01:09, Elliott Mitchell wrote:
> >>>>>> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
> >>>>>>>
> >>>>>>> In any event you will want to collect a serial log at maximum verbosity.
> >>>>>>> It would also be of interest to know whether turning off the IOMMU avoids
> >>>>>>> the issue as well (on the assumption that your system has less than 255
> >>>>>>> CPUs).
> >>>>>>
> >>>>>> I think I might have figured out the situation in a different fashion.
> >>>>>>
> >>>>>> I was taking a look at the BIOS manual for this motherboard and noticed
> >>>>>> a mention of a "Local APIC Mode" setting.  Four values are listed
> >>>>>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
> >>>>>>
> >>>>>> That is the sort of setting I likely left at "Auto" and that may well
> >>>>>> result in x2 functionality being disabled.  Perhaps the x2APIC
> >>>>>> functionality on AMD is detecting whether the hardware is present, and
> >>>>>> failing to test whether it has been enabled?  (could be useful to output
> >>>>>> a message suggesting enabling the hardware feature)
> >>>>>
> >>>>> Can we please move to a little more technical terms here? What is "present"
> >>>>> and "enabled" in your view? I don't suppose you mean the CPUID bit (which
> >>>>> we check) and the x2APIC-mode-enable one (which we drive as needed). It's
> >>>>> also left unclear what the four modes of BIOS operation evaluate to. Even
> >>>>> if we knew that, overriding e.g. "Compatibility" (which likely means some
> >>>>> form of "disabled" / "hidden") isn't normally an appropriate thing to do.
> >>>>> In "Auto" mode Xen likely should work - the only way I could interpret the
> >>>>> the other modes are "xAPIC" meaning no x2APIC ACPI tables entries (and
> >>>>> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode pre-
> >>>>> enabled by firmware, and "Auto" leaving it to the OS to select. Yet that's
> >>>>> speculation on my part ...
> >>>>
> >>>> I provided the information I had discovered.  There is a setting for this
> >>>> motherboard (likely present on some similar motherboards) which /may/
> >>>> effect the issue.  I doubt I've tried "compatibility", but none of the
> >>>> values I've tried have gotten the system to boot without "x2apic=false"
> >>>> on Xen's command-line.
> >>>>
> >>>> When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended Features:"
> >>>> I see the line "(XEN) - x2APIC".  Later is the line
> >>>> "(XEN) x2APIC mode is already enabled by BIOS."  I'll guess "Auto"
> >>>> leaves the x2APIC turned off since neither line is present.
> >>>
> >>> When "(XEN) - x2APIC" is absent the IOMMU can't be switched into x2APIC
> >>> mode. Are you sure that's the case when using "Auto"?
> >>
> >> grep -eAPIC\ driver -e-\ x2APIC:
> >>
> >> "Auto":
> >> (XEN) Using APIC driver default
> >> (XEN) Overriding APIC driver with bigsmp
> >> (XEN) Switched to APIC driver x2apic_cluster
> >>
> >> "x2APIC":
> >> (XEN) Using APIC driver x2apic_cluster
> >> (XEN) - x2APIC
> >>
> >> Yes, I'm sure.
> > 
> > Okay, this then means we're running in a mode we don't mean to run
> > in: When the IOMMU claims to not support x2APIC mode (which is odd in
> > the first place when at the same time the CPU reports x2APIC mode as
> > supported), amd_iommu_prepare() is intended to switch interrupt
> > remapping mode to "restricted" (which in turn would force x2APIC mode
> > to "physical", not "clustered"). I notice though that there are a
> > number of error paths in the function which bypass this setting. Could
> > you add a couple of printk()s to understand which path is taken (each
> > time; the function can be called more than once)?
> 
> I think I've spotted at least one issue. Could you give the patch below
> a try please? (Patch is fine for master and 4.17 but would need context
> adjustment for 4.16.)


> AMD/IOMMU: without XT, x2APIC needs to be forced into physical mode
> 
> An earlier change with the same title (commit 1ba66a870eba) altered only
> the path where x2apic_phys was already set to false (perhaps from the
> command line). The same of course needs applying when the variable
> wasn't modified yet from its initial value.
> 
> Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

This does appear to be an improvement.  With this the system boots if
the "Local APIC Mode" setting is "auto".  As you may have guessed,
"(XEN) Switched to APIC driver x2apic_phys".



When I tried setting "Local APIC Mode" to "x2APIC" though things didn't
go so well.  Sometime >15 seconds after Domain 0 boots, first:

"(XEN) APIC error on CPU#: 00(08), Receive accept error" (looks to be
for every core)

Then:
"(XEN) APIC error on CPU#: 08(08), Receive accept error" (again for
every core, but *after* the above has appeared for all cores)

The above appears about twice for each core, then I start seeing
"(XEN) CPU#: No irq handler for vector ?? (IRQ -2147483648, LAPIC)"

The core doesn't vary too much with this, but the vector varies some.

Upon looking "(XEN) Using APIC driver x2apic_cluster".  Unfortunately
I didn't try booting with x2apic_phys forced with this setting.

So x2apic_cluster is looking like a <ahem> on recent AMD processors.


I'm unsure this qualifies as "Tested-by".  Certainly it IS an
improvement, but the problem certainly isn't 100% solved.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-21  4:19                     ` Elliott Mitchell
@ 2023-03-21  7:13                       ` Jan Beulich
  2023-03-21 22:19                         ` Elliott Mitchell
  2023-03-21 14:34                       ` Neowutran
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-03-21  7:13 UTC (permalink / raw)
  To: Elliott Mitchell; +Cc: xen-devel, Neowutran

On 21.03.2023 05:19, Elliott Mitchell wrote:
> On Mon, Mar 20, 2023 at 09:28:20AM +0100, Jan Beulich wrote:
>> AMD/IOMMU: without XT, x2APIC needs to be forced into physical mode
>>
>> An earlier change with the same title (commit 1ba66a870eba) altered only
>> the path where x2apic_phys was already set to false (perhaps from the
>> command line). The same of course needs applying when the variable
>> wasn't modified yet from its initial value.
>>
>> Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> This does appear to be an improvement.  With this the system boots if
> the "Local APIC Mode" setting is "auto".  As you may have guessed,
> "(XEN) Switched to APIC driver x2apic_phys".
> 
> 
> 
> When I tried setting "Local APIC Mode" to "x2APIC" though things didn't
> go so well.  Sometime >15 seconds after Domain 0 boots, first:
> 
> "(XEN) APIC error on CPU#: 00(08), Receive accept error" (looks to be
> for every core)
> 
> Then:
> "(XEN) APIC error on CPU#: 08(08), Receive accept error" (again for
> every core, but *after* the above has appeared for all cores)

Receive accept errors generally mean a bad vector was received, yet the
sending side deemed it fine. That could be a bad I/O APIC RTE, a bad MSI
message data value, or a bad translation thereof into an IRTE (albeit
iirc we never alter the vector).

> The above appears about twice for each core, then I start seeing
> "(XEN) CPU#: No irq handler for vector ?? (IRQ -2147483648, LAPIC)"
> 
> The core doesn't vary too much with this, but the vector varies some.
> 
> Upon looking "(XEN) Using APIC driver x2apic_cluster".  Unfortunately
> I didn't try booting with x2apic_phys forced with this setting.

My guess is that this would also help. But the system should still work
correctly in clustered mode. As a first step I guess debug key 'i', 'z',
and 'M' output may provide some insight. But the request for a full log
at maximum verbosity also remains (ideally with a debug hypervisor).

> So x2apic_cluster is looking like a <ahem> on recent AMD processors.
> 
> 
> I'm unsure this qualifies as "Tested-by".  Certainly it IS an
> improvement, but the problem certainly isn't 100% solved.

There simply are multiple problems; one looks to be solved now.

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-20  8:28                   ` Jan Beulich
  2023-03-20 17:50                     ` Andrew Cooper
  2023-03-21  4:19                     ` Elliott Mitchell
@ 2023-03-21 10:46                     ` Marek Marczykowski-Górecki
  2023-04-30 17:16                     ` Elliott Mitchell
  3 siblings, 0 replies; 24+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-03-21 10:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Elliott Mitchell, xen-devel

[-- Attachment #1: Type: text/plain, Size: 6079 bytes --]

On Mon, Mar 20, 2023 at 09:28:20AM +0100, Jan Beulich wrote:
> On 20.03.2023 09:14, Jan Beulich wrote:
> > On 17.03.2023 18:26, Elliott Mitchell wrote:
> >> On Fri, Mar 17, 2023 at 09:22:09AM +0100, Jan Beulich wrote:
> >>> On 16.03.2023 23:03, Elliott Mitchell wrote:
> >>>> On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
> >>>>> On 11.03.2023 01:09, Elliott Mitchell wrote:
> >>>>>> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
> >>>>>>>
> >>>>>>> In any event you will want to collect a serial log at maximum verbosity.
> >>>>>>> It would also be of interest to know whether turning off the IOMMU avoids
> >>>>>>> the issue as well (on the assumption that your system has less than 255
> >>>>>>> CPUs).
> >>>>>>
> >>>>>> I think I might have figured out the situation in a different fashion.
> >>>>>>
> >>>>>> I was taking a look at the BIOS manual for this motherboard and noticed
> >>>>>> a mention of a "Local APIC Mode" setting.  Four values are listed
> >>>>>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
> >>>>>>
> >>>>>> That is the sort of setting I likely left at "Auto" and that may well
> >>>>>> result in x2 functionality being disabled.  Perhaps the x2APIC
> >>>>>> functionality on AMD is detecting whether the hardware is present, and
> >>>>>> failing to test whether it has been enabled?  (could be useful to output
> >>>>>> a message suggesting enabling the hardware feature)
> >>>>>
> >>>>> Can we please move to a little more technical terms here? What is "present"
> >>>>> and "enabled" in your view? I don't suppose you mean the CPUID bit (which
> >>>>> we check) and the x2APIC-mode-enable one (which we drive as needed). It's
> >>>>> also left unclear what the four modes of BIOS operation evaluate to. Even
> >>>>> if we knew that, overriding e.g. "Compatibility" (which likely means some
> >>>>> form of "disabled" / "hidden") isn't normally an appropriate thing to do.
> >>>>> In "Auto" mode Xen likely should work - the only way I could interpret the
> >>>>> the other modes are "xAPIC" meaning no x2APIC ACPI tables entries (and
> >>>>> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode pre-
> >>>>> enabled by firmware, and "Auto" leaving it to the OS to select. Yet that's
> >>>>> speculation on my part ...
> >>>>
> >>>> I provided the information I had discovered.  There is a setting for this
> >>>> motherboard (likely present on some similar motherboards) which /may/
> >>>> effect the issue.  I doubt I've tried "compatibility", but none of the
> >>>> values I've tried have gotten the system to boot without "x2apic=false"
> >>>> on Xen's command-line.
> >>>>
> >>>> When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended Features:"
> >>>> I see the line "(XEN) - x2APIC".  Later is the line
> >>>> "(XEN) x2APIC mode is already enabled by BIOS."  I'll guess "Auto"
> >>>> leaves the x2APIC turned off since neither line is present.
> >>>
> >>> When "(XEN) - x2APIC" is absent the IOMMU can't be switched into x2APIC
> >>> mode. Are you sure that's the case when using "Auto"?
> >>
> >> grep -eAPIC\ driver -e-\ x2APIC:
> >>
> >> "Auto":
> >> (XEN) Using APIC driver default
> >> (XEN) Overriding APIC driver with bigsmp
> >> (XEN) Switched to APIC driver x2apic_cluster
> >>
> >> "x2APIC":
> >> (XEN) Using APIC driver x2apic_cluster
> >> (XEN) - x2APIC
> >>
> >> Yes, I'm sure.
> > 
> > Okay, this then means we're running in a mode we don't mean to run
> > in: When the IOMMU claims to not support x2APIC mode (which is odd in
> > the first place when at the same time the CPU reports x2APIC mode as
> > supported), amd_iommu_prepare() is intended to switch interrupt
> > remapping mode to "restricted" (which in turn would force x2APIC mode
> > to "physical", not "clustered"). I notice though that there are a
> > number of error paths in the function which bypass this setting. Could
> > you add a couple of printk()s to understand which path is taken (each
> > time; the function can be called more than once)?
> 
> I think I've spotted at least one issue. Could you give the patch below
> a try please? (Patch is fine for master and 4.17 but would need context
> adjustment for 4.16.)

A college has similar issue, where an AMD system hangs during PV dom0
boot, seems like (at least) nvme's interrupts are not delivered. Setting
x2apic_phys=true solves the issue, which hopefully means the patch below
will help too (unfortunately can't test it right now). It's Xen 4.14.

> Jan
> 
> AMD/IOMMU: without XT, x2APIC needs to be forced into physical mode
> 
> An earlier change with the same title (commit 1ba66a870eba) altered only
> the path where x2apic_phys was already set to false (perhaps from the
> command line). The same of course needs applying when the variable
> wasn't modified yet from its initial value.
> 
> Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- unstable.orig/xen/arch/x86/genapic/x2apic.c
> +++ unstable/xen/arch/x86/genapic/x2apic.c
> @@ -236,11 +236,11 @@ const struct genapic *__init apic_x2apic
>      if ( x2apic_phys < 0 )
>      {
>          /*
> -         * Force physical mode if there's no interrupt remapping support: The
> -         * ID in clustered mode requires a 32 bit destination field due to
> +         * Force physical mode if there's no (full) interrupt remapping support:
> +         * The ID in clustered mode requires a 32 bit destination field due to
>           * the usage of the high 16 bits to hold the cluster ID.
>           */
> -        x2apic_phys = !iommu_intremap ||
> +        x2apic_phys = iommu_intremap != iommu_intremap_full ||
>                        (acpi_gbl_FADT.flags & ACPI_FADT_APIC_PHYSICAL) ||
>                        (IS_ENABLED(CONFIG_X2APIC_PHYSICAL) &&
>                         !(acpi_gbl_FADT.flags & ACPI_FADT_APIC_CLUSTER));
> 
> 
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-21  4:19                     ` Elliott Mitchell
  2023-03-21  7:13                       ` Jan Beulich
@ 2023-03-21 14:34                       ` Neowutran
  1 sibling, 0 replies; 24+ messages in thread
From: Neowutran @ 2023-03-21 14:34 UTC (permalink / raw)
  To: Elliott Mitchell; +Cc: Jan Beulich, xen-devel, Neowutran

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/p lain; charset=utf-8, Size: 1350 bytes --]

> > >>>>>> I was taking a look at the BIOS manual for this motherboard and noticed
> > >>>>>> a mention of a "Local APIC Mode" setting.  Four values are listed
> > >>>>>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
> 
> This does appear to be an improvement.  With this the system boots if
> the "Local APIC Mode" setting is "auto".  As you may have guessed,
> "(XEN) Switched to APIC driver x2apic_phys".
> 
> 
> 
> When I tried setting "Local APIC Mode" to "x2APIC" though things didn't
> go so well.  

I confirm what Elliott said, just tested the patch on my computer (ryzen 7000 series). 

Before, I was using the workaround "x2apic=false" in the xen boot
option. 

After applying the patch, when "Local APIC Mode" is set to "auto" ( or
at least not "x2APIC"), then it work without needing the "x2apic=false"
workaround. 
If "Local APIC Mode" is set to "X2APIC", then it is stuck waiting for
nvme \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-21  7:13                       ` Jan Beulich
@ 2023-03-21 22:19                         ` Elliott Mitchell
  0 siblings, 0 replies; 24+ messages in thread
From: Elliott Mitchell @ 2023-03-21 22:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Neowutran

On Tue, Mar 21, 2023 at 08:13:15AM +0100, Jan Beulich wrote:
> On 21.03.2023 05:19, Elliott Mitchell wrote:
> 
> > The above appears about twice for each core, then I start seeing
> > "(XEN) CPU#: No irq handler for vector ?? (IRQ -2147483648, LAPIC)"
> > 
> > The core doesn't vary too much with this, but the vector varies some.
> > 
> > Upon looking "(XEN) Using APIC driver x2apic_cluster".  Unfortunately
> > I didn't try booting with x2apic_phys forced with this setting.
> 
> My guess is that this would also help. But the system should still work
> correctly in clustered mode. As a first step I guess debug key 'i', 'z',
> and 'M' output may provide some insight. But the request for a full log
> at maximum verbosity also remains (ideally with a debug hypervisor).

Needs a secure channel (PGP most likely) for everything, since there is
too much information in there.  Serial numbers and MAC addresses are a
potential source of attack (or faking returns).  Smaller segments can be
had more readily:

(XEN) SMBIOS 3.5 present.
(XEN) x2APIC mode is already enabled by BIOS.
(XEN) Using APIC driver x2apic_cluster
(XEN) ACPI: PM-Timer IO Port: 0x808 (32 bits)
(XEN) ACPI: v5 SLEEP INFO: control[0:0], status[0:0]
(XEN) ACPI: SLEEP INFO: pm1x_cnt[1:804,1:0], pm1x_evt[1:800,1:0]
(XEN) ACPI: 32/64X FACS address mismatch in FADT - 785a3000/0000000000000000, using 32
(XEN) ACPI:             wakeup_vec[785a300c], vec_size[20]
(XEN) ACPI: Local APIC address 0xfee00000
(XEN) ACPI: IOAPIC (id[0x20] address[0xfec00000] gsi_base[0])
(XEN) IOAPIC[0]: apic_id 32, version 33, address 0xfec00000, GSI 0-23
(XEN) ACPI: IOAPIC (id[0x21] address[0xfec01000] gsi_base[24])
(XEN) IOAPIC[1]: apic_id 33, version 33, address 0xfec01000, GSI 24-55
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
(XEN) ACPI: HPET id: 0x######## base: 0xfed00000
(XEN) Using ACPI (MADT) for SMP configuration information
(XEN) IRQ limits: 56 GSI, 6600 MSI/MSI-X
(XEN) AMD-Vi: IOMMU Extended Features:
(XEN) - Peripheral Page Service Request
(XEN) - x2APIC
(XEN) - NX bit
(XEN) - Guest APIC Physical Processor Interrupt
(XEN) - Invalidate All Command
(XEN) - Guest APIC
(XEN) - Performance Counters
(XEN) - Host Address Translation Size: 0x2
(XEN) - Guest Address Translation Size: 0
(XEN) - Guest CR3 Root Table Level: 0x1
(XEN) - Maximum PASID: 0xf
(XEN) - SMI Filter Register: 0x1
(XEN) - SMI Filter Register Count: 0x1
(XEN) - Guest Virtual APIC Modes: 0x1
(XEN) - Dual PPR Log: 0x2
(XEN) - Dual Event Log: 0x2
(XEN) - Secure ATS
(XEN) - User / Supervisor Page Protection
(XEN) - Device Table Segmentation: 0x3
(XEN) - PPR Log Overflow Early Warning
(XEN) - PPR Automatic Response
(XEN) - Memory Access Routing and Control: 0x1
(XEN) - Block StopMark Message
(XEN) - Performance Optimization
(XEN) - MSI Capability MMIO Access
(XEN) - Guest I/O Protection
(XEN) - Enhanced PPR Handling
(XEN) - Invalidate IOTLB Type
(XEN) - VM Table Size: 0x2
(XEN) - Guest Access Bit Update Disable
(XEN) AMD-Vi: Disabled HAP memory map sharing with IOMMU
(XEN) AMD-Vi: IOMMU 0 Enabled.

I'm a bit concerned how all the reports so far are ASUS motherboards.
This could mean people getting the latest, greatest and using Xen tend
towards ASUS motherboards.  This could also mean ASUS's engineering team
did something to their latest round.  Both are possible.

Could well be the latest round from AMD include more pieces from their
server processors, which trigger x2apic_cluster as default.  Yet didn't
bring some extra portion(s) which are required by x2apic_cluster.


> > So x2apic_cluster is looking like a <ahem> on recent AMD processors.
> > 
> > 
> > I'm unsure this qualifies as "Tested-by".  Certainly it IS an
> > improvement, but the problem certainly isn't 100% solved.
> 
> There simply are multiple problems; one looks to be solved now.

I agree with that assessment.  Just I'm unsure whether this step is
enough to include "Tested-by".  I'm concerned there could be a single
deeper problem which solves everything at once.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-03-20  8:28                   ` Jan Beulich
                                       ` (2 preceding siblings ...)
  2023-03-21 10:46                     ` Marek Marczykowski-Górecki
@ 2023-04-30 17:16                     ` Elliott Mitchell
  2023-05-02  5:59                       ` Jan Beulich
  3 siblings, 1 reply; 24+ messages in thread
From: Elliott Mitchell @ 2023-04-30 17:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Mon, Mar 20, 2023 at 09:28:20AM +0100, Jan Beulich wrote:
> On 20.03.2023 09:14, Jan Beulich wrote:
> > On 17.03.2023 18:26, Elliott Mitchell wrote:
> >> On Fri, Mar 17, 2023 at 09:22:09AM +0100, Jan Beulich wrote:
> >>> On 16.03.2023 23:03, Elliott Mitchell wrote:
> >>>> On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
> >>>>> On 11.03.2023 01:09, Elliott Mitchell wrote:
> >>>>>> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
> >>>>>>>
> >>>>>>> In any event you will want to collect a serial log at maximum verbosity.
> >>>>>>> It would also be of interest to know whether turning off the IOMMU avoids
> >>>>>>> the issue as well (on the assumption that your system has less than 255
> >>>>>>> CPUs).
> >>>>>>
> >>>>>> I think I might have figured out the situation in a different fashion.
> >>>>>>
> >>>>>> I was taking a look at the BIOS manual for this motherboard and noticed
> >>>>>> a mention of a "Local APIC Mode" setting.  Four values are listed
> >>>>>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
> >>>>>>
> >>>>>> That is the sort of setting I likely left at "Auto" and that may well
> >>>>>> result in x2 functionality being disabled.  Perhaps the x2APIC
> >>>>>> functionality on AMD is detecting whether the hardware is present, and
> >>>>>> failing to test whether it has been enabled?  (could be useful to output
> >>>>>> a message suggesting enabling the hardware feature)
> >>>>>
> >>>>> Can we please move to a little more technical terms here? What is "present"
> >>>>> and "enabled" in your view? I don't suppose you mean the CPUID bit (which
> >>>>> we check) and the x2APIC-mode-enable one (which we drive as needed). It's
> >>>>> also left unclear what the four modes of BIOS operation evaluate to. Even
> >>>>> if we knew that, overriding e.g. "Compatibility" (which likely means some
> >>>>> form of "disabled" / "hidden") isn't normally an appropriate thing to do.
> >>>>> In "Auto" mode Xen likely should work - the only way I could interpret the
> >>>>> the other modes are "xAPIC" meaning no x2APIC ACPI tables entries (and
> >>>>> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode pre-
> >>>>> enabled by firmware, and "Auto" leaving it to the OS to select. Yet that's
> >>>>> speculation on my part ...
> >>>>
> >>>> I provided the information I had discovered.  There is a setting for this
> >>>> motherboard (likely present on some similar motherboards) which /may/
> >>>> effect the issue.  I doubt I've tried "compatibility", but none of the
> >>>> values I've tried have gotten the system to boot without "x2apic=false"
> >>>> on Xen's command-line.
> >>>>
> >>>> When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended Features:"
> >>>> I see the line "(XEN) - x2APIC".  Later is the line
> >>>> "(XEN) x2APIC mode is already enabled by BIOS."  I'll guess "Auto"
> >>>> leaves the x2APIC turned off since neither line is present.
> >>>
> >>> When "(XEN) - x2APIC" is absent the IOMMU can't be switched into x2APIC
> >>> mode. Are you sure that's the case when using "Auto"?
> >>
> >> grep -eAPIC\ driver -e-\ x2APIC:
> >>
> >> "Auto":
> >> (XEN) Using APIC driver default
> >> (XEN) Overriding APIC driver with bigsmp
> >> (XEN) Switched to APIC driver x2apic_cluster
> >>
> >> "x2APIC":
> >> (XEN) Using APIC driver x2apic_cluster
> >> (XEN) - x2APIC
> >>
> >> Yes, I'm sure.
> > 
> > Okay, this then means we're running in a mode we don't mean to run
> > in: When the IOMMU claims to not support x2APIC mode (which is odd in
> > the first place when at the same time the CPU reports x2APIC mode as
> > supported), amd_iommu_prepare() is intended to switch interrupt
> > remapping mode to "restricted" (which in turn would force x2APIC mode
> > to "physical", not "clustered"). I notice though that there are a
> > number of error paths in the function which bypass this setting. Could
> > you add a couple of printk()s to understand which path is taken (each
> > time; the function can be called more than once)?
> 
> I think I've spotted at least one issue. Could you give the patch below
> a try please? (Patch is fine for master and 4.17 but would need context
> adjustment for 4.16.)

Given the patch didn't fix the problem, that wasn't the issue.  I did
though manage to try another variant of BIOS settings for this
motherboard.  Setting "Local APIC Mode" to "x2APIC" in the BIOS neither
breaks anything additional, nor fixes issues.  What was in Xen's dmesg
did change slightly and looks likely better for my purposes.  Some more
snippets from 4.17 Xen dmesg, with "x2apic_phys=true":

(XEN) AMD-Vi: IOMMU Extended Features:
(XEN) - Peripheral Page Service Request
(XEN) - x2APIC
(XEN) - NX bit
(XEN) - Guest APIC Physical Processor Interrupt
(XEN) - Invalidate All Command
(XEN) - Guest APIC
(XEN) - Performance Counters
(XEN) - Host Address Translation Size: 0x2
(XEN) - Guest Address Translation Size: 0
(XEN) - Guest CR3 Root Table Level: 0x1
(XEN) - Maximum PASID: 0xf
(XEN) - SMI Filter Register: 0x1
(XEN) - SMI Filter Register Count: 0x1
(XEN) - Guest Virtual APIC Modes: 0x1
(XEN) - Dual PPR Log: 0x2
(XEN) - Dual Event Log: 0x2
(XEN) - Secure ATS
(XEN) - User / Supervisor Page Protection
(XEN) - Device Table Segmentation: 0x3
(XEN) - PPR Log Overflow Early Warning
(XEN) - PPR Automatic Response
(XEN) - Memory Access Routing and Control: 0x1
(XEN) - Block StopMark Message
(XEN) - Performance Optimization
(XEN) - MSI Capability MMIO Access
(XEN) - Guest I/O Protection
(XEN) - Enhanced PPR Handling
(XEN) - Invalidate IOTLB Type
(XEN) - VM Table Size: 0x2
(XEN) - Guest Access Bit Update Disable
(XEN) AMD-Vi: Disabled HAP memory map sharing with IOMMU
(XEN) AMD-Vi: IOMMU 0 Enabled.


(XEN) I/O virtualisation enabled
(XEN)  - Dom0 mode: Relaxed
(XEN) Interrupt remapping enabled
(XEN) nr_sockets: 1
(XEN) Enabled directed EOI with ioapic_ack_old on!
(XEN) Enabling APIC mode:  Physical.  Using 2 I/O APICs
(XEN) ENABLING IO-APIC IRQs
(XEN)  -> Using old ACK method


(XEN) SVM: Supported advanced features:
(XEN)  - Nested Page Tables (NPT)
(XEN)  - Last Branch Record (LBR) Virtualisation
(XEN)  - Next-RIP Saved on #VMEXIT
(XEN)  - VMCB Clean Bits
(XEN)  - DecodeAssists
(XEN)  - Virtual VMLOAD/VMSAVE
(XEN)  - Virtual GIF
(XEN)  - Pause-Intercept Filter
(XEN)  - Pause-Intercept Filter Threshold
(XEN)  - TSC Rate MSR
(XEN)  - NPT Supervisor Shadow Stack
(XEN)  - MSR_SPEC_CTRL virtualisation
(XEN) HVM: SVM enabled

If I'm reading that correctly, everything is there for x2APIC.  As such
there seem to be 1 or 2 bugs:

The definite bug is the x2apic_cluster APIC driver fails on recent AMD
processors.

I'm unsure whether selecting the x2apic_cluster APIC driver is correct or
not.  Capabilities you used to only find a multi-socket server
motherboards are now being found on desktop motherboards.  My
understanding is this processor does NUMA on a single die, not merely a
single-socket.  As such it may well need the features of x2apic_cluster,
perhaps the driver assumes nr_socket > 1 which is untrue here?

Does appear "x2apic_phys=true" plus "tsc_mode = 'always_emulate'" are
adaquate workarounds all the way back to 4.14.  Now for the second
correct bugfix.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] x2apic broken with current AMD hardware
  2023-04-30 17:16                     ` Elliott Mitchell
@ 2023-05-02  5:59                       ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2023-05-02  5:59 UTC (permalink / raw)
  To: Elliott Mitchell; +Cc: xen-devel

On 30.04.2023 19:16, Elliott Mitchell wrote:
> On Mon, Mar 20, 2023 at 09:28:20AM +0100, Jan Beulich wrote:
>> On 20.03.2023 09:14, Jan Beulich wrote:
>>> On 17.03.2023 18:26, Elliott Mitchell wrote:
>>>> On Fri, Mar 17, 2023 at 09:22:09AM +0100, Jan Beulich wrote:
>>>>> On 16.03.2023 23:03, Elliott Mitchell wrote:
>>>>>> On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
>>>>>>> On 11.03.2023 01:09, Elliott Mitchell wrote:
>>>>>>>> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
>>>>>>>>>
>>>>>>>>> In any event you will want to collect a serial log at maximum verbosity.
>>>>>>>>> It would also be of interest to know whether turning off the IOMMU avoids
>>>>>>>>> the issue as well (on the assumption that your system has less than 255
>>>>>>>>> CPUs).
>>>>>>>>
>>>>>>>> I think I might have figured out the situation in a different fashion.
>>>>>>>>
>>>>>>>> I was taking a look at the BIOS manual for this motherboard and noticed
>>>>>>>> a mention of a "Local APIC Mode" setting.  Four values are listed
>>>>>>>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
>>>>>>>>
>>>>>>>> That is the sort of setting I likely left at "Auto" and that may well
>>>>>>>> result in x2 functionality being disabled.  Perhaps the x2APIC
>>>>>>>> functionality on AMD is detecting whether the hardware is present, and
>>>>>>>> failing to test whether it has been enabled?  (could be useful to output
>>>>>>>> a message suggesting enabling the hardware feature)
>>>>>>>
>>>>>>> Can we please move to a little more technical terms here? What is "present"
>>>>>>> and "enabled" in your view? I don't suppose you mean the CPUID bit (which
>>>>>>> we check) and the x2APIC-mode-enable one (which we drive as needed). It's
>>>>>>> also left unclear what the four modes of BIOS operation evaluate to. Even
>>>>>>> if we knew that, overriding e.g. "Compatibility" (which likely means some
>>>>>>> form of "disabled" / "hidden") isn't normally an appropriate thing to do.
>>>>>>> In "Auto" mode Xen likely should work - the only way I could interpret the
>>>>>>> the other modes are "xAPIC" meaning no x2APIC ACPI tables entries (and
>>>>>>> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode pre-
>>>>>>> enabled by firmware, and "Auto" leaving it to the OS to select. Yet that's
>>>>>>> speculation on my part ...
>>>>>>
>>>>>> I provided the information I had discovered.  There is a setting for this
>>>>>> motherboard (likely present on some similar motherboards) which /may/
>>>>>> effect the issue.  I doubt I've tried "compatibility", but none of the
>>>>>> values I've tried have gotten the system to boot without "x2apic=false"
>>>>>> on Xen's command-line.
>>>>>>
>>>>>> When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended Features:"
>>>>>> I see the line "(XEN) - x2APIC".  Later is the line
>>>>>> "(XEN) x2APIC mode is already enabled by BIOS."  I'll guess "Auto"
>>>>>> leaves the x2APIC turned off since neither line is present.
>>>>>
>>>>> When "(XEN) - x2APIC" is absent the IOMMU can't be switched into x2APIC
>>>>> mode. Are you sure that's the case when using "Auto"?
>>>>
>>>> grep -eAPIC\ driver -e-\ x2APIC:
>>>>
>>>> "Auto":
>>>> (XEN) Using APIC driver default
>>>> (XEN) Overriding APIC driver with bigsmp
>>>> (XEN) Switched to APIC driver x2apic_cluster
>>>>
>>>> "x2APIC":
>>>> (XEN) Using APIC driver x2apic_cluster
>>>> (XEN) - x2APIC
>>>>
>>>> Yes, I'm sure.
>>>
>>> Okay, this then means we're running in a mode we don't mean to run
>>> in: When the IOMMU claims to not support x2APIC mode (which is odd in
>>> the first place when at the same time the CPU reports x2APIC mode as
>>> supported), amd_iommu_prepare() is intended to switch interrupt
>>> remapping mode to "restricted" (which in turn would force x2APIC mode
>>> to "physical", not "clustered"). I notice though that there are a
>>> number of error paths in the function which bypass this setting. Could
>>> you add a couple of printk()s to understand which path is taken (each
>>> time; the function can be called more than once)?
>>
>> I think I've spotted at least one issue. Could you give the patch below
>> a try please? (Patch is fine for master and 4.17 but would need context
>> adjustment for 4.16.)
> 
> Given the patch didn't fix the problem, that wasn't the issue.  I did
> though manage to try another variant of BIOS settings for this
> motherboard.  Setting "Local APIC Mode" to "x2APIC" in the BIOS neither
> breaks anything additional, nor fixes issues.  What was in Xen's dmesg
> did change slightly and looks likely better for my purposes.  Some more
> snippets from 4.17 Xen dmesg, with "x2apic_phys=true":
> 
> (XEN) AMD-Vi: IOMMU Extended Features:
> (XEN) - Peripheral Page Service Request
> (XEN) - x2APIC
> (XEN) - NX bit
> (XEN) - Guest APIC Physical Processor Interrupt
> (XEN) - Invalidate All Command
> (XEN) - Guest APIC
> (XEN) - Performance Counters
> (XEN) - Host Address Translation Size: 0x2
> (XEN) - Guest Address Translation Size: 0
> (XEN) - Guest CR3 Root Table Level: 0x1
> (XEN) - Maximum PASID: 0xf
> (XEN) - SMI Filter Register: 0x1
> (XEN) - SMI Filter Register Count: 0x1
> (XEN) - Guest Virtual APIC Modes: 0x1
> (XEN) - Dual PPR Log: 0x2
> (XEN) - Dual Event Log: 0x2
> (XEN) - Secure ATS
> (XEN) - User / Supervisor Page Protection
> (XEN) - Device Table Segmentation: 0x3
> (XEN) - PPR Log Overflow Early Warning
> (XEN) - PPR Automatic Response
> (XEN) - Memory Access Routing and Control: 0x1
> (XEN) - Block StopMark Message
> (XEN) - Performance Optimization
> (XEN) - MSI Capability MMIO Access
> (XEN) - Guest I/O Protection
> (XEN) - Enhanced PPR Handling
> (XEN) - Invalidate IOTLB Type
> (XEN) - VM Table Size: 0x2
> (XEN) - Guest Access Bit Update Disable
> (XEN) AMD-Vi: Disabled HAP memory map sharing with IOMMU
> (XEN) AMD-Vi: IOMMU 0 Enabled.
> 
> 
> (XEN) I/O virtualisation enabled
> (XEN)  - Dom0 mode: Relaxed
> (XEN) Interrupt remapping enabled
> (XEN) nr_sockets: 1
> (XEN) Enabled directed EOI with ioapic_ack_old on!
> (XEN) Enabling APIC mode:  Physical.  Using 2 I/O APICs
> (XEN) ENABLING IO-APIC IRQs
> (XEN)  -> Using old ACK method
> 
> 
> (XEN) SVM: Supported advanced features:
> (XEN)  - Nested Page Tables (NPT)
> (XEN)  - Last Branch Record (LBR) Virtualisation
> (XEN)  - Next-RIP Saved on #VMEXIT
> (XEN)  - VMCB Clean Bits
> (XEN)  - DecodeAssists
> (XEN)  - Virtual VMLOAD/VMSAVE
> (XEN)  - Virtual GIF
> (XEN)  - Pause-Intercept Filter
> (XEN)  - Pause-Intercept Filter Threshold
> (XEN)  - TSC Rate MSR
> (XEN)  - NPT Supervisor Shadow Stack
> (XEN)  - MSR_SPEC_CTRL virtualisation
> (XEN) HVM: SVM enabled
> 
> If I'm reading that correctly, everything is there for x2APIC.  As such
> there seem to be 1 or 2 bugs:
> 
> The definite bug is the x2apic_cluster APIC driver fails on recent AMD
> processors.
> 
> I'm unsure whether selecting the x2apic_cluster APIC driver is correct or
> not.  Capabilities you used to only find a multi-socket server
> motherboards are now being found on desktop motherboards.  My
> understanding is this processor does NUMA on a single die, not merely a
> single-socket.  As such it may well need the features of x2apic_cluster,
> perhaps the driver assumes nr_socket > 1 which is untrue here?

Just to answer this one (I don't think there's much more I can do at
this point, without further information): No, there certainly isn't
such an assumption. Iirc the x2APIC code also predates the existence
of the nr_sockets variable (and the respective log line) by quite a
bit.

Jan

> Does appear "x2apic_phys=true" plus "tsc_mode = 'always_emulate'" are
> adaquate workarounds all the way back to 4.14.  Now for the second
> correct bugfix.
> 
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-05-02  6:00 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-08 15:23 [BUG] x2apic broken with current AMD hardware Elliott Mitchell
2023-03-08 15:37 ` Andrew Cooper
2023-03-08 15:50 ` Jan Beulich
2023-03-08 23:08   ` Elliott Mitchell
2023-03-09  9:03     ` Jan Beulich
2023-03-10  3:25       ` Elliott Mitchell
2023-03-10  8:22         ` Jan Beulich
2023-03-11  0:09       ` Elliott Mitchell
2023-03-13  7:01         ` Jan Beulich
2023-03-16 22:03           ` Elliott Mitchell
2023-03-17  8:22             ` Jan Beulich
2023-03-17 17:26               ` Elliott Mitchell
2023-03-20  8:14                 ` Jan Beulich
2023-03-20  8:28                   ` Jan Beulich
2023-03-20 17:50                     ` Andrew Cooper
2023-03-21  4:19                     ` Elliott Mitchell
2023-03-21  7:13                       ` Jan Beulich
2023-03-21 22:19                         ` Elliott Mitchell
2023-03-21 14:34                       ` Neowutran
2023-03-21 10:46                     ` Marek Marczykowski-Górecki
2023-04-30 17:16                     ` Elliott Mitchell
2023-05-02  5:59                       ` Jan Beulich
2023-03-20 15:37                   ` Elliott Mitchell
2023-03-20 15:52                     ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.