* Re: [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs
2021-02-04 12:12 ` [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs Ian Jackson
@ 2021-02-04 12:20 ` Andrew Cooper
2021-02-04 15:15 ` Ian Jackson
2021-02-04 14:20 ` Dario Faggioli
` (2 subsequent siblings)
3 siblings, 1 reply; 11+ messages in thread
From: Andrew Cooper @ 2021-02-04 12:20 UTC (permalink / raw)
To: Ian Jackson, committers, xen-devel
Cc: Dario Faggioli, Jan Beulich, Julien Grall, community.manager
On 04/02/2021 12:12, Ian Jackson wrote:
> OPEN ISSUES
> -----------
>
> A. HPET/PIT issue on newer Intel systems
>
> Information from
> Andrew Cooper <andrew.cooper3@citrix.com>
>
> | This has had literally tens of reports across the devel and users
> | mailing lists, and prevents Xen from booting at all on the past two
> | generations of Intel laptop. I've finally got a repro and posted a
> | fix to the list, but still in progress.
>
> I think Andrew is still on the case here.
Fixed. c/s e1de4c196a from a week ago.
> C. Fallout from MSR handling behavioral change.
>
> Information from
> Jan Beulich <jbeulich@suse.com>
>
> I am lacking an extended description of this. What are the bug(s),
> and what is the situation ?
Still WIP and on my TODO list. In addition to Jan's report, there is a
separate report from Boris against Solaris. Also I need to revert a
patch of mine from early in the release and do the same thing differently.
Bugs are "VMs which boot on earlier releases don't boot on 4.15 at the
moment".
~Andrew
* Re: [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs
2021-02-04 12:20 ` Andrew Cooper
@ 2021-02-04 15:15 ` Ian Jackson
0 siblings, 0 replies; 11+ messages in thread
From: Ian Jackson @ 2021-02-04 15:15 UTC (permalink / raw)
To: Andrew Cooper
Cc: committers, xen-devel, Dario Faggioli, Jan Beulich, Julien Grall,
community.manager
Andrew Cooper writes ("Re: [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs"):
> On 04/02/2021 12:12, Ian Jackson wrote:
> > OPEN ISSUES
> > -----------
> >
> > A. HPET/PIT issue on newer Intel systems
...
> > I think Andrew is still on the case here.
>
> Fixed. c/s e1de4c196a from a week ago.
>
> > C. Fallout from MSR handling behavioral change.
...
> Still WIP and on my TODO list. In addition to Jan's report, there is a
> separate report from Boris against Solaris. Also I need to revert a
> patch of mine from early in the release and do the same thing differently.
>
> Bugs are "VMs which boot on earlier releases don't boot on 4.15 at the
> moment".
Thanks for this information, which I have noted.
Ian.
* Re: [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs
2021-02-04 12:12 ` [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs Ian Jackson
2021-02-04 12:20 ` Andrew Cooper
@ 2021-02-04 14:20 ` Dario Faggioli
2021-02-04 15:00 ` Tamas K Lengyel
2021-02-04 15:12 ` Ian Jackson
2021-02-04 14:30 ` Jan Beulich
2021-02-05 15:33 ` Jan Beulich
3 siblings, 2 replies; 11+ messages in thread
From: Dario Faggioli @ 2021-02-04 14:20 UTC (permalink / raw)
To: Ian Jackson, committers, xen-devel
Cc: Andrew Cooper, Jan Beulich, Julien Grall, community.manager
On Thu, 2021-02-04 at 12:12 +0000, Ian Jackson wrote:
> B. "scheduler broken" bugs.
>
> Information from
> Andrew Cooper <andrew.cooper3@citrix.com>
> Dario Faggioli <dfaggioli@suse.com>
>
> Quoting Andrew Cooper
> > We've had 4 or 5 reports of Xen not working, and very little
> > investigation into what's going on. Suspicion is that there might be
> > two bugs, one with smt=0 on recent AMD hardware, and one more
> > general "some workloads cause negative credit" and might or might
> > not be specific to credit2 (debugging feedback differs - also might
> > be 3 underlying issues).
>
> I reviewed a thread about this and it is not clear to me where we are
> with this.
>
Ok, let me try to summarize the current status.
- BUG: credit=sched2 machine hang when using DRAKVUF
https://lists.xen.org/archives/html/xen-devel/2020-05/msg01985.html
https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01561.html
https://bugzilla.opensuse.org/show_bug.cgi?id=1179246
99% sure that it's a Credit2 scheduler issue.
I'm actively working on it.
"Seems a tricky one; I'm still in the analysis phase"
Manifests only with certain combinations of hardware and workload.
I can't reproduce it myself, but there are multiple reports of it (see
above). I'm investigating and trying to come up with at least some
debug patches that one of the reporters should be able and willing to
test.
- Null scheduler and vwfi native problem
https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg01634.html
RCU issues, but manifests due to scheduler behavior (especially
NULL scheduler, especially on ARM).
I'm actively working on it.
Patches that should solve the issue for ARM have been posted already. They
will need to be slightly adjusted to cover x86 as well. Waiting a
couple days more for a confirmation from the reporter that the
patches do help, at least on ARM.
- Xen crash after S3 suspend - Xen 4.13
https://lists.xen.org/archives/html/xen-devel/2020-03/msg01251.html
https://lists.xen.org/archives/html/xen-devel/2021-01/msg02620.html
S3 suspend issue, but root cause seems to be in the scheduler.
Marek is, as usual, providing good info and feedback. It comes as
third in my list (below the two above, basically), but I will look
into it.
- Ryzen 4000 (Mobile) Softlocks/Micro-stutters
https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg00966.html
It seems this could be scheduling-related, but the amount of info is limited.
What we know is that with `dom0_max_vcpus=1 dom0_vcpus_pin`, all
schedulers seem to work fine. Without those params, Credit2 is the
"least bad", although not satisfactory. Other schedulers don't even
boot.
Fact is, it is reported to occur on QubesOS, which has its own
downstream patches, plus there are no logs.
There's a feeling that this (together with others) hints at SMT off
having issues on AMD (Ryzen?), but again, it's not crystal clear to
me whether this is the issue (or an issue at all) and, if yes, in
what subsystem the problem lies.
I can try to have a look, mostly for trying to understand whether or
not it is really the case that some AMDs have issues with SMT=off.
But that probably will be after I'm done with the other issues
I've mentioned before (above) this one.
- Recent upgrade of 4.13 -> 4.14 issue
https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01800.html
In my judgment, it's not at all clear whether or not this is a
scheduler issue. And at least with the amount of info that we have
so far, I'd lean toward "no, it's not". I'm happy to help with it
anyway, of course, but it comes after the others.
So, Ian, was this of any help?
If not, help me understand how I can help you. :-P
Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
* Re: [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs
2021-02-04 14:20 ` Dario Faggioli
@ 2021-02-04 15:00 ` Tamas K Lengyel
2021-02-04 18:22 ` Dario Faggioli
2021-02-04 15:12 ` Ian Jackson
1 sibling, 1 reply; 11+ messages in thread
From: Tamas K Lengyel @ 2021-02-04 15:00 UTC (permalink / raw)
To: Dario Faggioli
Cc: Ian Jackson, Committers, Xen-devel, Andrew Cooper, Jan Beulich,
Julien Grall, community.manager
On Thu, Feb 4, 2021 at 9:21 AM Dario Faggioli <dfaggioli@suse.com> wrote:
>
> On Thu, 2021-02-04 at 12:12 +0000, Ian Jackson wrote:
> > B. "scheduler broken" bugs.
> >
> > Information from
> > Andrew Cooper <andrew.cooper3@citrix.com>
> > Dario Faggioli <dfaggioli@suse.com>
> >
> > Quoting Andrew Cooper
> > > We've had 4 or 5 reports of Xen not working, and very little
> > > investigation into what's going on. Suspicion is that there might be
> > > two bugs, one with smt=0 on recent AMD hardware, and one more
> > > general "some workloads cause negative credit" and might or might
> > > not be specific to credit2 (debugging feedback differs - also might
> > > be 3 underlying issues).
> >
> > I reviewed a thread about this and it is not clear to me where we are
> > with this.
> >
> Ok, let me try to summarize the current status.
>
> - BUG: credit=sched2 machine hang when using DRAKVUF
>
> https://lists.xen.org/archives/html/xen-devel/2020-05/msg01985.html
> https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01561.html
> https://bugzilla.opensuse.org/show_bug.cgi?id=1179246
>
> 99% sure that it's a Credit2 scheduler issue.
> I'm actively working on it.
> "Seems a tricky one; I'm still in the analysis phase"
>
> Manifests only with certain combinations of hardware and workload.
> I can't reproduce it myself, but there are multiple reports of it (see
> above). I'm investigating and trying to come up with at least some
> debug patches that one of the reporters should be able and willing to
> test.
>
> - Null scheduler and vwfi native problem
>
> https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg01634.html
>
> RCU issues, but manifests due to scheduler behavior (especially
> NULL scheduler, especially on ARM).
> I'm actively working on it.
>
> Patches that should solve the issue for ARM have been posted already. They
> will need to be slightly adjusted to cover x86 as well. Waiting a
> couple days more for a confirmation from the reporter that the
> patches do help, at least on ARM.
>
I've run into null-scheduler causing CPU lockups as well on x86.
Required physical machine reboot. Seems to be triggered with domain
destruction when destroying fork vms. Happens only intermittently.
Tamas
* Re: [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs
2021-02-04 15:00 ` Tamas K Lengyel
@ 2021-02-04 18:22 ` Dario Faggioli
0 siblings, 0 replies; 11+ messages in thread
From: Dario Faggioli @ 2021-02-04 18:22 UTC (permalink / raw)
To: Tamas K Lengyel
Cc: Ian Jackson, Committers, Xen-devel, Andrew Cooper, Jan Beulich,
Julien Grall, community.manager
On Thu, 2021-02-04 at 10:00 -0500, Tamas K Lengyel wrote:
> On Thu, Feb 4, 2021 at 9:21 AM Dario Faggioli <dfaggioli@suse.com>
> wrote:
> >
> > On Thu, 2021-02-04 at 12:12 +0000, Ian Jackson wrote:
> > > B. "scheduler broken" bugs.
> >
> > - Null scheduler and vwfi native problem
> >
> > https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg01634.html
> >
> > RCU issues, but manifests due to scheduler behavior (especially
> > NULL scheduler, especially on ARM).
> > I'm actively working on it.
> >
> > Patches that should solve the issue for ARM have been posted already. They
> > will need to be slightly adjusted to cover x86 as well. Waiting a
> > couple days more for a confirmation from the reporter that the
> > patches do help, at least on ARM.
> >
>
> I've run into null-scheduler causing CPU lockups as well on x86.
> Required physical machine reboot. Seems to be triggered with domain
> destruction when destroying fork vms. Happens only intermittently.
>
Yes, we know that it's generic and not ARM-only. It's just that on ARM
it is easier (or, I should say, deterministic) to trigger.
Thanks for reporting, though.
I'll add you in Cc when I send the updated version of the patches that
covers x86 as well, in case you want to test. :-)
Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
* Re: [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs
2021-02-04 14:20 ` Dario Faggioli
2021-02-04 15:00 ` Tamas K Lengyel
@ 2021-02-04 15:12 ` Ian Jackson
1 sibling, 0 replies; 11+ messages in thread
From: Ian Jackson @ 2021-02-04 15:12 UTC (permalink / raw)
To: Dario Faggioli
Cc: committers, xen-devel, Andrew Cooper, Jan Beulich, Julien Grall,
community.manager
Dario Faggioli writes ("Re: [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs"):
> On Thu, 2021-02-04 at 12:12 +0000, Ian Jackson wrote:
> > I reviewed a thread about this and it is not clear to me where we are
> > with this.
...
> Ok, let me try to summarize the current status.
Thanks.
> - BUG: credit=sched2 machine hang when using DRAKVUF
>
> https://lists.xen.org/archives/html/xen-devel/2020-05/msg01985.html
> https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01561.html
> https://bugzilla.opensuse.org/show_bug.cgi?id=1179246
>
> 99% sure that it's a Credit2 scheduler issue.
> I'm actively working on it.
> "Seems a tricky one; I'm still in the analysis phase"
>
> Manifests only with certain combinations of hardware and workload.
> I can't reproduce it myself, but there are multiple reports of it (see
> above). I'm investigating and trying to come up with at least some
> debug patches that one of the reporters should be able and willing to
> test.
I think this is a clear blocker for 4.15. I will call it "F".
> - Null scheduler and vwfi native problem
>
> https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg01634.html
>
> RCU issues, but manifests due to scheduler behavior (especially
> NULL scheduler, especially on ARM).
> I'm actively working on it.
>
> Patches that should solve the issue for ARM have been posted already. They
> will need to be slightly adjusted to cover x86 as well. Waiting a
> couple days more for a confirmation from the reporter that the
> patches do help, at least on ARM.
I'm not sure whether this is a blocker but it looks like it is going
to be fixed so I will keep it on my list. I will call it "G".
> - Xen crash after S3 suspend - Xen 4.13
>
> https://lists.xen.org/archives/html/xen-devel/2020-03/msg01251.html
> https://lists.xen.org/archives/html/xen-devel/2021-01/msg02620.html
>
> S3 suspend issue, but root cause seems to be in the scheduler.
>
> Marek is, as usual, providing good info and feedback. It comes as
> third in my list (below the two above, basically), but I will look
> into it.
This is not a blocker so I won't track it explicitly but I would
very much welcome a fix if it is simple or comes quickly.
> - Ryzen 4000 (Mobile) Softlocks/Micro-stutters
>
> https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg00966.html
>
> It seems this could be scheduling-related, but the amount of info is limited.
>
> What we know is that with `dom0_max_vcpus=1 dom0_vcpus_pin`, all
> schedulers seem to work fine. Without those params, Credit2 is the
> "least bad", although not satisfactory. Other schedulers don't even
> boot.
> Fact is, it is reported to occur on QubesOS, which has its own
> downstream patches, plus there are no logs.
> There's a feeling that this (together with others) hints at SMT off
> having issues on AMD (Ryzen?), but again, it's not crystal clear to
> me whether this is the issue (or an issue at all) and, if yes, in
> what subsystem the problem lies.
> I can try to have a look, mostly for trying to understand whether or
> not it is really the case that some AMDs have issues with SMT=off.
> But that probably will be after I'm done with the other issues
> I've mentioned before (above) this one.
I'm not sure whether you are saying (a) our current code is not
useable on this hardware because of this issue, or on the other hand
(b) you think the issue is specific to downstream patches ?
Do you think I should consider this a blocker for 4.15 ?
> - Recent upgrade of 4.13 -> 4.14 issue
>
> https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01800.html
>
> In my judgment, it's not at all clear whether or not this is a
> scheduler issue. And at least with the amount of info that we have
> so far, I'd lean toward "no, it's not". I'm happy to help with it
> anyway, of course, but it comes after the others.
Again, I think this is not a regression so not a blocker for 4.15.
> So, Ian, was this of any help?
Yes, very much so, thank you.
Ian.
* Re: [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs
2021-02-04 12:12 ` [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs Ian Jackson
2021-02-04 12:20 ` Andrew Cooper
2021-02-04 14:20 ` Dario Faggioli
@ 2021-02-04 14:30 ` Jan Beulich
2021-02-04 15:18 ` Ian Jackson
2021-02-05 15:33 ` Jan Beulich
3 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2021-02-04 14:30 UTC (permalink / raw)
To: Ian Jackson
Cc: Andrew Cooper, Dario Faggioli, Julien Grall, community.manager,
committers, xen-devel
On 04.02.2021 13:12, Ian Jackson wrote:
> OPEN ISSUES
> -----------
>
> A. HPET/PIT issue on newer Intel systems
> [...]
>
> B. "scheduler broken" bugs.
>
> Information from
> Andrew Cooper <andrew.cooper3@citrix.com>
> Dario Faggioli <dfaggioli@suse.com>
>
> Quoting Andrew Cooper
> | We've had 4 or 5 reports of Xen not working, and very little
> | investigation into what's going on. Suspicion is that there might be
> | two bugs, one with smt=0 on recent AMD hardware, and one more
> | general "some workloads cause negative credit" and might or might
> | not be specific to credit2 (debugging feedback differs - also might
> | be 3 underlying issues).
>
> I reviewed a thread about this and it is not clear to me where we are
> with this.
I'm not sure Marek's "Xen crash after S3 suspend - Xen 4.13 and newer"
falls in either of the two buckets.
> C. Fallout from MSR handling behavioral change.
>
> Information from
> Jan Beulich <jbeulich@suse.com>
>
> I am lacking an extended description of this. What are the bug(s),
> and what is the situation ?
>
>
> D. Use-after-free in the IOMMU code
>
> Information from
> Julien Grall <julien@xen.org>
> References
> [PATCH for-4.15 0/4] xen/iommu: Collection of bug fixes for IOMMU teadorwn
> <20201222154338.9459-1-julien@xen.org>
>
> Quoting the 0/:
> | This series is a collection of bug fixes for the IOMMU teardown code.
> | All of them are candidate for 4.15 as they can either leak memory or
> | lead to host crash/host corruption.
>
> AFAICT these patches are not yet in-tree.
(since you're continuing with E. further down)
F. The almost-XSA "x86/PV: avoid speculation abuse through guest
accessors" - the first 4 patches are needed to address the actual
issue. The next 3 patches are needed to get the tree into
a consistent state again, identifier-wise. The remaining patches
can probably wait.
> CLOSED ISSUES
> =============
>
> E. zstd support
>
> Information from
> Andrew Cooper <andrew.cooper3@citrix.com>
> Jan Beulich <jbeulich@suse.com>
> git
>
> Needed to unbreak Fedora. Needs support for both dom0 and domU.
>
> AFAICT this seems to be in-tree as of 8169f82049ef
> "libxenguest: support zstd compressed kernels"
Indeed.
Jan
* Re: [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs
2021-02-04 14:30 ` Jan Beulich
@ 2021-02-04 15:18 ` Ian Jackson
0 siblings, 0 replies; 11+ messages in thread
From: Ian Jackson @ 2021-02-04 15:18 UTC (permalink / raw)
To: Jan Beulich
Cc: Andrew Cooper, Dario Faggioli, Julien Grall, community.manager,
committers, xen-devel
Jan Beulich writes ("Re: [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs"):
> On 04.02.2021 13:12, Ian Jackson wrote:
> > OPEN ISSUES
> > -----------
...
> > I reviewed a thread about this and it is not clear to me where we are
> > with this.
>
> I'm not sure Marek's "Xen crash after S3 suspend - Xen 4.13 and newer"
> falls in either of the two buckets.
I think this is not a regression, though ? See my reply to Dario.
Unless it is worse in 4.15 than earlier releases I'm not inclined to
see it as a blocker.
> (since you're continuing with E. further down)
>
> F. The almost-XSA "x86/PV: avoid speculation abuse through guest
> accessors" - the first 4 patches are needed to address the actual
> issue. The next 3 patches are needed to get the tree into
> a consistent state again, identifier-wise. The remaining patches
> can probably wait.
Thanks. I have made a note of this.
I have to allocate the letters or it'll be chaos :-). I'm calling
this I.
Ian.
* Re: [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs
2021-02-04 12:12 ` [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs Ian Jackson
` (2 preceding siblings ...)
2021-02-04 14:30 ` Jan Beulich
@ 2021-02-05 15:33 ` Jan Beulich
3 siblings, 0 replies; 11+ messages in thread
From: Jan Beulich @ 2021-02-05 15:33 UTC (permalink / raw)
To: Ian Jackson
Cc: Andrew Cooper, Dario Faggioli, Julien Grall, community.manager,
committers, xen-devel
On 04.02.2021 13:12, Ian Jackson wrote:
> Although there are a few things outstanding, we are now firmly into
> the bugfixing phase of the Xen 4.15 release.
>
> I searched my email (and my memory) and found four open blockers which
> I have listed below, and one closed blocker.
>
> I feel there are probably more issues out there, so please let me
> know, in response to this mail, of any other significant bugs you are
> aware of.
>
> Ian.
>
>
> OPEN ISSUES
> -----------
Roger has just pointed out to me that I should probably ask for
"x86/time: calibration rendezvous adjustments" to also be tracked
here. Though just to clarify - the bad behavior has been there
for a (long) while, so this isn't like a recent regression.
Jan