On Thu, 2021-02-04 at 12:12 +0000, Ian Jackson wrote:
> B. "scheduler broken" bugs.
>
> Information from
>   Andrew Cooper
>   Dario Faggioli
>
> Quoting Andrew Cooper
> > We've had 4 or 5 reports of Xen not working, and very little
> > investigation on whats going on.  Suspicion is that there might be
> > two bugs, one with smt=0 on recent AMD hardware, and one more
> > general "some workloads cause negative credit" and might or might
> > not be specific to credit2 (debugging feedback differs - also might
> > be 3 underlying issue).
>
> I reviewed a thread about this and it is not clear to me where we are
> with this.
>
Ok, let me try to summarize the current status.

- BUG: credit=sched2 machine hang when using DRAKVUF

  https://lists.xen.org/archives/html/xen-devel/2020-05/msg01985.html
  https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01561.html
  https://bugzilla.opensuse.org/show_bug.cgi?id=1179246

  99% sure that it's a Credit2 scheduler issue.
  I'm actively working on it.
  "Seems a tricky one; I'm still in the analysis phase"

  Manifests only with certain combinations of hardware and workload.
  I can't reproduce it myself, but there are multiple reports of it
  (see above).

  I'm investigating and trying to come up with, at least, debug
  patches that one of the reporters should be able and willing to
  test.

- Null scheduler and vwfi native problem

  https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg01634.html

  RCU issues, but they manifest due to scheduler behavior (especially
  with the NULL scheduler, especially on ARM).
  I'm actively working on it.

  Patches that should solve the issue for ARM are posted already. They
  will need to be slightly adjusted to cover x86 as well. I'm waiting
  a couple more days for confirmation from the reporter that the
  patches do help, at least on ARM.

- Xen crash after S3 suspend - Xen 4.13

  https://lists.xen.org/archives/html/xen-devel/2020-03/msg01251.html
  https://lists.xen.org/archives/html/xen-devel/2021-01/msg02620.html

  S3 suspend issue, but the root cause seems to be in the scheduler.
  Marek is, as usual, providing good info and feedback.

  It comes third on my list (below the two above, basically), but I
  will look into it.

- Ryzen 4000 (Mobile) Softlocks/Micro-stutters

  https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg00966.html

  Seems it could be scheduling-related, but the amount of info is
  limited.

  What we know is that with `dom0_max_vcpus=1 dom0_vcpus_pin`, all
  schedulers seem to work fine. Without those params, Credit2 is the
  "least bad", although not satisfactory. Other schedulers don't even
  boot.

  Fact is, it is reported to occur on QubesOS, which has its own
  downstream patches, plus there are no logs.

  There's a feeling that this (together with others) hints at SMT off
  having issues on AMD (Ryzen?), but again, it's not crystal clear to
  me whether this is the issue (or an issue at all) and, if yes, in
  what subsystem the problem lies.

  I can try to have a look, mostly to understand whether or not it is
  really the case that some AMDs have issues with SMT=off. But that
  will probably be after I'm done with the other issues I've mentioned
  above.

- Recent upgrade of 4.13 -> 4.14 issue

  https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01800.html

  In my judgment, it's not at all clear whether or not this is a
  scheduler issue. And, at least with the amount of info that we have
  so far, I'd lean toward "no, it's not". I'm happy to help with it
  anyway, of course, but it comes after the others.

So, Ian, was this of any help? If not, help me understand how I can
help you.
:-P

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/