[Adding George, since it's scheduling]

On Mon, 2021-03-15 at 12:18 +0000, Ian Jackson wrote:
> OPEN ISSUES AND BLOCKERS
> ========================
>
> [...]
>
> SCHEDULER ISSUES NOT MAKING PROCESS ?
> -------------------------------------
>
Yeah... let's try.

> BUG: credit=sched2 machine hang when using DRAKVUF
>
> Information from
>   Dario Faggioli
> References
>   https://lists.xen.org/archives/html/xen-devel/2020-05/msg01985.html
>   https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01561.html
>   https://bugzilla.opensuse.org/show_bug.cgi?id=1179246
>
So, this is mostly about the third issue, the one described in the
openSUSE bug, which was however also reported here, by different
people.

As I've just written there (on the bug), I've been working on trying to
reproduce the problem on a variety of different machines. Since AMD
seemed to be the most impacted, I've lately focused on hardware from
that vendor.

I have been, however, unable to re-create a situation where the
symptoms described in the reports occur. I specifically looked for
hardware that was the same, or similar enough, and I replayed the dom0
vcpu pinning configuration and the creation of domUs, both PV and HVM,
but the problem did not show up for me.

The only difference between what I've done so far and what is
described, e.g., in the bug is that I've not been able to check Windows
guests yet. (I'll try that as soon as I can, but if this really were a
scheduling issue, which OS runs in the guest should not matter much, I
think.)

Code inspection for something that comes from and/or affects the
scheduler and is both:
- CPU-vendor specific, and
- guest-type specific
also led me pretty much nowhere.

I produced a debug patch (I attach two versions of it, one for staging
and one for v4.13.2) that should help me tell whether or not the
scheduler is being invoked every time it should be, and whether or not
there are vcpus that manage to run for longer than the scheduler wants
them to.

But as you can imagine, a debug patch is not really helpful if it can't
be used within the scenario it is meant to debug, i.e., without a
reproducer.

I did manage to find an actual bug in Credit2, but that's totally
unrelated to the problem at hand (and it will hence be discussed in
another email).

So, that's the status. I definitely was hoping for things to be better
at this point of the release cycle. Sorry they're not. And of course
I'll keep digging, but unless I find a way to reproduce, I don't expect
a big breakthrough. :-/

> G. Null scheduler and vwfi native problem
>
> Information from
>   Dario Faggioli
>
> References
>   https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg01634.html
>
> Quoting Dario:
> > RCU issues, but manifests due to scheduler behavior (especially
> > NULL scheduler, especially on ARM).
> >
> > Patches that should solve the issue for ARM posted already. They
> > will need to be slightly adjusted to cover x86 as well.
>
> As of last update from Dario 29.1.21:
> waiting for test report from submitter.
>
For this, I made progress toward an actual patch that works for both
ARM and x86, but I've been sidetracked by a number of things, and have
not finished it.

The ARM-only fix has been tested successfully and is ready already. The
full solution may not be ready in time for 4.15.

So, I'd say we can either merge the ARM part now (ARM is where the
issue manifests most of the time and most severely) or wait for a full
solution during 4.16 development, which we will then backport.
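
Going back to the Credit2 debug patch for a moment: to make the second
check it is meant to perform (vcpus running for longer than the
scheduler wants) a bit more concrete, here is a minimal offline sketch.
This is not the attached patch and not hypervisor code. It assumes that
per-vcpu run intervals have somehow been collected as (vcpu, start, end)
timestamps; RunInterval, find_overruns and the 10 ms limit are all
hypothetical names and values chosen for illustration, not Credit2
parameters.

```python
from dataclasses import dataclass


@dataclass
class RunInterval:
    vcpu: str       # hypothetical vcpu label, e.g. "d1v0"
    start_ns: int   # time at which the vcpu was scheduled in
    end_ns: int     # time at which the vcpu was descheduled


def find_overruns(intervals, max_slice_ns, tolerance_ns=0):
    """Return (interval, excess_ns) for every interval that ran longer
    than max_slice_ns + tolerance_ns."""
    overruns = []
    for iv in intervals:
        ran_ns = iv.end_ns - iv.start_ns
        excess_ns = ran_ns - (max_slice_ns + tolerance_ns)
        if excess_ns > 0:
            overruns.append((iv, excess_ns))
    return overruns


# Made-up trace: the second interval runs for 42 ms against a 10 ms limit.
trace = [
    RunInterval("d1v0", 0, 8_000_000),
    RunInterval("d2v0", 8_000_000, 50_000_000),
    RunInterval("d1v0", 50_000_000, 60_000_000),
]

for iv, excess_ns in find_overruns(trace, max_slice_ns=10_000_000):
    print(f"{iv.vcpu} overran its slice by {excess_ns} ns")
```

An interval that runs for exactly the limit is not flagged, and
tolerance_ns is there only so that small, expected delays in
descheduling do not drown out the interesting cases.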

Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/