On Thu, 2021-01-14 at 19:02 +0000, Andrew Cooper wrote:
> On 14/01/2021 16:06, Ian Jackson wrote:
> > The last posting date for new feature patches for Xen 4.15 is
> > tomorrow. [1]  We seem to be getting a reasonably good flood of
> > stuff trying to meet this deadline :-).
> >
> > Patches for new features posted after tomorrow will be deferred
> > to the next Xen release after 4.15.  NB the primary
> > responsibility for driving a feature's progress to meet the
> > release schedule lies with the feature's proponent(s).
> >
> >   As a reminder, here is the release schedule:
> > + (unchanged information indented with spaces):
> >
> >    Friday 15th January    Last posting date
> >
> >        Patches adding new features should be posted to the
> >        mailing list by this date, although perhaps not in their
> >        final version.
> >
> >    Friday 29th January    Feature freeze
> >
> >        Patches adding new features should be committed by this
> >        date.  Straightforward bugfixes may continue to be
> >        accepted by maintainers.
> >
> >    Friday 12th February **tentative**   Code freeze
> >
> >        Bugfixes only, all changes to be approved by the Release
> >        Manager.
> >
> >    Week of 12th March **tentative**    Release
> >        (probably Tuesday or Wednesday)
> >
> >   Any patches containing substantial refactoring are to be
> >   treated as new features, even if the intent is to fix bugs.
> >
> >   Freeze exceptions will not be routine, but may be granted in
> >   exceptional cases for small changes on the basis of risk
> >   assessment.  Large series will not get exceptions.
> >   Contributors *must not* rely on getting, or expect, a freeze
> >   exception.
> >
> > + New or improved tests (supposing they do not involve
> > + refactoring, or even build system reorganisation), and
> > + documentation improvements, will generally be treated as
> > + bugfixes.
> >
> >   The code freeze and release dates are provisional and will be
> >   adjusted in the light of apparent code quality etc.
> >
> >   If as a feature proponent you feel your feature is at risk and
> >   there is something the Xen Project could do to help, please
> >   consult me or the Community Manager.  In such situations please
> >   reach out earlier rather than later.
> >
> > In my last update I asked this:
> >
> > > If you are working on a feature you want in 4.15 please let me
> > > know about it.  Ideally I'd like a little stanza like this:
> > >
> > > S: feature name
> > > O: feature owner (proponent) name
> > > E: feature owner (proponent) email address
> > > P: your current estimate of the probability of it making 4.15,
> > >    as a %age
> > >
> > > But free-form text is OK too.  Please reply to this mail.
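[As an illustration, a filled-in stanza would look like the
following; the feature and owner names are taken from the update
mentioned below, while the email address and the percentage are
made-up placeholders:

    S: IOREQ feature (+ virtio-mmio) on Arm
    O: Oleksandr Tyshchenko
    E: oleksandr@example.org
    P: 80%
]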
> > I received one mail.  Thanks to Oleksandr Andrushchenko for his
> > update on the following feature:
> >
> >   IOREQ feature (+ virtio-mmio) on Arm
> >   https://www.mail-archive.com/xen-devel@lists.xenproject.org/msg87002.html
> >
> >   Julien Grall
> >   Oleksandr Tyshchenko
> >
> > I see that V4 of this series was just posted.  Thanks, Oleksandr.
> > I'll make a separate enquiry about your series.
> >
> > I think if people don't find the traditional feature tracking
> > useful, I will try to assemble Release Notes information later,
> > during the freeze, when fewer people are rushing to try to meet
> > the deadlines.
>
> (Now I have working email).
>
> Features:
>
> 1) acquire_resource fixes.
>
> Not really a new feature - entirely bugfixing a preexisting one.
> Developed by me to help 2).  Reasonably well acked, but awaiting
> feedback on v3.
>
> 2) External Processor Trace support.
>
> Developed by Michał.  Depends on 1), and awaiting a new version
> being posted.
>
> As far as I'm aware, both Intel and CERT have production systems
> deployed using this functionality, so it is very highly desirable
> to get it into 4.15.
>
> 3) Initial Trenchboot+SKINIT support.
>
> I've got two patches I need to clean up and submit, which form the
> first part of the Trenchboot + Dynamic Root of Trust on AMD
> support.  This will get Xen into a position where it can be
> started via the new grub "secure_launch" protocol.
>
> Later patches (i.e. post 4.15) will add support for Intel TXT
> (i.e. without tboot), as well as the common infrastructure for the
> TPM event log and further measurements during the boot process.
>
> 4) "simple" autotest support.
>
> Bugs:
>
> 1) HPET/PIT issue on newer Intel systems.  This has had literally
> tens of reports across the devel and users mailing lists, and
> prevents Xen from booting at all on the past two generations of
> Intel laptops.  I've finally got a repro and posted a fix to the
> list, but it is still in progress.
>
> 2) "scheduler broken" bugs.  We've had 4 or 5 reports of Xen not
> working, and very little investigation into what's going on.  The
> suspicion is that there might be two bugs: one with smt=0 on
> recent AMD hardware, and a more general one where "some workloads
> cause negative credit", which might or might not be specific to
> credit2 (debugging feedback differs - there might also be three
> underlying issues).

Yep, so, let's try to summarize/collect the ones I think you may be
referring to:

1) There is one report about Credit2 not working while Credit1 was
fine.  It's this one:
https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01561.html

It's the one where it somehow happens that one or more vCPUs manage
to run for a really, really long timeslice, much longer than the
scheduler would have allowed them to, and this causes problems.
_If_ that's it, my investigation so far seems to show that this
happens despite the scheduler code trying to enforce (via timers)
the proper timeslice limits and, when it happens, it makes the
scheduler very unhappy.  I've seen reports of it occurring both on
Credit and Credit2, but Credit2 definitely seems to be more
sensitive to it.  I've actually been trying to track it down for a
while now, but I can't easily reproduce it, so it's proving to be
challenging.

2) Then there has been this one:
https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01005.html

Here, the reporter said that "[credit1] results in an observable
delay, unusable performance; credit2 seems to be the only usable
scheduler".  This is the one that Andrew also mentions, happening
on Ryzen and with SMT disabled (as this is on QubesOS, IIRC).
Here, booting with "dom0_max_vcpus=1 dom0_vcpus_pin" seemed to
mitigate the problem but, of course, with obvious limitations.  I
don't have a Ryzen handy, but I have a Zen and a Zen2.  I checked
there and again could not reproduce it (although what I tried was
upstream Xen, not QubesOS).
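For anyone wanting to try that mitigation, it amounts to booting
Xen with something like the following (just a sketch: the exact
file location and variable name depend on the distro's GRUB
packaging):

    # /etc/default/grub (location and variable name vary by distro)
    GRUB_CMDLINE_XEN_DEFAULT="dom0_max_vcpus=1 dom0_vcpus_pin"

    # then regenerate the GRUB configuration, e.g.:
    #   grub2-mkconfig -o /boot/grub2/grub.cfg

That gives dom0 a single vCPU, pinned to one physical CPU, which is
why it is only a workaround and not a real fix.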
3) Then I recall this one:
https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01800.html

This also started as a "scheduler, probably Credit2" bug.  But it
then turned out to manifest on both Credit1 and Credit2, and it
started to happen on 4.14 while it was not there in 4.13...  And
nothing major changed in scheduling between these two releases, I
think.  During the analysis we thought we had identified a
livelock, but then could not pinpoint what exactly was going on.
Oh, and then it was also discovered that Credit2 + PVH dom0 seemed
to be a working configuration, and it's weird for a scheduling
issue to have a (dom0) domain type dependency, I think.  But that
could be anything really... and I'm sure happy to keep digging.

4) There's the NULL scheduler + ARM + vwfi=native issue:
https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg01634.html

This looks like something that we saw before, but which remained
unfixed, although not exactly like that.  If it's that one, the
analysis is done and we're working on a patch.  If it's something
else, or even something similar but slightly different... well,
we'll have to see when we have the patch.

5) We're also dealing with this bug report, although it is filed
against Xen 4.13 (openSUSE's packaged version of it):
https://bugzilla.opensuse.org/show_bug.cgi?id=1179246

This is again on recent AMD hardware, and here "dom0_max_vcpus=4
dom0_vcpus_pin" works ok, but only until a (Windows) HVM guest is
started.  When that happens, we have crashes/hangs.  If the guests
are PV, things are apparently fine.  If the HVM guests use a
different set of CPUs than dom0 (e.g., vm.cpumask="4-63" in
xl.conf), things are fine as well.  Again, a scheduler issue and a
scheduling algorithm dependency were theorized and will be
investigated (if the user can come back with answers, which may
take some time, as explained in the report).  The different
behavior with different kinds of guests is a little weird for an
issue of this kind, IME, but let's see.

6) If we want, we can include this too (hopefully just for
reference):
https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg01376.html

Indeed the symptoms were similar, such as hanging during boot, but
all is fine with dom0_max_vcpus=1.  However, Jan is currently
investigating this one, and they're heading toward problems with
TSC reliability reporting and rendezvous, but let's see.

Did I forget any?

As for "the plan", I am currently working on 4 (trying to come up
with a patch that fixes it) and on 1 (trying to come up with a way
to track down and uncover what I believe is the real issue).

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)