On 17.02.21 09:12, Roman Shaposhnik wrote:
> Hi Jürgen, thanks for taking a look at this. A few comments below:
>
> On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß wrote:
>>
>> On 16.02.21 21:34, Stefano Stabellini wrote:
>>> + x86 maintainers
>>>
>>> It looks like the tlbflush is getting stuck?
>>
>> I have seen this case multiple times on customer systems now, but
>> reproducing it reliably seems to be very hard.
>
> It is reliably reproducible under my workload, but it takes a long
> time (~3 days of the workload running in the lab).

This is by far the best reproduction rate I have seen up to now. The
next best reproducer seems to be a huge installation with several
hundred hosts and thousands of VMs, seeing about one crash per week.

>
>> I suspected fifo events to be to blame, but just yesterday I was
>> informed of another case with fifo events disabled in the guest.
>>
>> One common pattern seems to be that up to now I have seen this
>> effect only on systems with Intel Gold CPUs. Can this be confirmed
>> for your case, too?
>
> I am pretty sure mine isn't -- I can get you full CPU specs if
> that's useful.

Just the output of "grep model /proc/cpuinfo" should be enough.

>
>> In case anybody has a reproducer (either in a guest or dom0) with a
>> setup where a diagnostic kernel can be used, I'd be _very_
>> interested!
>
> I can easily add things to Dom0 and DomU. Whether that will disrupt
> the experiment is, of course, another matter. Still, please let me
> know what would be helpful to do.

Is there a chance to switch to an upstream kernel in the guest? I'd
like to add some diagnostic code to the kernel, and creating the
patches will be easier this way.
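As a rough illustration of the kind of diagnostic code meant here, the
sketch below warns when a CPU spins "too long" waiting for the other
CPUs to acknowledge the IPI behind a remote TLB flush. Everything in
it is an assumption rather than an actual patch: the helper name
wait_for_ipi_ack, the ack counter supplied by a hypothetical caller,
and the 5 second threshold.

    #include <linux/atomic.h>
    #include <linux/ktime.h>
    #include <linux/printk.h>
    #include <linux/smp.h>

    /*
     * Sketch only, not an actual patch: spin until all target CPUs
     * have bumped the ack counter, and complain with a stack trace
     * if that takes longer than 5 seconds.
     */
    static void wait_for_ipi_ack(atomic_t *ack, int cpus)
    {
            u64 start = ktime_get_ns();

            while (atomic_read(ack) < cpus) {
                    if (ktime_get_ns() - start > 5ULL * NSEC_PER_SEC) {
                            pr_alert("CPU%d: only %d of %d IPI acks after 5s\n",
                                     smp_processor_id(),
                                     atomic_read(ack), cpus);
                            dump_stack();
                            /* re-arm so the warning repeats while stuck */
                            start = ktime_get_ns();
                    }
                    cpu_relax();
            }
    }

Re-arming the timestamp keeps the warning firing periodically instead
of only once, which helps when the wait eventually resolves on its own
after minutes rather than hanging forever.

Juergen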