On 17.02.21 09:12, Roman Shaposhnik wrote:
> Hi Jürgen, thanks for taking a look at this. A few comments below:
>
> On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß wrote:
>>
>> On 16.02.21 21:34, Stefano Stabellini wrote:
>>> + x86 maintainers
>>>
>>> It looks like the tlbflush is getting stuck?
>>
>> I have seen this case multiple times on customer systems now, but
>> reproducing it reliably seems to be very hard.
>
> It is reliably reproducible under my workload, but it takes a long
> time (~3 days of the workload running in the lab).

This is by far the best reproduction rate I have seen up to now. The
next best reproducer seems to be a huge installation with several
hundred hosts and thousands of VMs, seeing about one crash per week.

>
>> I suspected fifo events to be to blame, but just yesterday I was
>> informed of another case with fifo events disabled in the guest.
>>
>> One common pattern seems to be that up to now I have seen this
>> effect only on systems with Intel Gold CPUs. Can this be confirmed
>> for your case, too?
>
> I am pretty sure mine isn't -- I can get you full CPU specs if
> that's useful.

Just the output of "grep model /proc/cpuinfo" should be enough.

>
>> In case anybody has a reproducer (either in a guest or dom0) with a
>> setup where a diagnostic kernel can be used, I'd be _very_
>> interested!
>
> I can easily add things to Dom0 and DomU. Whether that will disrupt
> the experiment is, of course, another matter. Still, please let me
> know what would be helpful to do.

Is there a chance to switch to an upstream kernel in the guest? I'd
like to add some diagnostic code to the kernel, and creating the
patches will be easier this way.
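As a rough illustration of the kind of diagnostic code meant here, the
sketch below warns when a CPU spins "too long" waiting for the other
CPUs to acknowledge the IPI behind a remote TLB flush. Everything in
it is an assumption rather than an actual patch: the helper name
wait_for_ipi_ack, the ack counter supplied by a hypothetical caller,
and the 5 second threshold.

    #include <linux/atomic.h>
    #include <linux/ktime.h>
    #include <linux/printk.h>
    #include <linux/smp.h>

    /*
     * Sketch only, not an actual patch: spin until all target CPUs
     * have bumped the ack counter, and complain with a stack trace
     * if that takes longer than 5 seconds.
     */
    static void wait_for_ipi_ack(atomic_t *ack, int cpus)
    {
            u64 start = ktime_get_ns();

            while (atomic_read(ack) < cpus) {
                    if (ktime_get_ns() - start > 5ULL * NSEC_PER_SEC) {
                            pr_alert("CPU%d: only %d of %d IPI acks after 5s\n",
                                     smp_processor_id(),
                                     atomic_read(ack), cpus);
                            dump_stack();
                            /* re-arm so the warning repeats while stuck */
                            start = ktime_get_ns();
                    }
                    cpu_relax();
            }
    }

Re-arming the timestamp keeps the warning firing periodically instead
of only once, which helps when the wait eventually resolves on its own
after minutes rather than hanging forever.

Juergen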