Re: Linux DomU freezes and dies under heavy memory shuffling

From: "Jürgen Groß" <jgross@suse.com>
To: Roman Shaposhnik <roman@zededa.com>
Cc: "Stefano Stabellini" <sstabellini@kernel.org>,
	Xen-devel <xen-devel@lists.xenproject.org>,
	"Jan Beulich" <jbeulich@suse.com>,
	"Andrew Cooper" <andrew.cooper3@citrix.com>,
	"Roger Pau Monné" <roger.pau@citrix.com>, "Wei Liu" <wl@xen.org>,
	"George Dunlap" <george.dunlap@citrix.com>
Subject: Re: Linux DomU freezes and dies under heavy memory shuffling
Date: Sat, 13 Mar 2021 08:18:04 +0100	[thread overview]
Message-ID: <a283c8a6-96ef-870e-095a-0b7adacb34a0@suse.com> (raw)
In-Reply-To: <CAMmSBy_0zCa1D5dpw4VFAcJwSiE6RAQoBk5vAJzW1ZPk5Zaxww@mail.gmail.com>

[-- Attachment #1.1.1: Type: text/plain, Size: 7225 bytes --]

On 12.03.21 22:33, Roman Shaposhnik wrote:
> Hi Jürgen,
> 
> just wanted to give you (and everyone who may be keeping an eye on
> this) an update.
> 
> Somehow, after applying your kernel patch -- the VM is now running 10
> days+ without a problem.

Can you check the kernel console messages, please? There are messages
printed when a potential hang is detected, and the hanging cpu is
tried to be woken up via another interrupt again.

Look for messages containing "csd", so e.g. do

dmesg | grep csd

in the VM.

Thanks,

Juergen

> 
> I'll keep experimenting (A/B-testing style) but at this point I'm
> actually pretty perplexed as to why this patch would make a difference
> (since it is basically just for observability). Any thoughts on that?
> 
> Thanks,
> Roman.
> 
> On Wed, Feb 24, 2021 at 7:06 PM Roman Shaposhnik <roman@zededa.com> wrote:
>>
>> Hi Jürgen!
>>
>> sorry for the belated reply -- I wanted to externalize the VM before I
>> do -- but let me at least reply to you:
>>
>> On Tue, Feb 23, 2021 at 5:17 AM Jürgen Groß <jgross@suse.com> wrote:
>>>
>>> On 18.02.21 06:21, Roman Shaposhnik wrote:
>>>> On Wed, Feb 17, 2021 at 12:29 AM Jürgen Groß <jgross@suse.com
>>>> <mailto:jgross@suse.com>> wrote:
>>>>
>>>>      On 17.02.21 09:12, Roman Shaposhnik wrote:
>>>>       > Hi Jürgen, thanks for taking a look at this. A few comments below:
>>>>       >
>>>>       > On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß <jgross@suse.com
>>>>      <mailto:jgross@suse.com>> wrote:
>>>>       >>
>>>>       >> On 16.02.21 21:34, Stefano Stabellini wrote:
>>>>       >>> + x86 maintainers
>>>>       >>>
>>>>       >>> It looks like the tlbflush is getting stuck?
>>>>       >>
>>>>       >> I have seen this case multiple times on customer systems now, but
>>>>       >> reproducing it reliably seems to be very hard.
>>>>       >
>>>>       > It is reliably reproducible under my workload but it take a long time
>>>>       > (~3 days of the workload running in the lab).
>>>>
>>>>      This is by far the best reproduction rate I have seen up to now.
>>>>
>>>>      The next best reproducer seems to be a huge installation with several
>>>>      hundred hosts and thousands of VMs with about 1 crash each week.
>>>>
>>>>       >
>>>>       >> I suspected fifo events to be blamed, but just yesterday I've been
>>>>       >> informed of another case with fifo events disabled in the guest.
>>>>       >>
>>>>       >> One common pattern seems to be that up to now I have seen this
>>>>      effect
>>>>       >> only on systems with Intel Gold cpus. Can it be confirmed to be true
>>>>       >> in this case, too?
>>>>       >
>>>>       > I am pretty sure mine isn't -- I can get you full CPU specs if
>>>>      that's useful.
>>>>
>>>>      Just the output of "grep model /proc/cpuinfo" should be enough.
>>>>
>>>>
>>>> processor: 3
>>>> vendor_id: GenuineIntel
>>>> cpu family: 6
>>>> model: 77
>>>> model name: Intel(R) Atom(TM) CPU  C2550  @ 2.40GHz
>>>> stepping: 8
>>>> microcode: 0x12d
>>>> cpu MHz: 1200.070
>>>> cache size: 1024 KB
>>>> physical id: 0
>>>> siblings: 4
>>>> core id: 3
>>>> cpu cores: 4
>>>> apicid: 6
>>>> initial apicid: 6
>>>> fpu: yes
>>>> fpu_exception: yes
>>>> cpuid level: 11
>>>> wp: yes
>>>> flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
>>>> pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp
>>>> lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
>>>> nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est
>>>> tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer
>>>> aes rdrand lahf_lm 3dnowprefetch cpuid_fault epb pti ibrs ibpb stibp
>>>> tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida
>>>> arat md_clear
>>>> vmx flags: vnmi preemption_timer invvpid ept_x_only flexpriority
>>>> tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
>>>> bugs: cpu_meltdown spectre_v1 spectre_v2 mds msbds_only
>>>> bogomips: 4800.19
>>>> clflush size: 64
>>>> cache_alignment: 64
>>>> address sizes: 36 bits physical, 48 bits virtual
>>>> power management:
>>>>
>>>>       >
>>>>       >> In case anybody has a reproducer (either in a guest or dom0) with a
>>>>       >> setup where a diagnostic kernel can be used, I'd be _very_
>>>>      interested!
>>>>       >
>>>>       > I can easily add things to Dom0 and DomU. Whether that will
>>>>      disrupt the
>>>>       > experiment is, of course, another matter. Still please let me
>>>>      know what
>>>>       > would be helpful to do.
>>>>
>>>>      Is there a chance to switch to an upstream kernel in the guest? I'd like
>>>>      to add some diagnostic code to the kernel and creating the patches will
>>>>      be easier this way.
>>>>
>>>>
>>>> That's a bit tough -- the VM is based on stock Ubuntu and if I upgrade
>>>> the kernel I'll have fiddle with a lot things to make workload
>>>> functional again.
>>>>
>>>> However, I can install debug kernel (from Ubuntu, etc. etc.)
>>>>
>>>> Of course, if patching the kernel is the only way to make progress --
>>>> lets try that -- please let me know.
>>>
>>> I have found a nice upstream patch, which - with some modifications - I
>>> plan to give our customer as a workaround.
>>>
>>> The patch is for kernel 4.12, but chances are good it will apply to a
>>> 4.15 kernel, too.
>>
>> I'm slightly confused about this patch -- it seems to me that it needs
>> to be applied to the guest kernel, correct?
>>
>> If that's the case -- the challenge I have is that I need to re-build
>> the Canonical (Ubuntu) distro kernel with this patch -- this seems
>> a bit daunting at first (I mean -- I'm pretty good at rebuilding kernels
>> I just never do it with the vendor ones ;-)).
>>
>> So... if there's anyone here who has any suggestions on how to do that
>> -- I'd appreciate pointers.
>>
>>> I have been able to gather some more data.
>>>
>>> I have contacted the author of the upstream kernel patch I've been using
>>> for our customer (and that helped, by the way).
>>>
>>> It seems as if the problem is occurring when running as a guest at least
>>> under Xen, KVM, and VMWare, and there have been reports of bare metal
>>> cases, too. Hunting this bug is going on for several years now, the
>>> patch author is at it since 8 months.
>>>
>>> So we can rule out a Xen problem.
>>>
>>> Finding the root cause is still important, of course, and your setup
>>> seems to have the best reproduction rate up to now.
>>>
>>> So any help would really be appreciated.
>>>
>>> Is the VM self contained? Would it be possible to start it e.g. on a
>>> test system on my side? If yes, would you be allowed to pass it on to
>>> me?
>>
>> I'm working on externalizing the VM in a way that doesn't disclose anything
>> about the customer workload. I'm almost there -- sans my question about
>> the vendor kernel rebuild. I plan to make that VM available this week.
>>
>> Goes without saying, but I would really appreciate your help in chasing this.
>>
>> Thanks,
>> Roman.
> 

[-- Attachment #1.1.2: OpenPGP_0xB0DE9DD628BF132F.asc --]
[-- Type: application/pgp-keys, Size: 3135 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]