* Linux DomU freezes and dies under heavy memory shuffling
@ 2021-02-06 20:03 Roman Shaposhnik
  2021-02-16 20:34 ` Stefano Stabellini
  0 siblings, 1 reply; 14+ messages in thread
From: Roman Shaposhnik @ 2021-02-06 20:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Roman Shaposhnik

Hi!

All of a sudden (but only after a few days of running normally), on a stock
Ubuntu 18.04 (Bionic with a 4.15.0 kernel) DomU, I'm seeing Microsoft's .NET
runtime go into a heavy GC cycle and then freeze and die as shown below.
This is under stock Xen 4.14.0 on a pretty unremarkable x86_64 box made by
Supermicro.

I would really appreciate any thoughts on the subject, or at least directions
in which I should go to investigate this. At this point this part of Xen is
a bit of a mystery to me -- but I'm very much willing to learn ;-)

My completely uneducated guess is that it's some kind of issue between the
DomU shuffling memory much more than usual and Xen somehow getting unhappy
about that:

[376900.874560] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [dotnet:3518]
[376900.874764] Kernel panic - not syncing: softlockup: hung tasks
[376900.874793] CPU: 0 PID: 3518 Comm: dotnet Tainted: G L
4.15.0-112-generic #113-Ubuntu
[376900.874824] Hardware name: Xen HVM domU, BIOS 4.14.0 12/15/2020
[376900.874847] Call Trace:
[376900.874860] <IRQ>
[376900.874874] dump_stack+0x6d/0x8e
[376900.874892] panic+0xe4/0x254
[376900.874911] watchdog_timer_fn+0x21e/0x230
[376900.874928] ? watchdog+0x30/0x30
[376900.874947] __hrtimer_run_queues+0xdf/0x230
[376900.874970] hrtimer_interrupt+0xa0/0x1d0
[376900.874989] xen_timer_interrupt+0x20/0x30
[376900.875008] __handle_irq_event_percpu+0x44/0x1a0
[376900.875031] handle_irq_event_percpu+0x32/0x80
[376900.875053] handle_percpu_irq+0x3d/0x60
[376900.875071] generic_handle_irq+0x28/0x40
[376900.875090] __evtchn_fifo_handle_events+0x172/0x190
[376900.875112] evtchn_fifo_handle_events+0x10/0x20
[376900.875133] __xen_evtchn_do_upcall+0x49/0x80
[376900.875156] xen_evtchn_do_upcall+0x2b/0x50
[376900.875177] xen_hvm_callback_vector+0x90/0xa0
[376900.875197] </IRQ>
[376900.875211] RIP: 0010:smp_call_function_single+0xdc/0x100
[376900.875230] RSP: 0018:ffffaaa3c1807c20 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff0c
[376900.875261] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
0000000000000000
[376900.875288] RDX: 0000000000000001 RSI: 0000000000000003 RDI:
0000000000000003
[376900.875314] RBP: ffffaaa3c1807c70 R08: fffffffffffffffc R09:
0000000000000002
[376900.875341] R10: 0000000000000040 R11: 0000000000000000 R12:
ffff8e0ab2c1de70
[376900.875368] R13: 0000000000000000 R14: ffffffff95a7ecd0 R15:
ffffaaa3c1807d08
[376900.875396] ? flush_tlb_func_common.constprop.10+0x230/0x230
[376900.875424] ? flush_tlb_func_common.constprop.10+0x230/0x230
[376900.875449] ? unmap_page_range+0xbbc/0xd00
[376900.875470] smp_call_function_many+0x1cc/0x250
[376900.875491] ? smp_call_function_many+0x1cc/0x250
[376900.875513] native_flush_tlb_others+0x3c/0xf0
[376900.875534] flush_tlb_mm_range+0xae/0x110
[376900.875552] tlb_flush_mmu_tlbonly+0x5f/0xc0
[376900.875574] arch_tlb_finish_mmu+0x3f/0x80
[376900.875592] tlb_finish_mmu+0x23/0x30
[376900.875610] unmap_region+0xf7/0x130
[376900.875629] do_munmap+0x276/0x450
[376900.875647] vm_munmap+0x69/0xb0
[376900.875664] SyS_munmap+0x22/0x30
[376900.875682] do_syscall_64+0x73/0x130
[376900.875701] entry_SYSCALL_64_after_hwframe+0x41/0xa6
[376900.875721] RIP: 0033:0x7f05ad52dd59
[376900.875737] RSP: 002b:00007f05a8037150 EFLAGS: 00000246 ORIG_RAX:
000000000000000b
[376900.875765] RAX: ffffffffffffffda RBX: 000056517e2a08c0 RCX:
00007f05ad52dd59
[376900.875791] RDX: 0000000000000000 RSI: 0000000000006a00 RDI:
00007f05aad8f000
[376900.875818] RBP: 0000000000006a00 R08: 0000000000020b18 R09:
0000000000000000
[376900.875844] R10: 0000000000020ad0 R11: 0000000000000246 R12:
0000000000000001
[376900.875870] R13: 0000000000000000 R14: 000056517eb02300 R15:
00007f05aad8f000

Thanks,
Roman.


* Re: Linux DomU freezes and dies under heavy memory shuffling
  2021-02-06 20:03 Linux DomU freezes and dies under heavy memory shuffling Roman Shaposhnik
@ 2021-02-16 20:34 ` Stefano Stabellini
  2021-02-17  6:47   ` Jürgen Groß
  0 siblings, 1 reply; 14+ messages in thread
From: Stefano Stabellini @ 2021-02-16 20:34 UTC (permalink / raw)
  To: Roman Shaposhnik
  Cc: Xen-devel, jbeulich, andrew.cooper3, roger.pau, wl,
	george.dunlap, sstabellini

+ x86 maintainers

It looks like the tlbflush is getting stuck?


On Sat, 6 Feb 2021, Roman Shaposhnik wrote:
> Hi!
> 
> All of a sudden (but only after a few days of running normally), on a stock
> Ubuntu 18.04 (Bionic with a 4.15.0 kernel) DomU, I'm seeing Microsoft's .NET
> runtime go into a heavy GC cycle and then freeze and die as shown below.
> This is under stock Xen 4.14.0 on a pretty unremarkable x86_64 box made by
> Supermicro.
> 
> I would really appreciate any thoughts on the subject, or at least directions
> in which I should go to investigate this. At this point this part of Xen is
> a bit of a mystery to me -- but I'm very much willing to learn ;-)
> 
> My completely uneducated guess is that it's some kind of issue between the
> DomU shuffling memory much more than usual and Xen somehow getting unhappy
> about that:
> 
> [376900.874560] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [dotnet:3518]
> [376900.874764] Kernel panic - not syncing: softlockup: hung tasks
> [376900.874793] CPU: 0 PID: 3518 Comm: dotnet Tainted: G L
> 4.15.0-112-generic #113-Ubuntu
> [376900.874824] Hardware name: Xen HVM domU, BIOS 4.14.0 12/15/2020
> [376900.874847] Call Trace:
> [376900.874860] <IRQ>
> [376900.874874] dump_stack+0x6d/0x8e
> [376900.874892] panic+0xe4/0x254
> [376900.874911] watchdog_timer_fn+0x21e/0x230
> [376900.874928] ? watchdog+0x30/0x30
> [376900.874947] __hrtimer_run_queues+0xdf/0x230
> [376900.874970] hrtimer_interrupt+0xa0/0x1d0
> [376900.874989] xen_timer_interrupt+0x20/0x30
> [376900.875008] __handle_irq_event_percpu+0x44/0x1a0
> [376900.875031] handle_irq_event_percpu+0x32/0x80
> [376900.875053] handle_percpu_irq+0x3d/0x60
> [376900.875071] generic_handle_irq+0x28/0x40
> [376900.875090] __evtchn_fifo_handle_events+0x172/0x190
> [376900.875112] evtchn_fifo_handle_events+0x10/0x20
> [376900.875133] __xen_evtchn_do_upcall+0x49/0x80
> [376900.875156] xen_evtchn_do_upcall+0x2b/0x50
> [376900.875177] xen_hvm_callback_vector+0x90/0xa0
> [376900.875197] </IRQ>
> [376900.875211] RIP: 0010:smp_call_function_single+0xdc/0x100
> [376900.875230] RSP: 0018:ffffaaa3c1807c20 EFLAGS: 00000202 ORIG_RAX:
> ffffffffffffff0c
> [376900.875261] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> 0000000000000000
> [376900.875288] RDX: 0000000000000001 RSI: 0000000000000003 RDI:
> 0000000000000003
> [376900.875314] RBP: ffffaaa3c1807c70 R08: fffffffffffffffc R09:
> 0000000000000002
> [376900.875341] R10: 0000000000000040 R11: 0000000000000000 R12:
> ffff8e0ab2c1de70
> [376900.875368] R13: 0000000000000000 R14: ffffffff95a7ecd0 R15:
> ffffaaa3c1807d08
> [376900.875396] ? flush_tlb_func_common.constprop.10+0x230/0x230
> [376900.875424] ? flush_tlb_func_common.constprop.10+0x230/0x230
> [376900.875449] ? unmap_page_range+0xbbc/0xd00
> [376900.875470] smp_call_function_many+0x1cc/0x250
> [376900.875491] ? smp_call_function_many+0x1cc/0x250
> [376900.875513] native_flush_tlb_others+0x3c/0xf0
> [376900.875534] flush_tlb_mm_range+0xae/0x110
> [376900.875552] tlb_flush_mmu_tlbonly+0x5f/0xc0
> [376900.875574] arch_tlb_finish_mmu+0x3f/0x80
> [376900.875592] tlb_finish_mmu+0x23/0x30
> [376900.875610] unmap_region+0xf7/0x130
> [376900.875629] do_munmap+0x276/0x450
> [376900.875647] vm_munmap+0x69/0xb0
> [376900.875664] SyS_munmap+0x22/0x30
> [376900.875682] do_syscall_64+0x73/0x130
> [376900.875701] entry_SYSCALL_64_after_hwframe+0x41/0xa6
> [376900.875721] RIP: 0033:0x7f05ad52dd59
> [376900.875737] RSP: 002b:00007f05a8037150 EFLAGS: 00000246 ORIG_RAX:
> 000000000000000b
> [376900.875765] RAX: ffffffffffffffda RBX: 000056517e2a08c0 RCX:
> 00007f05ad52dd59
> [376900.875791] RDX: 0000000000000000 RSI: 0000000000006a00 RDI:
> 00007f05aad8f000
> [376900.875818] RBP: 0000000000006a00 R08: 0000000000020b18 R09:
> 0000000000000000
> [376900.875844] R10: 0000000000020ad0 R11: 0000000000000246 R12:
> 0000000000000001
> [376900.875870] R13: 0000000000000000 R14: 000056517eb02300 R15:
> 00007f05aad8f000
> 
> Thanks,
> Roman.
> 


* Re: Linux DomU freezes and dies under heavy memory shuffling
  2021-02-16 20:34 ` Stefano Stabellini
@ 2021-02-17  6:47   ` Jürgen Groß
  2021-02-17  8:12     ` Roman Shaposhnik
  0 siblings, 1 reply; 14+ messages in thread
From: Jürgen Groß @ 2021-02-17  6:47 UTC (permalink / raw)
  To: Stefano Stabellini, Roman Shaposhnik
  Cc: Xen-devel, jbeulich, andrew.cooper3, roger.pau, wl, george.dunlap


On 16.02.21 21:34, Stefano Stabellini wrote:
> + x86 maintainers
> 
> It looks like the tlbflush is getting stuck?

I have seen this case multiple times on customer systems now, but
reproducing it reliably seems to be very hard.

I suspected fifo events to be blamed, but just yesterday I've been
informed of another case with fifo events disabled in the guest.

One common pattern seems to be that up to now I have seen this effect
only on systems with Intel Gold cpus. Can it be confirmed to be true
in this case, too?

In case anybody has a reproducer (either in a guest or dom0) with a
setup where a diagnostic kernel can be used, I'd be _very_ interested!


Juergen


* Re: Linux DomU freezes and dies under heavy memory shuffling
  2021-02-17  6:47   ` Jürgen Groß
@ 2021-02-17  8:12     ` Roman Shaposhnik
  2021-02-17  8:29       ` Jürgen Groß
  0 siblings, 1 reply; 14+ messages in thread
From: Roman Shaposhnik @ 2021-02-17  8:12 UTC (permalink / raw)
  To: Jürgen Groß
  Cc: Stefano Stabellini, Xen-devel, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap

Hi Jürgen, thanks for taking a look at this. A few comments below:

On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß <jgross@suse.com> wrote:
>
> On 16.02.21 21:34, Stefano Stabellini wrote:
> > + x86 maintainers
> >
> > It looks like the tlbflush is getting stuck?
>
> I have seen this case multiple times on customer systems now, but
> reproducing it reliably seems to be very hard.

It is reliably reproducible under my workload but it takes a long time
(~3 days of the workload running in the lab).

> I suspected fifo events to be blamed, but just yesterday I've been
> informed of another case with fifo events disabled in the guest.
>
> One common pattern seems to be that up to now I have seen this effect
> only on systems with Intel Gold cpus. Can it be confirmed to be true
> in this case, too?

I am pretty sure mine isn't -- I can get you full CPU specs if that's useful.

> In case anybody has a reproducer (either in a guest or dom0) with a
> setup where a diagnostic kernel can be used, I'd be _very_ interested!

I can easily add things to Dom0 and DomU. Whether that will disrupt the
experiment is, of course, another matter. Still please let me know what
would be helpful to do.

Thanks,
Roman.


* Re: Linux DomU freezes and dies under heavy memory shuffling
  2021-02-17  8:12     ` Roman Shaposhnik
@ 2021-02-17  8:29       ` Jürgen Groß
  2021-02-18  5:21         ` Roman Shaposhnik
  0 siblings, 1 reply; 14+ messages in thread
From: Jürgen Groß @ 2021-02-17  8:29 UTC (permalink / raw)
  To: Roman Shaposhnik
  Cc: Stefano Stabellini, Xen-devel, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap


On 17.02.21 09:12, Roman Shaposhnik wrote:
> Hi Jürgen, thanks for taking a look at this. A few comments below:
> 
> On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß <jgross@suse.com> wrote:
>>
>> On 16.02.21 21:34, Stefano Stabellini wrote:
>>> + x86 maintainers
>>>
>>> It looks like the tlbflush is getting stuck?
>>
>> I have seen this case multiple times on customer systems now, but
>> reproducing it reliably seems to be very hard.
> 
> It is reliably reproducible under my workload but it takes a long time
> (~3 days of the workload running in the lab).

This is by far the best reproduction rate I have seen up to now.

The next best reproducer seems to be a huge installation with several
hundred hosts and thousands of VMs with about 1 crash each week.

> 
>> I suspected fifo events to be blamed, but just yesterday I've been
>> informed of another case with fifo events disabled in the guest.
>>
>> One common pattern seems to be that up to now I have seen this effect
>> only on systems with Intel Gold cpus. Can it be confirmed to be true
>> in this case, too?
> 
> I am pretty sure mine isn't -- I can get you full CPU specs if that's useful.

Just the output of "grep model /proc/cpuinfo" should be enough.

> 
>> In case anybody has a reproducer (either in a guest or dom0) with a
>> setup where a diagnostic kernel can be used, I'd be _very_ interested!
> 
> I can easily add things to Dom0 and DomU. Whether that will disrupt the
> experiment is, of course, another matter. Still please let me know what
> would be helpful to do.

Is there a chance to switch to an upstream kernel in the guest? I'd like
to add some diagnostic code to the kernel and creating the patches will
be easier this way.


Juergen


* Re: Linux DomU freezes and dies under heavy memory shuffling
  2021-02-17  8:29       ` Jürgen Groß
@ 2021-02-18  5:21         ` Roman Shaposhnik
  2021-02-18  9:34           ` Jürgen Groß
  2021-02-23 13:17           ` Jürgen Groß
  0 siblings, 2 replies; 14+ messages in thread
From: Roman Shaposhnik @ 2021-02-18  5:21 UTC (permalink / raw)
  To: Jürgen Groß
  Cc: Stefano Stabellini, Xen-devel, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap

On Wed, Feb 17, 2021 at 12:29 AM Jürgen Groß <jgross@suse.com> wrote:

> On 17.02.21 09:12, Roman Shaposhnik wrote:
> > Hi Jürgen, thanks for taking a look at this. A few comments below:
> >
> > On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß <jgross@suse.com> wrote:
> >>
> >> On 16.02.21 21:34, Stefano Stabellini wrote:
> >>> + x86 maintainers
> >>>
> >>> It looks like the tlbflush is getting stuck?
> >>
> >> I have seen this case multiple times on customer systems now, but
> >> reproducing it reliably seems to be very hard.
> >
> > It is reliably reproducible under my workload but it takes a long time
> > (~3 days of the workload running in the lab).
>
> This is by far the best reproduction rate I have seen up to now.
>
> The next best reproducer seems to be a huge installation with several
> hundred hosts and thousands of VMs with about 1 crash each week.
>
> >
> >> I suspected fifo events to be blamed, but just yesterday I've been
> >> informed of another case with fifo events disabled in the guest.
> >>
> >> One common pattern seems to be that up to now I have seen this effect
> >> only on systems with Intel Gold cpus. Can it be confirmed to be true
> >> in this case, too?
> >
> > I am pretty sure mine isn't -- I can get you full CPU specs if that's
> useful.
>
> Just the output of "grep model /proc/cpuinfo" should be enough.
>

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 77
model name : Intel(R) Atom(TM) CPU  C2550  @ 2.40GHz
stepping : 8
microcode : 0x12d
cpu MHz : 1200.070
cache size : 1024 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 6
initial apicid : 6
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16
xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes rdrand lahf_lm
3dnowprefetch cpuid_fault epb pti ibrs ibpb stibp tpr_shadow vnmi
flexpriority ept vpid tsc_adjust smep erms dtherm ida arat md_clear
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority
tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 mds msbds_only
bogomips : 4800.19
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:


> >
> >> In case anybody has a reproducer (either in a guest or dom0) with a
> >> setup where a diagnostic kernel can be used, I'd be _very_ interested!
> >
> > I can easily add things to Dom0 and DomU. Whether that will disrupt the
> > experiment is, of course, another matter. Still please let me know what
> > would be helpful to do.
>
> Is there a chance to switch to an upstream kernel in the guest? I'd like
> to add some diagnostic code to the kernel and creating the patches will
> be easier this way.
>

That's a bit tough -- the VM is based on stock Ubuntu and if I upgrade the
kernel I'll have to fiddle with a lot of things to make the workload functional again.

However, I can install debug kernel (from Ubuntu, etc. etc.)

Of course, if patching the kernel is the only way to make progress -- let's
try that -- please let me know.

Thanks,
Roman.


* Re: Linux DomU freezes and dies under heavy memory shuffling
  2021-02-18  5:21         ` Roman Shaposhnik
@ 2021-02-18  9:34           ` Jürgen Groß
  2021-02-23 13:17           ` Jürgen Groß
  1 sibling, 0 replies; 14+ messages in thread
From: Jürgen Groß @ 2021-02-18  9:34 UTC (permalink / raw)
  To: Roman Shaposhnik
  Cc: Stefano Stabellini, Xen-devel, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap


On 18.02.21 06:21, Roman Shaposhnik wrote:
> On Wed, Feb 17, 2021 at 12:29 AM Jürgen Groß <jgross@suse.com 
> <mailto:jgross@suse.com>> wrote:
> 
>     On 17.02.21 09:12, Roman Shaposhnik wrote:
>      > Hi Jürgen, thanks for taking a look at this. A few comments below:
>      >
>      > On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß <jgross@suse.com
>     <mailto:jgross@suse.com>> wrote:
>      >>
>      >> On 16.02.21 21:34, Stefano Stabellini wrote:
>      >>> + x86 maintainers
>      >>>
>      >>> It looks like the tlbflush is getting stuck?
>      >>
>      >> I have seen this case multiple times on customer systems now, but
>      >> reproducing it reliably seems to be very hard.
>      >
>      > It is reliably reproducible under my workload but it takes a long time
>      > (~3 days of the workload running in the lab).
> 
>     This is by far the best reproduction rate I have seen up to now.
> 
>     The next best reproducer seems to be a huge installation with several
>     hundred hosts and thousands of VMs with about 1 crash each week.
> 
>      >
>      >> I suspected fifo events to be blamed, but just yesterday I've been
>      >> informed of another case with fifo events disabled in the guest.
>      >>
>      >> One common pattern seems to be that up to now I have seen this
>     effect
>      >> only on systems with Intel Gold cpus. Can it be confirmed to be true
>      >> in this case, too?
>      >
>      > I am pretty sure mine isn't -- I can get you full CPU specs if
>     that's useful.
> 
>     Just the output of "grep model /proc/cpuinfo" should be enough.
> 
> 
> processor: 3
> vendor_id: GenuineIntel
> cpu family: 6
> model: 77
> model name: Intel(R) Atom(TM) CPU  C2550  @ 2.40GHz
> stepping: 8
> microcode: 0x12d
> cpu MHz: 1200.070
> cache size: 1024 KB
> physical id: 0
> siblings: 4
> core id: 3
> cpu cores: 4
> apicid: 6
> initial apicid: 6
> fpu: yes
> fpu_exception: yes
> cpuid level: 11
> wp: yes
> flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat 
> pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp 
> lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
> nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est 
> tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer 
> aes rdrand lahf_lm 3dnowprefetch cpuid_fault epb pti ibrs ibpb stibp 
> tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida 
> arat md_clear
> vmx flags: vnmi preemption_timer invvpid ept_x_only flexpriority 
> tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
> bugs: cpu_meltdown spectre_v1 spectre_v2 mds msbds_only
> bogomips: 4800.19
> clflush size: 64
> cache_alignment: 64
> address sizes: 36 bits physical, 48 bits virtual
> power management:
> 
>      >
>      >> In case anybody has a reproducer (either in a guest or dom0) with a
>      >> setup where a diagnostic kernel can be used, I'd be _very_
>     interested!
>      >
>      > I can easily add things to Dom0 and DomU. Whether that will
>     disrupt the
>      > experiment is, of course, another matter. Still please let me
>     know what
>      > would be helpful to do.
> 
>     Is there a chance to switch to an upstream kernel in the guest? I'd like
>     to add some diagnostic code to the kernel and creating the patches will
>     be easier this way.
> 
> 
> That's a bit tough -- the VM is based on stock Ubuntu and if I upgrade 
> the kernel I'll have to fiddle with a lot of things to make the workload
> functional again.
> 
> However, I can install debug kernel (from Ubuntu, etc. etc.)
> 
> Of course, if patching the kernel is the only way to make progress -- 
> let's try that -- please let me know.

I have found a nice upstream patch, which - with some modifications - I
plan to give our customer as a workaround.

The patch is for kernel 4.12, but chances are good it will apply to a
4.15 kernel, too.

Are you able to give it a try? I hope it will fix the hangs, and in case
a hang gets fixed up there should be a message on the console.


Juergen

[-- Attachment #1.1.2: 0001-kernel-smp-Provide-CSD-lock-timeout-diagnostics.patch --]
[-- Type: text/x-patch, Size: 6378 bytes --]

From: "Juergen Gross" <jgross@suse.com>
Date: Thu, 18 Feb 2021 09:22:54 +0100
Subject: [PATCH] kernel/smp: Provide CSD lock timeout diagnostics

This commit causes csd_lock_wait() to emit diagnostics when a CPU
fails to respond quickly enough to one of the smp_call_function()
family of function calls.

In case such a stall is detected, the CPU which ought to execute the
function will be pinged again in case the IPI somehow got lost.

This commit is based on an upstream patch by Paul E. McKenney.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
diff --git a/kernel/smp.c b/kernel/smp.c
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -19,6 +19,9 @@
 #include <linux/sched.h>
 #include <linux/sched/idle.h>
 #include <linux/hypervisor.h>
+#include <linux/sched/clock.h>
+#include <linux/nmi.h>
+#include <linux/sched/debug.h>
 
 #include "smpboot.h"
 
@@ -96,6 +99,79 @@ void __init call_function_init(void)
 	smpcfd_prepare_cpu(smp_processor_id());
 }
 
+static DEFINE_PER_CPU(call_single_data_t *, cur_csd);
+static DEFINE_PER_CPU(smp_call_func_t, cur_csd_func);
+static DEFINE_PER_CPU(void *, cur_csd_info);
+
+#define CSD_LOCK_TIMEOUT (5ULL * NSEC_PER_SEC)
+atomic_t csd_bug_count = ATOMIC_INIT(0);
+
+/* Record current CSD work for current CPU, NULL to erase. */
+static void csd_lock_record(call_single_data_t *csd)
+{
+	if (!csd) {
+		smp_mb(); /* NULL cur_csd after unlock. */
+		__this_cpu_write(cur_csd, NULL);
+		return;
+	}
+	__this_cpu_write(cur_csd_func, csd->func);
+	__this_cpu_write(cur_csd_info, csd->info);
+	smp_wmb(); /* func and info before csd. */
+	__this_cpu_write(cur_csd, csd);
+	smp_mb(); /* Update cur_csd before function call. */
+		  /* Or before unlock, as the case may be. */
+}
+
+/* Complain if too much time spent waiting. */
+static __always_inline bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id, unsigned int cpu)
+{
+	bool firsttime;
+	u64 ts2, ts_delta;
+	call_single_data_t *cpu_cur_csd;
+	unsigned int flags = READ_ONCE(csd->flags);
+
+	if (!(flags & CSD_FLAG_LOCK)) {
+		if (!unlikely(*bug_id))
+			return true;
+		pr_alert("csd: CSD lock (#%d) got unstuck on CPU#%02d, CPU#%02d released the lock.\n",
+			 *bug_id, raw_smp_processor_id(), cpu);
+		return true;
+	}
+
+	ts2 = sched_clock();
+	ts_delta = ts2 - *ts1;
+	if (likely(ts_delta <= CSD_LOCK_TIMEOUT))
+		return false;
+
+	firsttime = !*bug_id;
+	if (firsttime)
+		*bug_id = atomic_inc_return(&csd_bug_count);
+	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpu)); /* Before func and info. */
+	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
+		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
+		 cpu, csd->func, csd->info);
+	if (cpu_cur_csd && csd != cpu_cur_csd) {
+		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
+			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpu)),
+			 READ_ONCE(per_cpu(cur_csd_info, cpu)));
+	} else {
+		pr_alert("\tcsd: CSD lock (#%d) %s.\n",
+			 *bug_id, !cpu_cur_csd ? "unresponsive" : "handling this request");
+	}
+	if (cpu >= 0) {
+		if (!trigger_single_cpu_backtrace(cpu))
+			dump_cpu_task(cpu);
+		if (!cpu_cur_csd) {
+			pr_alert("csd: Re-sending CSD lock (#%d) IPI from CPU#%02d to CPU#%02d\n", *bug_id, raw_smp_processor_id(), cpu);
+			arch_send_call_function_single_ipi(cpu);
+		}
+	}
+	dump_stack();
+	*ts1 = ts2;
+
+	return false;
+}
+
 /*
  * csd_lock/csd_unlock used to serialize access to per-cpu csd resources
  *
@@ -103,14 +179,23 @@ void __init call_function_init(void)
  * previous function call. For multi-cpu calls its even more interesting
  * as we'll have to ensure no other cpu is observing our csd.
  */
-static __always_inline void csd_lock_wait(call_single_data_t *csd)
+static __always_inline void csd_lock_wait(call_single_data_t *csd, unsigned int cpu)
 {
-	smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));
+	int bug_id = 0;
+	u64 ts0, ts1;
+
+	ts1 = ts0 = sched_clock();
+	for (;;) {
+		if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id, cpu))
+			break;
+		cpu_relax();
+	}
+	smp_acquire__after_ctrl_dep();
 }
 
-static __always_inline void csd_lock(call_single_data_t *csd)
+static __always_inline void csd_lock(call_single_data_t *csd, unsigned int cpu)
 {
-	csd_lock_wait(csd);
+	csd_lock_wait(csd, cpu);
 	csd->flags |= CSD_FLAG_LOCK;
 
 	/*
@@ -148,9 +233,11 @@ static int generic_exec_single(int cpu,
 		 * We can unlock early even for the synchronous on-stack case,
 		 * since we're doing this from the same CPU..
 		 */
+		csd_lock_record(csd);
 		csd_unlock(csd);
 		local_irq_save(flags);
 		func(info);
+		csd_lock_record(NULL);
 		local_irq_restore(flags);
 		return 0;
 	}
@@ -238,6 +325,7 @@ static void flush_smp_call_function_queu
 		smp_call_func_t func = csd->func;
 		void *info = csd->info;
 
+		csd_lock_record(csd);
 		/* Do we wait until *after* callback? */
 		if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
 			func(info);
@@ -246,6 +334,7 @@ static void flush_smp_call_function_queu
 			csd_unlock(csd);
 			func(info);
 		}
+		csd_lock_record(NULL);
 	}
 
 	/*
@@ -293,13 +382,13 @@ int smp_call_function_single(int cpu, sm
 	csd = &csd_stack;
 	if (!wait) {
 		csd = this_cpu_ptr(&csd_data);
-		csd_lock(csd);
+		csd_lock(csd, cpu);
 	}
 
 	err = generic_exec_single(cpu, csd, func, info);
 
 	if (wait)
-		csd_lock_wait(csd);
+		csd_lock_wait(csd, cpu);
 
 	put_cpu();
 
@@ -331,7 +420,7 @@ int smp_call_function_single_async(int c
 
 	/* We could deadlock if we have to wait here with interrupts disabled! */
 	if (WARN_ON_ONCE(csd->flags & CSD_FLAG_LOCK))
-		csd_lock_wait(csd);
+		csd_lock_wait(csd, cpu);
 
 	csd->flags = CSD_FLAG_LOCK;
 	smp_wmb();
@@ -448,7 +537,7 @@ void smp_call_function_many(const struct
 	for_each_cpu(cpu, cfd->cpumask) {
 		call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
 
-		csd_lock(csd);
+		csd_lock(csd, cpu);
 		if (wait)
 			csd->flags |= CSD_FLAG_SYNCHRONOUS;
 		csd->func = func;
@@ -465,7 +554,7 @@ void smp_call_function_many(const struct
 			call_single_data_t *csd;
 
 			csd = per_cpu_ptr(cfd->csd, cpu);
-			csd_lock_wait(csd);
+			csd_lock_wait(csd, cpu);
 		}
 	}
 }


* Re: Linux DomU freezes and dies under heavy memory shuffling
  2021-02-18  5:21         ` Roman Shaposhnik
  2021-02-18  9:34           ` Jürgen Groß
@ 2021-02-23 13:17           ` Jürgen Groß
  2021-02-25  3:06             ` Roman Shaposhnik
  1 sibling, 1 reply; 14+ messages in thread
From: Jürgen Groß @ 2021-02-23 13:17 UTC (permalink / raw)
  To: Roman Shaposhnik
  Cc: Stefano Stabellini, Xen-devel, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap


On 18.02.21 06:21, Roman Shaposhnik wrote:
> On Wed, Feb 17, 2021 at 12:29 AM Jürgen Groß <jgross@suse.com 
> <mailto:jgross@suse.com>> wrote:
> 
>     On 17.02.21 09:12, Roman Shaposhnik wrote:
>      > Hi Jürgen, thanks for taking a look at this. A few comments below:
>      >
>      > On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß <jgross@suse.com
>     <mailto:jgross@suse.com>> wrote:
>      >>
>      >> On 16.02.21 21:34, Stefano Stabellini wrote:
>      >>> + x86 maintainers
>      >>>
>      >>> It looks like the tlbflush is getting stuck?
>      >>
>      >> I have seen this case multiple times on customer systems now, but
>      >> reproducing it reliably seems to be very hard.
>      >
>      > It is reliably reproducible under my workload but it takes a long time
>      > (~3 days of the workload running in the lab).
> 
>     This is by far the best reproduction rate I have seen up to now.
> 
>     The next best reproducer seems to be a huge installation with several
>     hundred hosts and thousands of VMs with about 1 crash each week.
> 
>      >
>      >> I suspected fifo events to be blamed, but just yesterday I've been
>      >> informed of another case with fifo events disabled in the guest.
>      >>
>      >> One common pattern seems to be that up to now I have seen this
>     effect
>      >> only on systems with Intel Gold cpus. Can it be confirmed to be true
>      >> in this case, too?
>      >
>      > I am pretty sure mine isn't -- I can get you full CPU specs if
>     that's useful.
> 
>     Just the output of "grep model /proc/cpuinfo" should be enough.
> 
> 
> processor: 3
> vendor_id: GenuineIntel
> cpu family: 6
> model: 77
> model name: Intel(R) Atom(TM) CPU  C2550  @ 2.40GHz
> stepping: 8
> microcode: 0x12d
> cpu MHz: 1200.070
> cache size: 1024 KB
> physical id: 0
> siblings: 4
> core id: 3
> cpu cores: 4
> apicid: 6
> initial apicid: 6
> fpu: yes
> fpu_exception: yes
> cpuid level: 11
> wp: yes
> flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat 
> pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp 
> lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
> nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est 
> tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer 
> aes rdrand lahf_lm 3dnowprefetch cpuid_fault epb pti ibrs ibpb stibp 
> tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida 
> arat md_clear
> vmx flags: vnmi preemption_timer invvpid ept_x_only flexpriority 
> tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
> bugs: cpu_meltdown spectre_v1 spectre_v2 mds msbds_only
> bogomips: 4800.19
> clflush size: 64
> cache_alignment: 64
> address sizes: 36 bits physical, 48 bits virtual
> power management:
> 
>      >
>      >> In case anybody has a reproducer (either in a guest or dom0) with a
>      >> setup where a diagnostic kernel can be used, I'd be _very_
>     interested!
>      >
>      > I can easily add things to Dom0 and DomU. Whether that will
>     disrupt the
>      > experiment is, of course, another matter. Still please let me
>     know what
>      > would be helpful to do.
> 
>     Is there a chance to switch to an upstream kernel in the guest? I'd like
>     to add some diagnostic code to the kernel and creating the patches will
>     be easier this way.
> 
> 
> That's a bit tough -- the VM is based on stock Ubuntu and if I upgrade 
> the kernel I'll have to fiddle with a lot of things to make the workload
> functional again.
> 
> However, I can install debug kernel (from Ubuntu, etc. etc.)
> 
> Of course, if patching the kernel is the only way to make progress -- 
> let's try that -- please let me know.

I have been able to gather some more data.

I have contacted the author of the upstream kernel patch I've been using
for our customer (and that helped, by the way).

It seems as if the problem is occurring when running as a guest at least
under Xen, KVM, and VMware, and there have been reports of bare-metal
cases, too. Hunting this bug has been going on for several years now; the
patch author has been at it for 8 months.

So we can rule out a Xen problem.

Finding the root cause is still important, of course, and your setup
seems to have the best reproduction rate up to now.

So any help would really be appreciated.

Is the VM self contained? Would it be possible to start it e.g. on a
test system on my side? If yes, would you be allowed to pass it on to
me?


Juergen


* Re: Linux DomU freezes and dies under heavy memory shuffling
  2021-02-23 13:17           ` Jürgen Groß
@ 2021-02-25  3:06             ` Roman Shaposhnik
  2021-02-25  3:44               ` Elliott Mitchell
  2021-03-12 21:33               ` Roman Shaposhnik
  0 siblings, 2 replies; 14+ messages in thread
From: Roman Shaposhnik @ 2021-02-25  3:06 UTC (permalink / raw)
  To: Jürgen Groß
  Cc: Stefano Stabellini, Xen-devel, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap

Hi Jürgen!

sorry for the belated reply -- I wanted to externalize the VM before
replying -- but let me at least reply to you now:

On Tue, Feb 23, 2021 at 5:17 AM Jürgen Groß <jgross@suse.com> wrote:
>
> On 18.02.21 06:21, Roman Shaposhnik wrote:
> > On Wed, Feb 17, 2021 at 12:29 AM Jürgen Groß <jgross@suse.com
> > <mailto:jgross@suse.com>> wrote:
> >
> >     On 17.02.21 09:12, Roman Shaposhnik wrote:
> >      > Hi Jürgen, thanks for taking a look at this. A few comments below:
> >      >
> >      > On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß <jgross@suse.com
> >     <mailto:jgross@suse.com>> wrote:
> >      >>
> >      >> On 16.02.21 21:34, Stefano Stabellini wrote:
> >      >>> + x86 maintainers
> >      >>>
> >      >>> It looks like the tlbflush is getting stuck?
> >      >>
> >      >> I have seen this case multiple times on customer systems now, but
> >      >> reproducing it reliably seems to be very hard.
> >      >
> >      > It is reliably reproducible under my workload but it takes a long time
> >      > (~3 days of the workload running in the lab).
> >
> >     This is by far the best reproduction rate I have seen up to now.
> >
> >     The next best reproducer seems to be a huge installation with several
> >     hundred hosts and thousands of VMs with about 1 crash each week.
> >
> >      >
> >      >> I suspected fifo events to be blamed, but just yesterday I've been
> >      >> informed of another case with fifo events disabled in the guest.
> >      >>
> >      >> One common pattern seems to be that up to now I have seen this
> >     effect
> >      >> only on systems with Intel Gold cpus. Can it be confirmed to be true
> >      >> in this case, too?
> >      >
> >      > I am pretty sure mine isn't -- I can get you full CPU specs if
> >     that's useful.
> >
> >     Just the output of "grep model /proc/cpuinfo" should be enough.
> >
> >
> > processor: 3
> > vendor_id: GenuineIntel
> > cpu family: 6
> > model: 77
> > model name: Intel(R) Atom(TM) CPU  C2550  @ 2.40GHz
> > stepping: 8
> > microcode: 0x12d
> > cpu MHz: 1200.070
> > cache size: 1024 KB
> > physical id: 0
> > siblings: 4
> > core id: 3
> > cpu cores: 4
> > apicid: 6
> > initial apicid: 6
> > fpu: yes
> > fpu_exception: yes
> > cpuid level: 11
> > wp: yes
> > flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
> > pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp
> > lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
> > nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est
> > tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer
> > aes rdrand lahf_lm 3dnowprefetch cpuid_fault epb pti ibrs ibpb stibp
> > tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida
> > arat md_clear
> > vmx flags: vnmi preemption_timer invvpid ept_x_only flexpriority
> > tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
> > bugs: cpu_meltdown spectre_v1 spectre_v2 mds msbds_only
> > bogomips: 4800.19
> > clflush size: 64
> > cache_alignment: 64
> > address sizes: 36 bits physical, 48 bits virtual
> > power management:
> >
> >      >
> >      >> In case anybody has a reproducer (either in a guest or dom0) with a
> >      >> setup where a diagnostic kernel can be used, I'd be _very_
> >     interested!
> >      >
> >      > I can easily add things to Dom0 and DomU. Whether that will
> >     disrupt the
> >      > experiment is, of course, another matter. Still please let me
> >     know what
> >      > would be helpful to do.
> >
> >     Is there a chance to switch to an upstream kernel in the guest? I'd like
> >     to add some diagnostic code to the kernel and creating the patches will
> >     be easier this way.
> >
> >
> > That's a bit tough -- the VM is based on stock Ubuntu and if I upgrade
> > the kernel I'll have to fiddle with a lot of things to make the workload
> > functional again.
> >
> > However, I can install debug kernel (from Ubuntu, etc. etc.)
> >
> > Of course, if patching the kernel is the only way to make progress --
> > let's try that -- please let me know.
>
> I have found a nice upstream patch, which - with some modifications - I
> plan to give our customer as a workaround.
>
> The patch is for kernel 4.12, but chances are good it will apply to a
> 4.15 kernel, too.

I'm slightly confused about this patch -- it seems to me that it needs
to be applied to the guest kernel, correct?

If that's the case -- the challenge I have is that I need to re-build
the Canonical (Ubuntu) distro kernel with this patch -- this seems
a bit daunting at first (I mean -- I'm pretty good at rebuilding kernels
I just never do it with the vendor ones ;-)).

So... if there's anyone here who has any suggestions on how to do that
-- I'd appreciate pointers.

> I have been able to gather some more data.
>
> I have contacted the author of the upstream kernel patch I've been using
> for our customer (and that helped, by the way).
>
> It seems as if the problem is occurring when running as a guest at least
> under Xen, KVM, and VMware, and there have been reports of bare-metal
> cases, too. Hunting this bug has been going on for several years now; the
> patch author has been at it for 8 months.
>
> So we can rule out a Xen problem.
>
> Finding the root cause is still important, of course, and your setup
> seems to have the best reproduction rate up to now.
>
> So any help would really be appreciated.
>
> Is the VM self contained? Would it be possible to start it e.g. on a
> test system on my side? If yes, would you be allowed to pass it on to
> me?

I'm working on externalizing the VM in a way that doesn't disclose anything
about the customer workload. I'm almost there -- sans my question about
the vendor kernel rebuild. I plan to make that VM available this week.

Goes without saying, but I would really appreciate your help in chasing this.

Thanks,
Roman.


* Re: Linux DomU freezes and dies under heavy memory shuffling
  2021-02-25  3:06             ` Roman Shaposhnik
@ 2021-02-25  3:44               ` Elliott Mitchell
  2021-02-25  4:30                 ` Roman Shaposhnik
  2021-03-12 21:33               ` Roman Shaposhnik
  1 sibling, 1 reply; 14+ messages in thread
From: Elliott Mitchell @ 2021-02-25  3:44 UTC (permalink / raw)
  To: Roman Shaposhnik
  Cc: Jürgen Groß,
	Stefano Stabellini, Xen-devel, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap

On Wed, Feb 24, 2021 at 07:06:25PM -0800, Roman Shaposhnik wrote:
> I'm slightly confused about this patch -- it seems to me that it needs
> to be applied to the guest kernel, correct?
> 
> If that's the case -- the challenge I have is that I need to re-build
> the Canonical (Ubuntu) distro kernel with this patch -- this seems
> a bit daunting at first (I mean -- I'm pretty good at rebuilding kernels
> I just never do it with the vendor ones ;-)).
> 
> So... if there's anyone here who has any suggestions on how to do that
> -- I'd appreciate pointers.

Generally Debian-derivatives ship the kernel source they use as packages
named "linux-source-<major>.<minor>" (guessing you need
linux-source-5.4?).  They ship their configurations as packages
"linux-config-<major>.<minor>", but they also ship their configuration
with their kernels as /boot/config-<version>.

If you're trying to create a proper packaged kernel, the Linux kernel
Make target "bindeb-pkg" will create an appropriate .deb file.
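
Roughly, for a Bionic guest, something like this should do the trick
(untested sketch; the package, tarball, and patch file names here are
guesses, adjust for your release):

    apt-get install linux-source           # drops a tarball under /usr/src
    cd /usr/src
    tar -xf linux-source-4.15.0.tar.bz2
    cd linux-source-4.15.0
    cp /boot/config-$(uname -r) .config    # reuse the running kernel's config
    patch -p1 < ~/csd-lock-diag.patch      # hypothetical name for the patch
    make olddefconfig                      # fill in any new config options
    make -j$(nproc) bindeb-pkg             # emits ../linux-image-*.deb

Installing the resulting linux-image .deb in the guest and rebooting into
it should be enough to test.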

If you wish to extract a Debian package, they're some tarballs and a
marker file wrapped in a ar archive.  You're likely interested in
control.tar.?z* and data.tar.?z*.  Older packages used gzip-format
(.tar.gz), newer packages use xz-format (.tar.xz).

If you want to extract current Ubuntu kernel source on a different
distribution (or even an unrelated flavor of Unix), likely you would
want `ar p linux-source-5.4.deb data.tar.xz | unxz -c | tar -xf -`.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




* Re: Linux DomU freezes and dies under heavy memory shuffling
  2021-02-25  3:44               ` Elliott Mitchell
@ 2021-02-25  4:30                 ` Roman Shaposhnik
  2021-02-25  4:47                   ` Elliott Mitchell
  0 siblings, 1 reply; 14+ messages in thread
From: Roman Shaposhnik @ 2021-02-25  4:30 UTC (permalink / raw)
  To: Elliott Mitchell
  Cc: Jürgen Groß,
	Stefano Stabellini, Xen-devel, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap

On Wed, Feb 24, 2021 at 7:44 PM Elliott Mitchell <ehem+xen@m5p.com> wrote:
>
> On Wed, Feb 24, 2021 at 07:06:25PM -0800, Roman Shaposhnik wrote:
> > I'm slightly confused about this patch -- it seems to me that it needs
> > to be applied to the guest kernel, correct?
> >
> > If that's the case -- the challenge I have is that I need to re-build
> > the Canonical (Ubuntu) distro kernel with this patch -- this seems
> > a bit daunting at first (I mean -- I'm pretty good at rebuilding kernels
> > I just never do it with the vendor ones ;-)).
> >
> > So... if there's anyone here who has any suggestions on how to do that
> > -- I'd appreciate pointers.
>
> Generally Debian-derivatives ship the kernel source they use as packages
> named "linux-source-<major>.<minor>" (guessing you need
> linux-source-5.4?).  They ship their configurations as packages
> "linux-config-<major>.<minor>", but they also ship their configuration
> with their kernels as /boot/config-<version>.
>
> If you're trying to create a proper packaged kernel, the Linux kernel
> Make target "bindeb-pkg" will create an appropriate .deb file.

Right -- but that's not what distro builders use, right? I mean they do
the whole sdeb -> deb business.

In fact, to stay as faithful as possible -- I'd love to:
   1. unpack SDEB
   2. add a single patch to the set of sources
   3. repack SDEB back
   4. do whatever it is they do to go SDEB -> DEB

Thanks,
Roman.


* Re: Linux DomU freezes and dies under heavy memory shuffling
  2021-02-25  4:30                 ` Roman Shaposhnik
@ 2021-02-25  4:47                   ` Elliott Mitchell
  0 siblings, 0 replies; 14+ messages in thread
From: Elliott Mitchell @ 2021-02-25  4:47 UTC (permalink / raw)
  To: Roman Shaposhnik
  Cc: Jürgen Groß,
	Stefano Stabellini, Xen-devel, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap

On Wed, Feb 24, 2021 at 08:30:45PM -0800, Roman Shaposhnik wrote:
> Right -- but that's not what distro builders use, right? I mean they do
> the whole sdeb -> deb business.
> 
> In fact, to stay as faithful as possible -- I'd love to:
>    1. unpack SDEB
>    2. add a single patch to the set of sources
>    3. repack SDEB back
>    4. do whatever it is they do to go SDEB -> DEB

Oh, you want to stay that close to the original distribution package.  For
Debian-derivatives, install the package "dpkg-dev".

Generally the distribution will have a page somewhere where you can get
the files, but often it is handiest to run `apt-get source <package>` (I
believe `apt source <package>` also works, but I'm used to `apt-get`).
This will grab the tarballs for the source and unpack them.

Go into the unpacked directory and run `dpkg-buildpackage -b`
(optionally, patch first).  This creates the package in the starting
directory.

The tarballs left behind in the starting directory can be nuked or saved.
If saved, the build directory can be recreated by running
`dpkg-source -x <src-package-name>_<ver>.dsc`.  This lets you reset the
build directory to original state.
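
So for the four steps you listed, the whole round trip would look roughly
like this (untested sketch; assumes deb-src entries are enabled in
sources.list, and the patch file name is made up):

    apt-get source linux-image-$(uname -r)  # fetch + unpack the source package
    cd linux-*/                             # the unpacked build directory
    patch -p1 < ~/csd-lock-diag.patch       # apply the diagnostic patch
    dpkg-buildpackage -b -uc                # build unsigned binary packages
    dpkg-source -x ../linux_*.dsc           # later: reset to pristine source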


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




* Re: Linux DomU freezes and dies under heavy memory shuffling
  2021-02-25  3:06             ` Roman Shaposhnik
  2021-02-25  3:44               ` Elliott Mitchell
@ 2021-03-12 21:33               ` Roman Shaposhnik
  2021-03-13  7:18                 ` Jürgen Groß
  1 sibling, 1 reply; 14+ messages in thread
From: Roman Shaposhnik @ 2021-03-12 21:33 UTC (permalink / raw)
  To: Jürgen Groß
  Cc: Stefano Stabellini, Xen-devel, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap

Hi Jürgen,

just wanted to give you (and everyone who may be keeping an eye on
this) an update.

Somehow, after applying your kernel patch -- the VM is now running 10
days+ without a problem.

I'll keep experimenting (A/B-testing style) but at this point I'm
actually pretty perplexed as to why this patch would make a difference
(since it is basically just for observability). Any thoughts on that?

Thanks,
Roman.

On Wed, Feb 24, 2021 at 7:06 PM Roman Shaposhnik <roman@zededa.com> wrote:
>
> Hi Jürgen!
>
> sorry for the belated reply -- I wanted to externalize the VM before I
> do -- but let me at least reply to you:
>
> On Tue, Feb 23, 2021 at 5:17 AM Jürgen Groß <jgross@suse.com> wrote:
> >
> > On 18.02.21 06:21, Roman Shaposhnik wrote:
> > > On Wed, Feb 17, 2021 at 12:29 AM Jürgen Groß <jgross@suse.com
> > > <mailto:jgross@suse.com>> wrote:
> > >
> > >     On 17.02.21 09:12, Roman Shaposhnik wrote:
> > >      > Hi Jürgen, thanks for taking a look at this. A few comments below:
> > >      >
> > >      > On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß <jgross@suse.com
> > >     <mailto:jgross@suse.com>> wrote:
> > >      >>
> > >      >> On 16.02.21 21:34, Stefano Stabellini wrote:
> > >      >>> + x86 maintainers
> > >      >>>
> > >      >>> It looks like the tlbflush is getting stuck?
> > >      >>
> > >      >> I have seen this case multiple times on customer systems now, but
> > >      >> reproducing it reliably seems to be very hard.
> > >      >
> > >      > It is reliably reproducible under my workload but it takes a long time
> > >      > (~3 days of the workload running in the lab).
> > >
> > >     This is by far the best reproduction rate I have seen up to now.
> > >
> > >     The next best reproducer seems to be a huge installation with several
> > >     hundred hosts and thousands of VMs with about 1 crash each week.
> > >
> > >      >
> > >      >> I suspected fifo events to be blamed, but just yesterday I've been
> > >      >> informed of another case with fifo events disabled in the guest.
> > >      >>
> > >      >> One common pattern seems to be that up to now I have seen this
> > >     effect
> > >      >> only on systems with Intel Gold cpus. Can it be confirmed to be true
> > >      >> in this case, too?
> > >      >
> > >      > I am pretty sure mine isn't -- I can get you full CPU specs if
> > >     that's useful.
> > >
> > >     Just the output of "grep model /proc/cpuinfo" should be enough.
> > >
> > >
> > > processor: 3
> > > vendor_id: GenuineIntel
> > > cpu family: 6
> > > model: 77
> > > model name: Intel(R) Atom(TM) CPU  C2550  @ 2.40GHz
> > > stepping: 8
> > > microcode: 0x12d
> > > cpu MHz: 1200.070
> > > cache size: 1024 KB
> > > physical id: 0
> > > siblings: 4
> > > core id: 3
> > > cpu cores: 4
> > > apicid: 6
> > > initial apicid: 6
> > > fpu: yes
> > > fpu_exception: yes
> > > cpuid level: 11
> > > wp: yes
> > > flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
> > > pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp
> > > lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
> > > nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est
> > > tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer
> > > aes rdrand lahf_lm 3dnowprefetch cpuid_fault epb pti ibrs ibpb stibp
> > > tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida
> > > arat md_clear
> > > vmx flags: vnmi preemption_timer invvpid ept_x_only flexpriority
> > > tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
> > > bugs: cpu_meltdown spectre_v1 spectre_v2 mds msbds_only
> > > bogomips: 4800.19
> > > clflush size: 64
> > > cache_alignment: 64
> > > address sizes: 36 bits physical, 48 bits virtual
> > > power management:
> > >
> > >      >
> > >      >> In case anybody has a reproducer (either in a guest or dom0) with a
> > >      >> setup where a diagnostic kernel can be used, I'd be _very_
> > >     interested!
> > >      >
> > >      > I can easily add things to Dom0 and DomU. Whether that will
> > >     disrupt the
> > >      > experiment is, of course, another matter. Still please let me
> > >     know what
> > >      > would be helpful to do.
> > >
> > >     Is there a chance to switch to an upstream kernel in the guest? I'd like
> > >     to add some diagnostic code to the kernel and creating the patches will
> > >     be easier this way.
> > >
> > >
> > > That's a bit tough -- the VM is based on stock Ubuntu and if I upgrade
> > > the kernel I'll have to fiddle with a lot of things to make the workload
> > > functional again.
> > >
> > > However, I can install debug kernel (from Ubuntu, etc. etc.)
> > >
> > > Of course, if patching the kernel is the only way to make progress --
> > > let's try that -- please let me know.
> >
> > I have found a nice upstream patch, which - with some modifications - I
> > plan to give our customer as a workaround.
> >
> > The patch is for kernel 4.12, but chances are good it will apply to a
> > 4.15 kernel, too.
>
> I'm slightly confused about this patch -- it seems to me that it needs
> to be applied to the guest kernel, correct?
>
> If that's the case -- the challenge I have is that I need to re-build
> the Canonical (Ubuntu) distro kernel with this patch -- this seems
> a bit daunting at first (I mean -- I'm pretty good at rebuilding kernels
> I just never do it with the vendor ones ;-)).
>
> So... if there's anyone here who has any suggestions on how to do that
> -- I'd appreciate pointers.
>
> > I have been able to gather some more data.
> >
> > I have contacted the author of the upstream kernel patch I've been using
> > for our customer (and that helped, by the way).
> >
> > It seems as if the problem is occurring when running as a guest at least
> > under Xen, KVM, and VMware, and there have been reports of bare-metal
> > cases, too. Hunting this bug has been going on for several years now; the
> > patch author has been at it for 8 months.
> >
> > So we can rule out a Xen problem.
> >
> > Finding the root cause is still important, of course, and your setup
> > seems to have the best reproduction rate up to now.
> >
> > So any help would really be appreciated.
> >
> > Is the VM self contained? Would it be possible to start it e.g. on a
> > test system on my side? If yes, would you be allowed to pass it on to
> > me?
>
> I'm working on externalizing the VM in a way that doesn't disclose anything
> about the customer workload. I'm almost there -- sans my question about
> the vendor kernel rebuild. I plan to make that VM available this week.
>
> Goes without saying, but I would really appreciate your help in chasing this.
>
> Thanks,
> Roman.


* Re: Linux DomU freezes and dies under heavy memory shuffling
  2021-03-12 21:33               ` Roman Shaposhnik
@ 2021-03-13  7:18                 ` Jürgen Groß
  0 siblings, 0 replies; 14+ messages in thread
From: Jürgen Groß @ 2021-03-13  7:18 UTC (permalink / raw)
  To: Roman Shaposhnik
  Cc: Stefano Stabellini, Xen-devel, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap


On 12.03.21 22:33, Roman Shaposhnik wrote:
> Hi Jürgen,
> 
> just wanted to give you (and everyone who may be keeping an eye on
> this) an update.
> 
> Somehow, after applying your kernel patch -- the VM is now running 10
> days+ without a problem.

Can you check the kernel console messages, please? There are messages
printed when a potential hang is detected, and an attempt is made to wake
the hanging CPU up again via another interrupt.

Look for messages containing "csd", so e.g. do

dmesg | grep csd

in the VM.
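
Judging from the pr_alert() format strings in the patch, a hit would look
something like this (hypothetical values, just to show the shape):

  csd: Detected non-responsive CSD lock (#1) on CPU#0, waiting 5000001234 ns for CPU#01 flush_tlb_func_common+0x0/0x230(...).
  csd: CSD lock (#1) unresponsive.
  csd: Re-sending CSD lock (#1) IPI from CPU#00 to CPU#01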

Thanks,


Juergen

> 
> I'll keep experimenting (A/B-testing style) but at this point I'm
> actually pretty perplexed as to why this patch would make a difference
> (since it is basically just for observability). Any thoughts on that?
> 
> Thanks,
> Roman.
> 
> On Wed, Feb 24, 2021 at 7:06 PM Roman Shaposhnik <roman@zededa.com> wrote:
>>
>> Hi Jürgen!
>>
>> sorry for the belated reply -- I wanted to externalize the VM before
>> replying -- but let me at least reply to you now:
>>
>> On Tue, Feb 23, 2021 at 5:17 AM Jürgen Groß <jgross@suse.com> wrote:
>>>
>>> On 18.02.21 06:21, Roman Shaposhnik wrote:
>>>> On Wed, Feb 17, 2021 at 12:29 AM Jürgen Groß <jgross@suse.com
>>>> <mailto:jgross@suse.com>> wrote:
>>>>
>>>>      On 17.02.21 09:12, Roman Shaposhnik wrote:
>>>>       > Hi Jürgen, thanks for taking a look at this. A few comments below:
>>>>       >
>>>>       > On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß <jgross@suse.com
>>>>      <mailto:jgross@suse.com>> wrote:
>>>>       >>
>>>>       >> On 16.02.21 21:34, Stefano Stabellini wrote:
>>>>       >>> + x86 maintainers
>>>>       >>>
>>>>       >>> It looks like the tlbflush is getting stuck?
>>>>       >>
>>>>       >> I have seen this case multiple times on customer systems now, but
>>>>       >> reproducing it reliably seems to be very hard.
>>>>       >
>>>>       > It is reliably reproducible under my workload but it takes a long time
>>>>       > (~3 days of the workload running in the lab).
>>>>
>>>>      This is by far the best reproduction rate I have seen up to now.
>>>>
>>>>      The next best reproducer seems to be a huge installation with several
>>>>      hundred hosts and thousands of VMs with about 1 crash each week.
>>>>
>>>>       >
>>>>       >> I suspected fifo events to be blamed, but just yesterday I've been
>>>>       >> informed of another case with fifo events disabled in the guest.
>>>>       >>
>>>>       >> One common pattern seems to be that up to now I have seen this
>>>>      effect
>>>>       >> only on systems with Intel Gold cpus. Can it be confirmed to be true
>>>>       >> in this case, too?
>>>>       >
>>>>       > I am pretty sure mine isn't -- I can get you full CPU specs if
>>>>      that's useful.
>>>>
>>>>      Just the output of "grep model /proc/cpuinfo" should be enough.
>>>>
>>>>
>>>> processor: 3
>>>> vendor_id: GenuineIntel
>>>> cpu family: 6
>>>> model: 77
>>>> model name: Intel(R) Atom(TM) CPU  C2550  @ 2.40GHz
>>>> stepping: 8
>>>> microcode: 0x12d
>>>> cpu MHz: 1200.070
>>>> cache size: 1024 KB
>>>> physical id: 0
>>>> siblings: 4
>>>> core id: 3
>>>> cpu cores: 4
>>>> apicid: 6
>>>> initial apicid: 6
>>>> fpu: yes
>>>> fpu_exception: yes
>>>> cpuid level: 11
>>>> wp: yes
>>>> flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
>>>> pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp
>>>> lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
>>>> nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est
>>>> tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer
>>>> aes rdrand lahf_lm 3dnowprefetch cpuid_fault epb pti ibrs ibpb stibp
>>>> tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida
>>>> arat md_clear
>>>> vmx flags: vnmi preemption_timer invvpid ept_x_only flexpriority
>>>> tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
>>>> bugs: cpu_meltdown spectre_v1 spectre_v2 mds msbds_only
>>>> bogomips: 4800.19
>>>> clflush size: 64
>>>> cache_alignment: 64
>>>> address sizes: 36 bits physical, 48 bits virtual
>>>> power management:
>>>>
>>>>       >
>>>>       >> In case anybody has a reproducer (either in a guest or dom0) with a
>>>>       >> setup where a diagnostic kernel can be used, I'd be _very_
>>>>      interested!
>>>>       >
>>>>       > I can easily add things to Dom0 and DomU. Whether that will
>>>>      disrupt the
>>>>       > experiment is, of course, another matter. Still please let me
>>>>      know what
>>>>       > would be helpful to do.
>>>>
>>>>      Is there a chance to switch to an upstream kernel in the guest? I'd like
>>>>      to add some diagnostic code to the kernel and creating the patches will
>>>>      be easier this way.
>>>>
>>>>
>>>> That's a bit tough -- the VM is based on stock Ubuntu and if I upgrade
>>>> the kernel I'll have to fiddle with a lot of things to make the workload
>>>> functional again.
>>>>
>>>> However, I can install debug kernel (from Ubuntu, etc. etc.)
>>>>
>>>> Of course, if patching the kernel is the only way to make progress --
>>>> let's try that -- please let me know.
>>>
>>> I have found a nice upstream patch, which - with some modifications - I
>>> plan to give our customer as a workaround.
>>>
>>> The patch is for kernel 4.12, but chances are good it will apply to a
>>> 4.15 kernel, too.
>>
>> I'm slightly confused about this patch -- it seems to me that it needs
>> to be applied to the guest kernel, correct?
>>
>> If that's the case -- the challenge I have is that I need to re-build
>> the Canonical (Ubuntu) distro kernel with this patch -- this seems
>> a bit daunting at first (I mean -- I'm pretty good at rebuilding kernels
>> I just never do it with the vendor ones ;-)).
>>
>> So... if there's anyone here who has any suggestions on how to do that
>> -- I'd appreciate pointers.
>>
>>> I have been able to gather some more data.
>>>
>>> I have contacted the author of the upstream kernel patch I've been using
>>> for our customer (and that helped, by the way).
>>>
>>> It seems as if the problem is occurring when running as a guest at least
>>> under Xen, KVM, and VMware, and there have been reports of bare-metal
>>> cases, too. Hunting this bug has been going on for several years now; the
>>> patch author has been at it for 8 months.
>>>
>>> So we can rule out a Xen problem.
>>>
>>> Finding the root cause is still important, of course, and your setup
>>> seems to have the best reproduction rate up to now.
>>>
>>> So any help would really be appreciated.
>>>
>>> Is the VM self contained? Would it be possible to start it e.g. on a
>>> test system on my side? If yes, would you be allowed to pass it on to
>>> me?
>>
>> I'm working on externalizing the VM in a way that doesn't disclose anything
>> about the customer workload. I'm almost there -- sans my question about
>> the vendor kernel rebuild. I plan to make that VM available this week.
>>
>> Goes without saying, but I would really appreciate your help in chasing this.
>>
>> Thanks,
>> Roman.
> 


