* Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
@ 2018-12-21 17:54 Hans van Kranenburg
  2018-12-24  0:32 ` Hans van Kranenburg
  0 siblings, 1 reply; 8+ messages in thread
From: Hans van Kranenburg @ 2018-12-21 17:54 UTC (permalink / raw)
  To: xen-devel; +Cc: Igor Yurchenko

Hi,

We've been tracking down a live migration bug during the last three days
here at work, and here's what we found so far.

1. Xen version and dom0 linux kernel version don't matter.
2. DomU kernel is >= Linux 4.13.

When using live migrate to another dom0, this often happens:

[   37.511305] Freezing user space processes ... (elapsed 0.001 seconds)
done.
[   37.513316] OOM killer disabled.
[   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001
seconds) done.
[   37.514837] suspending xenstore...
[   37.515142] xen:grant_table: Grant tables using version 1 layout
[18446744002.593711] OOM killer enabled.
[18446744002.593726] Restarting tasks ... done.
[18446744002.604527] Setting capacity to 6291456

As a side effect, all open TCP connections stall, because the timestamp
counters of packets sent to the outside world are affected:

https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png

"The problem seems to occur after the domU is resumed. The first packet
(#90) has a wrong timestamp value (far in the past), marked red in the
image. Green is the normal sequence of timestamps from the server
(domU), acknowledged by the client. Once the client receives the packet
from the past, it starts re-sending everything from the start. As the
timestamp never reaches a normal value, the client keeps thinking that
the server has not received anything, and the loop goes on. They just
exist in different ages."
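
To illustrate why such a "packet from the past" stalls the connection,
here is a simplified sketch of the RFC 7323 PAWS-style timestamp check
(an illustration of the idea only, not the actual Linux implementation):
the receiver remembers the last timestamp it accepted and, using 32-bit
serial arithmetic, treats a segment whose TSval looks "older" as stale
and drops it, so nothing gets through until the sender's timestamps
catch up again.

/* Illustration only, not kernel code: RFC 7323 PAWS-style staleness check. */
#include <stdint.h>
#include <stdio.h>

static int tsval_is_stale(uint32_t tsval, uint32_t ts_recent)
{
    /* serial-number comparison: a negative difference means "older" */
    return (int32_t)(tsval - ts_recent) < 0;
}

int main(void)
{
    uint32_t ts_recent = 2000000;  /* last TSval accepted from the peer (ms) */

    /* after the sender's clock jumps back, its TSvals look ancient: */
    printf("stale: %d\n", tsval_is_stale(500, ts_recent));            /* 1 */
    /* a normally progressing timestamp is fine: */
    printf("stale: %d\n", tsval_is_stale(ts_recent + 10, ts_recent)); /* 0 */
    return 0;
}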

----------- >8 -----------

Ad 1. We reproduced this on different kinds of HP DL360 G7/G8/G9 gear,
both with Xen 4.11 / a Linux 4.19.9 dom0 kernel and with Xen 4.4 / a
Linux 3.16 dom0 kernel.

Ad 2. This was narrowed down by simply grabbing old Debian kernel images
from https://snapshot.debian.org/binary/?cat=l and trying them:

OK   linux-image-4.12.0-2-amd64_4.12.13-1_amd64.deb
FAIL linux-image-4.13.0-rc5-amd64_4.13~rc5-1~exp1_amd64.deb
FAIL linux-image-4.13.0-trunk-amd64_4.13.1-1~exp1_amd64.deb
FAIL linux-image-4.13.0-1-amd64_4.13.4-1_amd64.deb
FAIL linux-image-4.13.0-1-amd64_4.13.13-1_amd64.deb
FAIL linux-image-4.14.0-3-amd64_4.14.17-1_amd64.deb
FAIL linux-image-4.15.0-3-amd64_4.15.17-1_amd64.deb
FAIL linux-image-4.16.0-2-amd64_4.16.16-2_amd64.deb
FAIL ... everything up to 4.19.9 here

So, there seems to be a change introduced in 4.13 that makes this
behaviour appear. We haven't started compiling old kernels yet to be
able to bisect it further.

----------- >8 -----------

For the rest of the info, I'm focusing on a test environment for
reproduction, which consists of four identical HP DL360 G7 servers,
named sirius, gamma, omega and flopsy.

It's running the 4.11 packages from Debian, rebuilt for Stretch:
4.11.1~pre.20180911.5acdd26fdc+dfsg-5~bpo9+1
https://salsa.debian.org/xen-team/debian-xen/commits/stretch-backports

Dom0 kernel is 4.19.9 from Debian, rebuilt for Stretch:
https://salsa.debian.org/knorrie-guest/linux/commits/debian/4.19.9-1_mxbp9+1

xen_commandline : placeholder dom0_max_vcpus=1-4 dom0_mem=4G,max:4G
com2=115200,8n1 console=com2,vga noreboot xpti=no-dom0,domu smt=off

vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           X5675  @ 3.07GHz
stepping	: 2
microcode	: 0x1f
cpu MHz		: 3066.727

----------- >8 -----------

There are some interesting additional patterns:

1. Consistent success/failure paths.

After rebooting all 4 physical servers, starting a domU with a 4.19
kernel and then live migrating it, the first migration might fail or it
might succeed. However, from the first time a specific direction of
movement fails, it keeps failing every single time that combination is
used. The same goes for successful live migrations. E.g.:

sirius -> flopsy OK
sirius -> gamma OK
flopsy -> gamma OK
flopsy -> sirius OK
gamma -> flopsy FAIL
gamma -> sirius FAIL
omega -> flopsy FAIL

After rebooting all of the servers again and restarting the whole test
procedure, the combinations and results change, but they are again
consistent as soon as we start live migrating and seeing results.

2. TCP connections only hang when they were opened during a "timestamp
value in dmesg is low" period that is followed by a "time is 18
gazillion" situation. When a TCP connection to the domU is opened while
it's already at 18 gazillion seconds of uptime, the connection keeps
working across all subsequent live migrations, even when the time jumps
up and down, following the OK and FAIL paths.

3. Since this is related to time and clocks, the last thing today we
tried was, instead of using default settings, put "clocksource=tsc
tsc=stable:socket" on the xen command line and "clocksource=tsc" on the
domU linux kernel line. What we observed after doing this, is that the
failure happens less often, but still happens. Everything else applies.

----------- >8 -----------

Additional question:

It's 2018, should we have these "clocksource=tsc tsc=stable:socket" on
Xen and "clocksource=tsc" anyways now, for Xen 4.11 and Linux 4.19
domUs? All our hardware has 'TscInvariant = true'.

Related: https://news.ycombinator.com/item?id=13813079

----------- >8 -----------

I realize this problem might not be caused by Xen itself, but this list
is the most logical place to start asking for help.

Reproducing this in other environments should be pretty easy. 9 out of
10 times it already happens on the first live migration after the domU
is started.

We're available to test other stuff or provide more info if needed.

Thanks,
Hans

* Re: Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
  2018-12-21 17:54 Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls Hans van Kranenburg
@ 2018-12-24  0:32 ` Hans van Kranenburg
  2018-12-27 21:12   ` Hans van Kranenburg
  0 siblings, 1 reply; 8+ messages in thread
From: Hans van Kranenburg @ 2018-12-24  0:32 UTC (permalink / raw)
  To: Hans van Kranenburg, xen-devel; +Cc: Igor Yurchenko

Hi,

On 12/21/18 6:54 PM, Hans van Kranenburg wrote:
> Hi,
> 
> We've been tracking down a live migration bug during the last three days
> here at work, and here's what we found so far.
> 
> 1. Xen version and dom0 linux kernel version don't matter.
> 2. DomU kernel is >= Linux 4.13.
> 
> When using live migrate to another dom0, this often happens:
> 
> [   37.511305] Freezing user space processes ... (elapsed 0.001 seconds)
> done.
> [   37.513316] OOM killer disabled.
> [   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001
> seconds) done.
> [   37.514837] suspending xenstore...
> [   37.515142] xen:grant_table: Grant tables using version 1 layout
> [18446744002.593711] OOM killer enabled.
> [18446744002.593726] Restarting tasks ... done.
> [18446744002.604527] Setting capacity to 6291456

Tonight, I've been through 29 bisect steps to figure out a bit more. A
make defconfig with enabling Xen PV for domU reproduces the problem
already, so a complete cycle with compiling and testing had only to take
about 7 minutes.

So, it appears that this 18 gazillion seconds of uptime is a thing that
started happening earlier than the TCP situation already. All of the
test scenarios resulted in these huge uptime numbers in dmesg. Not all
of them result in TCP connections hanging.

> As a side effect, all open TCP connections stall, because the timestamp
> counters of packets sent to the outside world are affected:
> 
> https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png

This is happening since:

commit 9a568de4818dea9a05af141046bd3e589245ab83
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue May 16 14:00:14 2017 -0700

    tcp: switch TCP TS option (RFC 7323) to 1ms clock
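
If I understand the change correctly, the TSval sent in the timestamp
option is now derived from a high-resolution nanosecond clock divided
down to 1 ms resolution, instead of from jiffies, so a jump in that
underlying clock shows up directly in the timestamps the domU sends.
A simplified sketch of the idea (not the actual kernel code):

/* Simplified sketch, not kernel code: a 1 ms TCP timestamp derived from a
 * 64-bit nanosecond clock, truncated to 32 bits. A backwards jump in the
 * nanosecond clock makes the next TSval look far "older" than the last one. */
#include <stdint.h>
#include <stdio.h>

static uint32_t tsval_from_ns(uint64_t ns)
{
    return (uint32_t)(ns / 1000000);   /* 1 ms resolution, low 32 bits */
}

int main(void)
{
    uint64_t before = 1380ULL * 1000000000ULL;           /* ~23 min of clock */
    uint64_t after  = before - 1200ULL * 1000000000ULL;  /* clock jumped back */

    printf("TSval before jump: %u\n", tsval_from_ns(before)); /* 1380000 */
    printf("TSval after jump:  %u\n", tsval_from_ns(after));  /* 180000 */
    return 0;
}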

To find that commit, the first 13 bisect steps were needed just to
figure out that live migration was totally broken between...

commit bf22ff45bed664aefb5c4e43029057a199b7070c
Author: Jeffy Chen <jeffy.chen@rock-chips.com>
Date:   Mon Jun 26 19:33:34 2017 +0800

    genirq: Avoid unnecessary low level irq function calls

...and...

commit bb68cfe2f5a7f43058aed299fdbb73eb281734ed
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Mon Jul 31 22:07:09 2017 +0200

    x86/hpet: Cure interface abuse in the resume path

In between there are 12k+ commits. So, I restarted the bisect and, at
every step, either reverted the first commit or cherry-picked the fix,
so that I had a working test case every single time.

http://paste.debian.net/plainh/be91aabd

> [...]
> 
> 3. Since this is related to time and clocks, the last thing today we
> tried was, instead of using default settings, put "clocksource=tsc
> tsc=stable:socket" on the xen command line and "clocksource=tsc" on the
> domU linux kernel line. What we observed after doing this, is that the
> failure happens less often, but still happens. Everything else applies.

Actually, it seems that the important thing is that uptime of the dom0s
is not very close to each other. After rebooting all four back without
tsc options, and then a few hours later rebooting one of them again, I
could easily reproduce again when live migrating to the later rebooted
server.

> Additional question:
> 
> It's 2018, should we have these "clocksource=tsc tsc=stable:socket" on
> Xen and "clocksource=tsc" anyways now, for Xen 4.11 and Linux 4.19
> domUs? All our hardware has 'TscInvariant = true'.
> 
> Related: https://news.ycombinator.com/item?id=13813079

This is still interesting.

---- >8 ----

Now, the next question is... is 9a568de481 bad, or shouldn't there be 18
gazillion whatever uptime already... In Linux 4.9, this doesn't happen,
so next task will be to find out where that started.

to be continued...

Hans


* Re: Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
  2018-12-24  0:32 ` Hans van Kranenburg
@ 2018-12-27 21:12   ` Hans van Kranenburg
  2018-12-28 10:15     ` Juergen Gross
  0 siblings, 1 reply; 8+ messages in thread
From: Hans van Kranenburg @ 2018-12-27 21:12 UTC (permalink / raw)
  To: Hans van Kranenburg, xen-devel; +Cc: Igor Yurchenko

So,

On 12/24/18 1:32 AM, Hans van Kranenburg wrote:
> 
> On 12/21/18 6:54 PM, Hans van Kranenburg wrote:
>>
>> We've been tracking down a live migration bug during the last three days
>> here at work, and here's what we found so far.
>>
>> 1. Xen version and dom0 linux kernel version don't matter.
>> 2. DomU kernel is >= Linux 4.13.
>>
>> When using live migrate to another dom0, this often happens:
>>
>> [   37.511305] Freezing user space processes ... (elapsed 0.001 seconds)
>> done.
>> [   37.513316] OOM killer disabled.
>> [   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001
>> seconds) done.
>> [   37.514837] suspending xenstore...
>> [   37.515142] xen:grant_table: Grant tables using version 1 layout
>> [18446744002.593711] OOM killer enabled.
>> [18446744002.593726] Restarting tasks ... done.
>> [18446744002.604527] Setting capacity to 6291456
> 
> Tonight, I've been through 29 bisect steps to figure out a bit more. A
> make defconfig with enabling Xen PV for domU reproduces the problem
> already, so a complete cycle with compiling and testing had only to take
> about 7 minutes.
> 
> So, it appears that this 18 gazillion seconds of uptime is a thing that
> started happening earlier than the TCP situation already. All of the
> test scenarios resulted in these huge uptime numbers in dmesg. Not all
> of them result in TCP connections hanging.
> 
>> As a side effect, all open TCP connections stall, because the timestamp
>> counters of packets sent to the outside world are affected:
>>
>> https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png
> 
> This is happening since:
> 
> commit 9a568de4818dea9a05af141046bd3e589245ab83
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Tue May 16 14:00:14 2017 -0700
> 
>     tcp: switch TCP TS option (RFC 7323) to 1ms clock
> 
> [...]
> 
>> [...]
>>
>> 3. Since this is related to time and clocks, the last thing today we
>> tried was, instead of using default settings, put "clocksource=tsc
>> tsc=stable:socket" on the xen command line and "clocksource=tsc" on the
>> domU linux kernel line. What we observed after doing this, is that the
>> failure happens less often, but still happens. Everything else applies.
> 
> Actually, it seems that the important thing is that uptime of the dom0s
> is not very close to each other. After rebooting all four back without
> tsc options, and then a few hours later rebooting one of them again, I
> could easily reproduce again when live migrating to the later rebooted
> server.
> 
>> Additional question:
>>
>> It's 2018, should we have these "clocksource=tsc tsc=stable:socket" on
>> Xen and "clocksource=tsc" anyways now, for Xen 4.11 and Linux 4.19
>> domUs? All our hardware has 'TscInvariant = true'.
>>
>> Related: https://news.ycombinator.com/item?id=13813079
> 
> This is still interesting.
> 
> ---- >8 ----
> 
> Now, the next question is... is 9a568de481 bad, or shouldn't there be 18
> gazillion whatever uptime already... In Linux 4.9, this doesn't happen,
> so next task will be to find out where that started.

And that's...

commit f94c8d116997597fc00f0812b0ab9256e7b0c58f
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Mar 1 15:53:38 2017 +0100

    sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface

a.k.a. v4.11-rc2~30^2

Before this commit, time listed in dmesg seems to follow uptime of the
domU, and after it, time in dmesg seems to jump around up and down when
live migrating to different dom0s, with the occasional/frequent jump to
a number above 18000000000 which then also shows the TCP timestamp
breakage since 9a568de4.

So, next question is... what now? Any ideas appreciated.

Can anyone else reproduce this? I have super-common HP DL360 hardware
and mostly default settings, so it shouldn't be that hard.

Should I mail some other mailinglist with a question? Which one? Does
any of you Xen developers have more experience with time keeping code?

Regards,
Hans


* Re: Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
  2018-12-27 21:12   ` Hans van Kranenburg
@ 2018-12-28 10:15     ` Juergen Gross
  2018-12-28 14:41       ` Hans van Kranenburg
  0 siblings, 1 reply; 8+ messages in thread
From: Juergen Gross @ 2018-12-28 10:15 UTC (permalink / raw)
  To: Hans van Kranenburg, Hans van Kranenburg, xen-devel; +Cc: Igor Yurchenko

On 27/12/2018 22:12, Hans van Kranenburg wrote:
> So,
> 
> On 12/24/18 1:32 AM, Hans van Kranenburg wrote:
>>
>> On 12/21/18 6:54 PM, Hans van Kranenburg wrote:
>>>
>>> We've been tracking down a live migration bug during the last three days
>>> here at work, and here's what we found so far.
>>>
>>> 1. Xen version and dom0 linux kernel version don't matter.
>>> 2. DomU kernel is >= Linux 4.13.
>>>
>>> When using live migrate to another dom0, this often happens:
>>>
>>> [   37.511305] Freezing user space processes ... (elapsed 0.001 seconds)
>>> done.
>>> [   37.513316] OOM killer disabled.
>>> [   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001
>>> seconds) done.
>>> [   37.514837] suspending xenstore...
>>> [   37.515142] xen:grant_table: Grant tables using version 1 layout
>>> [18446744002.593711] OOM killer enabled.
>>> [18446744002.593726] Restarting tasks ... done.
>>> [18446744002.604527] Setting capacity to 6291456
>>
>> Tonight, I've been through 29 bisect steps to figure out a bit more. A
>> make defconfig with enabling Xen PV for domU reproduces the problem
>> already, so a complete cycle with compiling and testing had only to take
>> about 7 minutes.
>>
>> So, it appears that this 18 gazillion seconds of uptime is a thing that
>> started happening earlier than the TCP situation already. All of the
>> test scenarios resulted in these huge uptime numbers in dmesg. Not all
>> of them result in TCP connections hanging.
>>
>>> As a side effect, all open TCP connections stall, because the timestamp
>>> counters of packets sent to the outside world are affected:
>>>
>>> https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png
>>
>> This is happening since:
>>
>> commit 9a568de4818dea9a05af141046bd3e589245ab83
>> Author: Eric Dumazet <edumazet@google.com>
>> Date:   Tue May 16 14:00:14 2017 -0700
>>
>>     tcp: switch TCP TS option (RFC 7323) to 1ms clock
>>
>> [...]
>>
>>> [...]
>>>
>>> 3. Since this is related to time and clocks, the last thing today we
>>> tried was, instead of using default settings, put "clocksource=tsc
>>> tsc=stable:socket" on the xen command line and "clocksource=tsc" on the
>>> domU linux kernel line. What we observed after doing this, is that the
>>> failure happens less often, but still happens. Everything else applies.
>>
>> Actually, it seems that the important thing is that uptime of the dom0s
>> is not very close to each other. After rebooting all four back without
>> tsc options, and then a few hours later rebooting one of them again, I
>> could easily reproduce again when live migrating to the later rebooted
>> server.
>>
>>> Additional question:
>>>
>>> It's 2018, should we have these "clocksource=tsc tsc=stable:socket" on
>>> Xen and "clocksource=tsc" anyways now, for Xen 4.11 and Linux 4.19
>>> domUs? All our hardware has 'TscInvariant = true'.
>>>
>>> Related: https://news.ycombinator.com/item?id=13813079
>>
>> This is still interesting.
>>
>> ---- >8 ----
>>
>> Now, the next question is... is 9a568de481 bad, or shouldn't there be 18
>> gazillion whatever uptime already... In Linux 4.9, this doesn't happen,
>> so next task will be to find out where that started.
> 
> And that's...
> 
> commit f94c8d116997597fc00f0812b0ab9256e7b0c58f
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Wed Mar 1 15:53:38 2017 +0100
> 
>     sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface
> 
> a.k.a. v4.11-rc2~30^2
> 
> Before this commit, time listed in dmesg seems to follow uptime of the
> domU, and after it, time in dmesg seems to jump around up and down when
> live migrating to different dom0s, with the occasional/frequent jump to
> a number above 18000000000 which then also shows the TCP timestamp
> breakage since 9a568de4.
> 
> So, next question is... what now? Any ideas appreciated.
> 
> Can anyone else reproduce this? I have super-common HP DL360 hardware
> and mostly default settings, so it shouldn't be that hard.
> 
> Should I mail some other mailinglist with a question? Which one? Does
> any of you Xen developers have more experience with time keeping code?

My gut feeling tells me that above patch was neglecting Xen by setting
a non-native TSC clock too often to "stable" (the "only call
clear_sched_clock_stable() when we mark TSC unstable when we use
native_sched_clock()" part of the commit message).
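
To illustrate what I mean, here is a rough sketch, paraphrased from that
commit message rather than from the real diff (the helper and flag names
below are made up): before f94c8d11, finding the TSC unstable always
cleared the sched_clock "stable" flag; afterwards this only happens when
the native TSC sched_clock is actually in use, so a paravirtualized
sched_clock like the Xen one can stay flagged "stable" even though its
value jumps across a live migration.

/* Rough sketch only; names are made up and this is not the real diff. */
#include <stdbool.h>
#include <stdio.h>

static bool sched_clock_stable_flag = true;
static bool using_native_tsc_sched_clock = false;   /* false on Xen PV */

static void clear_sched_clock_stable(void) { sched_clock_stable_flag = false; }

static void mark_tsc_unstable_before_f94c8d11(void)
{
    clear_sched_clock_stable();                      /* unconditional */
}

static void mark_tsc_unstable_after_f94c8d11(void)
{
    if (using_native_tsc_sched_clock)                /* skipped on Xen PV */
        clear_sched_clock_stable();
}

int main(void)
{
    mark_tsc_unstable_after_f94c8d11();
    printf("still 'stable' on Xen PV: %d\n", sched_clock_stable_flag);  /* 1 */
    mark_tsc_unstable_before_f94c8d11();
    printf("old behaviour cleared it: %d\n", sched_clock_stable_flag);  /* 0 */
    return 0;
}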

I can have a more thorough look after Jan. 7th.


Juergen


* Re: Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
  2018-12-28 10:15     ` Juergen Gross
@ 2018-12-28 14:41       ` Hans van Kranenburg
  2019-01-07 12:04         ` Juergen Gross
  0 siblings, 1 reply; 8+ messages in thread
From: Hans van Kranenburg @ 2018-12-28 14:41 UTC (permalink / raw)
  To: Juergen Gross, Hans van Kranenburg, xen-devel; +Cc: Igor Yurchenko

On 12/28/18 11:15 AM, Juergen Gross wrote:
> On 27/12/2018 22:12, Hans van Kranenburg wrote:
>> So,
>>
>> On 12/24/18 1:32 AM, Hans van Kranenburg wrote:
>>>
>>> On 12/21/18 6:54 PM, Hans van Kranenburg wrote:
>>>>
>>>> We've been tracking down a live migration bug during the last three days
>>>> here at work, and here's what we found so far.
>>>>
>>>> 1. Xen version and dom0 linux kernel version don't matter.
>>>> 2. DomU kernel is >= Linux 4.13.
>>>>
>>>> When using live migrate to another dom0, this often happens:
>>>>
>>>> [   37.511305] Freezing user space processes ... (elapsed 0.001 seconds)
>>>> done.
>>>> [   37.513316] OOM killer disabled.
>>>> [   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001
>>>> seconds) done.
>>>> [   37.514837] suspending xenstore...
>>>> [   37.515142] xen:grant_table: Grant tables using version 1 layout
>>>> [18446744002.593711] OOM killer enabled.
>>>> [18446744002.593726] Restarting tasks ... done.
>>>> [18446744002.604527] Setting capacity to 6291456
>>>
>>> Tonight, I've been through 29 bisect steps to figure out a bit more. A
>>> make defconfig with enabling Xen PV for domU reproduces the problem
>>> already, so a complete cycle with compiling and testing had only to take
>>> about 7 minutes.
>>>
>>> So, it appears that this 18 gazillion seconds of uptime is a thing that
>>> started happening earlier than the TCP situation already. All of the
>>> test scenarios resulted in these huge uptime numbers in dmesg. Not all
>>> of them result in TCP connections hanging.
>>>
>>>> As a side effect, all open TCP connections stall, because the timestamp
>>>> counters of packets sent to the outside world are affected:
>>>>
>>>> https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png
>>>
>>> This is happening since:
>>>
>>> commit 9a568de4818dea9a05af141046bd3e589245ab83
>>> Author: Eric Dumazet <edumazet@google.com>
>>> Date:   Tue May 16 14:00:14 2017 -0700
>>>
>>>     tcp: switch TCP TS option (RFC 7323) to 1ms clock
>>>
>>> [...]
>>>
>>>> [...]
>>>>
>>>> 3. Since this is related to time and clocks, the last thing today we
>>>> tried was, instead of using default settings, put "clocksource=tsc
>>>> tsc=stable:socket" on the xen command line and "clocksource=tsc" on the
>>>> domU linux kernel line. What we observed after doing this, is that the
>>>> failure happens less often, but still happens. Everything else applies.
>>>
>>> Actually, it seems that the important thing is that uptime of the dom0s
>>> is not very close to each other. After rebooting all four back without
>>> tsc options, and then a few hours later rebooting one of them again, I
>>> could easily reproduce again when live migrating to the later rebooted
>>> server.
>>>
>>>> Additional question:
>>>>
>>>> It's 2018, should we have these "clocksource=tsc tsc=stable:socket" on
>>>> Xen and "clocksource=tsc" anyways now, for Xen 4.11 and Linux 4.19
>>>> domUs? All our hardware has 'TscInvariant = true'.
>>>>
>>>> Related: https://news.ycombinator.com/item?id=13813079
>>>
>>> This is still interesting.
>>>
>>> ---- >8 ----
>>>
>>> Now, the next question is... is 9a568de481 bad, or shouldn't there be 18
>>> gazillion whatever uptime already... In Linux 4.9, this doesn't happen,
>>> so next task will be to find out where that started.
>>
>> And that's...
>>
>> commit f94c8d116997597fc00f0812b0ab9256e7b0c58f
>> Author: Peter Zijlstra <peterz@infradead.org>
>> Date:   Wed Mar 1 15:53:38 2017 +0100
>>
>>     sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface
>>
>> a.k.a. v4.11-rc2~30^2
>>
>> Before this commit, time listed in dmesg seems to follow uptime of the
>> domU, and after it, time in dmesg seems to jump around up and down when
>> live migrating to different dom0s, with the occasional/frequent jump to
>> a number above 18000000000 which then also shows the TCP timestamp
>> breakage since 9a568de4.
>>
>> So, next question is... what now? Any ideas appreciated.
>>
>> Can anyone else reproduce this? I have super-common HP DL360 hardware
>> and mostly default settings, so it shouldn't be that hard.
>>
>> Should I mail some other mailinglist with a question? Which one? Does
>> any of you Xen developers have more experience with time keeping code?
> 
> My gut feeling tells me that above patch was neglecting Xen by setting
> a non-native TSC clock too often to "stable" (the "only call
> clear_sched_clock_stable() when we mark TSC unstable when we use
> native_sched_clock()" part of the commit message).
> 
> I can have a more thorough look after Jan. 7th.

Thanks in advance!

Some additional info:

I've just left a domU running after the initial live migrate:

[  171.727462] Freezing user space processes ... (elapsed 0.002 seconds)
done.
[  171.729825] OOM killer disabled.
[  171.729832] Freezing remaining freezable tasks ... (elapsed 0.001
seconds) done.
[  171.731439] suspending xenstore...
[  171.731672] xen:grant_table: Grant tables using version 1 layout
[18446742891.874140] OOM killer enabled.
[18446742891.874152] Restarting tasks ... done.
[18446742891.914103] Setting capacity to 6291456
[18446742934.549790] 14:13:50 up 3 min, 2 users, load average: 0.07,
0.02, 0.00
[18446742935.561404] 14:13:51 up 3 min, 2 users, load average: 0.07,
0.02, 0.00
[18446742936.572761] 14:13:52 up 3 min, 2 users, load average: 0.06,
0.02, 0.00
[18446742937.583537] 14:13:53 up 3 min, 2 users, load average: 0.06,
0.02, 0.00

I'm simply doing this:
while true; do echo $(uptime) > /dev/kmsg; sleep 10; done

Now, after a while, this happens:

[18446744050.202985] 14:32:26 up 22 min, 2 users, load average: 0.00,
0.00, 0.00
[18446744060.214576] 14:32:36 up 22 min, 2 users, load average: 0.00,
0.00, 0.00
[18446744070.225909] 14:32:46 up 22 min, 2 users, load average: 0.00,
0.00, 0.00
[    6.527718] 14:32:56 up 22 min, 2 users, load average: 0.00, 0.00, 0.00
[   16.539315] 14:33:06 up 22 min, 2 users, load average: 0.00, 0.00, 0.00
[   26.550511] 14:33:16 up 23 min, 2 users, load average: 0.00, 0.00, 0.00

The 23 minutes difference is exactly the difference in uptime between
the two dom0s involved for live migration:

source dom0: up 4 days, 19:23
destination dom0: up 4 days, 19:00

So that explains the 18446742891.874140 number, which just corresponds
to something near to 'minus 23 minutes'.
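
For reference, the arithmetic behind that number (a standalone
illustration, not kernel code): a nanosecond counter that has gone
slightly negative, reinterpreted as an unsigned 64-bit value and printed
dmesg-style as seconds.microseconds, gives exactly this kind of
18.4-billion-second figure. With exactly minus 23 minutes:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t ns_signed = -23LL * 60 * 1000000000;  /* -1380 s, in nanoseconds */
    uint64_t ns = (uint64_t)ns_signed;            /* wraps around 2^64 */

    /* dmesg-style [seconds.microseconds] */
    printf("[%" PRIu64 ".%06" PRIu64 "]\n",
           ns / 1000000000, (ns % 1000000000) / 1000);
    /* prints [18446742693.709551]: 2^64 ns minus 23 minutes, in seconds */
    return 0;
}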

Happy holidays,

Hans


* Re: Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
  2018-12-28 14:41       ` Hans van Kranenburg
@ 2019-01-07 12:04         ` Juergen Gross
  2019-01-07 12:56           ` Hans van Kranenburg
  0 siblings, 1 reply; 8+ messages in thread
From: Juergen Gross @ 2019-01-07 12:04 UTC (permalink / raw)
  To: Hans van Kranenburg, Hans van Kranenburg, xen-devel; +Cc: Igor Yurchenko

On 28/12/2018 15:41, Hans van Kranenburg wrote:
> On 12/28/18 11:15 AM, Juergen Gross wrote:
>> On 27/12/2018 22:12, Hans van Kranenburg wrote:
>>> So,
>>>
>>> On 12/24/18 1:32 AM, Hans van Kranenburg wrote:
>>>>
>>>> On 12/21/18 6:54 PM, Hans van Kranenburg wrote:
>>>>>
>>>>> We've been tracking down a live migration bug during the last three days
>>>>> here at work, and here's what we found so far.
>>>>>
>>>>> 1. Xen version and dom0 linux kernel version don't matter.
>>>>> 2. DomU kernel is >= Linux 4.13.
>>>>>
>>>>> When using live migrate to another dom0, this often happens:
>>>>>
>>>>> [   37.511305] Freezing user space processes ... (elapsed 0.001 seconds)
>>>>> done.
>>>>> [   37.513316] OOM killer disabled.
>>>>> [   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001
>>>>> seconds) done.
>>>>> [   37.514837] suspending xenstore...
>>>>> [   37.515142] xen:grant_table: Grant tables using version 1 layout
>>>>> [18446744002.593711] OOM killer enabled.
>>>>> [18446744002.593726] Restarting tasks ... done.
>>>>> [18446744002.604527] Setting capacity to 6291456
>>>>
>>>> Tonight, I've been through 29 bisect steps to figure out a bit more. A
>>>> make defconfig with enabling Xen PV for domU reproduces the problem
>>>> already, so a complete cycle with compiling and testing had only to take
>>>> about 7 minutes.
>>>>
>>>> So, it appears that this 18 gazillion seconds of uptime is a thing that
>>>> started happening earlier than the TCP situation already. All of the
>>>> test scenarios resulted in these huge uptime numbers in dmesg. Not all
>>>> of them result in TCP connections hanging.
>>>>
>>>>> As a side effect, all open TCP connections stall, because the timestamp
>>>>> counters of packets sent to the outside world are affected:
>>>>>
>>>>> https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png
>>>>
>>>> This is happening since:
>>>>
>>>> commit 9a568de4818dea9a05af141046bd3e589245ab83
>>>> Author: Eric Dumazet <edumazet@google.com>
>>>> Date:   Tue May 16 14:00:14 2017 -0700
>>>>
>>>>     tcp: switch TCP TS option (RFC 7323) to 1ms clock
>>>>
>>>> [...]
>>>>
>>>>> [...]
>>>>>
>>>>> 3. Since this is related to time and clocks, the last thing today we
>>>>> tried was, instead of using default settings, put "clocksource=tsc
>>>>> tsc=stable:socket" on the xen command line and "clocksource=tsc" on the
>>>>> domU linux kernel line. What we observed after doing this, is that the
>>>>> failure happens less often, but still happens. Everything else applies.
>>>>
>>>> Actually, it seems that the important thing is that uptime of the dom0s
>>>> is not very close to each other. After rebooting all four back without
>>>> tsc options, and then a few hours later rebooting one of them again, I
>>>> could easily reproduce again when live migrating to the later rebooted
>>>> server.
>>>>
>>>>> Additional question:
>>>>>
>>>>> It's 2018, should we have these "clocksource=tsc tsc=stable:socket" on
>>>>> Xen and "clocksource=tsc" anyways now, for Xen 4.11 and Linux 4.19
>>>>> domUs? All our hardware has 'TscInvariant = true'.
>>>>>
>>>>> Related: https://news.ycombinator.com/item?id=13813079
>>>>
>>>> This is still interesting.
>>>>
>>>> ---- >8 ----
>>>>
>>>> Now, the next question is... is 9a568de481 bad, or shouldn't there be 18
>>>> gazillion whatever uptime already... In Linux 4.9, this doesn't happen,
>>>> so next task will be to find out where that started.
>>>
>>> And that's...
>>>
>>> commit f94c8d116997597fc00f0812b0ab9256e7b0c58f
>>> Author: Peter Zijlstra <peterz@infradead.org>
>>> Date:   Wed Mar 1 15:53:38 2017 +0100
>>>
>>>     sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface
>>>
>>> a.k.a. v4.11-rc2~30^2
>>>
>>> Before this commit, time listed in dmesg seems to follow uptime of the
>>> domU, and after it, time in dmesg seems to jump around up and down when
>>> live migrating to different dom0s, with the occasional/frequent jump to
>>> a number above 18000000000 which then also shows the TCP timestamp
>>> breakage since 9a568de4.
>>>
>>> So, next question is... what now? Any ideas appreciated.
>>>
>>> Can anyone else reproduce this? I have super-common HP DL360 hardware
>>> and mostly default settings, so it shouldn't be that hard.
>>>
>>> Should I mail some other mailinglist with a question? Which one? Does
>>> any of you Xen developers have more experience with time keeping code?
>>
>> My gut feeling tells me that above patch was neglecting Xen by setting
>> a non-native TSC clock too often to "stable" (the "only call
>> clear_sched_clock_stable() when we mark TSC unstable when we use
>> native_sched_clock()" part of the commit message).
>>
>> I can have a more thorough look after Jan. 7th.
> 
> Thanks in advance!
> 
> Some additional info:
> 
> I've just left a domU running after the initial live migrate:
> 
> [  171.727462] Freezing user space processes ... (elapsed 0.002 seconds)
> done.
> [  171.729825] OOM killer disabled.
> [  171.729832] Freezing remaining freezable tasks ... (elapsed 0.001
> seconds) done.
> [  171.731439] suspending xenstore...
> [  171.731672] xen:grant_table: Grant tables using version 1 layout
> [18446742891.874140] OOM killer enabled.
> [18446742891.874152] Restarting tasks ... done.
> [18446742891.914103] Setting capacity to 6291456
> [18446742934.549790] 14:13:50 up 3 min, 2 users, load average: 0.07,
> 0.02, 0.00
> [18446742935.561404] 14:13:51 up 3 min, 2 users, load average: 0.07,
> 0.02, 0.00
> [18446742936.572761] 14:13:52 up 3 min, 2 users, load average: 0.06,
> 0.02, 0.00
> [18446742937.583537] 14:13:53 up 3 min, 2 users, load average: 0.06,
> 0.02, 0.00
> 
> I'm simply doing this:
> while true; do echo $(uptime) > /dev/kmsg; sleep 10; done
> 
> Now, after a while, this happens:
> 
> [18446744050.202985] 14:32:26 up 22 min, 2 users, load average: 0.00,
> 0.00, 0.00
> [18446744060.214576] 14:32:36 up 22 min, 2 users, load average: 0.00,
> 0.00, 0.00
> [18446744070.225909] 14:32:46 up 22 min, 2 users, load average: 0.00,
> 0.00, 0.00
> [    6.527718] 14:32:56 up 22 min, 2 users, load average: 0.00, 0.00, 0.00
> [   16.539315] 14:33:06 up 22 min, 2 users, load average: 0.00, 0.00, 0.00
> [   26.550511] 14:33:16 up 23 min, 2 users, load average: 0.00, 0.00, 0.00
> 
> The 23 minutes difference is exactly the difference in uptime between
> the two dom0s involved for live migration:
> 
> source dom0: up 4 days, 19:23
> destination dom0: up 4 days, 19:00
> 
> So that explains the 18446742891.874140 number, which just corresponds
> to something near to 'minus 23 minutes'.

I have a local reproducer for the issue now: instead of using live
migration I'm just doing an "xl save" after the guest has been running
for some minutes. Then I reboot the host and do an "xl restore" as soon
as possible.

Another note: HVM domains (and probably PVH, too) show the huge time
info ("[18446742937.583537] ..."), while PV domains seem to show just
a small jump backwards in time:

[  224.719316] Freezing user space processes ... (elapsed 0.001 seconds)
done.
[  224.720443] OOM killer disabled.
[  224.720448] Freezing remaining freezable tasks ... (elapsed 0.001
seconds) done.
[  224.721678] PM: freeze of devices complete after 0.107 msecs
[  224.721687] suspending xenstore...
[  224.721726] PM: late freeze of devices complete after 0.037 msecs
[  224.736062] PM: noirq freeze of devices complete after 14.325 msecs
[  224.736155] xen:grant_table: Grant tables using version 1 layout
[    4.404026] Suspended for 187.219 seconds

I'm now looking for a way to repair the issue.


Juergen


* Re: Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
  2019-01-07 12:04         ` Juergen Gross
@ 2019-01-07 12:56           ` Hans van Kranenburg
  2019-01-07 13:05             ` Juergen Gross
  0 siblings, 1 reply; 8+ messages in thread
From: Hans van Kranenburg @ 2019-01-07 12:56 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: Igor Yurchenko

On 1/7/19 1:04 PM, Juergen Gross wrote:
> On 28/12/2018 15:41, Hans van Kranenburg wrote:
>> On 12/28/18 11:15 AM, Juergen Gross wrote:
>>
>> [...]
>> So that explains the 18446742891.874140 number, which just corresponds
>> to something near to 'minus 23 minutes'.
> 
> I have a local reproducer for the issue now: instead of using live
> migration I'm just doing an "xl save" after the guest has been running
> for some minutes. Then I reboot the host and do an "xl restore" as soon
> as possible.
> 
> Another note: HVM domains (and probably PVH, too) show the huge time
> info ("[18446742937.583537] ..."), while PV domains seem to show just
> a small jump backwards in time:
> 
> [  224.719316] Freezing user space processes ... (elapsed 0.001 seconds)
> done.
> [  224.720443] OOM killer disabled.
> [  224.720448] Freezing remaining freezable tasks ... (elapsed 0.001
> seconds) done.
> [  224.721678] PM: freeze of devices complete after 0.107 msecs
> [  224.721687] suspending xenstore...
> [  224.721726] PM: late freeze of devices complete after 0.037 msecs
> [  224.736062] PM: noirq freeze of devices complete after 14.325 msecs
> [  224.736155] xen:grant_table: Grant tables using version 1 layout
> [    4.404026] Suspended for 187.219 seconds

And if you cause a time difference that lets it go down below zero?

I can just as easily reproduce this with PV, and I don't see much
difference in behavior with PVH. Actually, all the bisect steps to find
it were done using PV.

I haven't tried HVM, since I'm not using that at all.

Hans

* Re: Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
  2019-01-07 12:56           ` Hans van Kranenburg
@ 2019-01-07 13:05             ` Juergen Gross
  0 siblings, 0 replies; 8+ messages in thread
From: Juergen Gross @ 2019-01-07 13:05 UTC (permalink / raw)
  To: Hans van Kranenburg, xen-devel; +Cc: Igor Yurchenko

On 07/01/2019 13:56, Hans van Kranenburg wrote:
> On 1/7/19 1:04 PM, Juergen Gross wrote:
>> On 28/12/2018 15:41, Hans van Kranenburg wrote:
>>> On 12/28/18 11:15 AM, Juergen Gross wrote:
>>>
>>> [...]
>>> So that explains the 18446742891.874140 number, which just corresponds
>>> to something near to 'minus 23 minutes'.
>>
>> I have a local reproducer for the issue now: instead of using live
>> migration I'm just doing an "xl save" after the guest has been running
>> for some minutes. Then I reboot the host and do an "xl restore" as soon
>> as possible.
>>
>> Another note: HVM domains (and probably PVH, too) show the huge time
>> info ("[18446742937.583537] ..."), while PV domains seem to show just
>> a small jump backwards in time:
>>
>> [  224.719316] Freezing user space processes ... (elapsed 0.001 seconds)
>> done.
>> [  224.720443] OOM killer disabled.
>> [  224.720448] Freezing remaining freezable tasks ... (elapsed 0.001
>> seconds) done.
>> [  224.721678] PM: freeze of devices complete after 0.107 msecs
>> [  224.721687] suspending xenstore...
>> [  224.721726] PM: late freeze of devices complete after 0.037 msecs
>> [  224.736062] PM: noirq freeze of devices complete after 14.325 msecs
>> [  224.736155] xen:grant_table: Grant tables using version 1 layout
>> [    4.404026] Suspended for 187.219 seconds
> 
> And if you cause a time difference that lets it go down below zero?

Oh yes, of course it will show the same symptoms. Silly me!


Juergen

