* Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
@ 2018-12-21 17:54 Hans van Kranenburg
  2018-12-24  0:32 ` Hans van Kranenburg
  0 siblings, 1 reply; 8+ messages in thread
From: Hans van Kranenburg @ 2018-12-21 17:54 UTC (permalink / raw)
  To: xen-devel; +Cc: Igor Yurchenko

Hi,

We've been tracking down a live migration bug during the last three days
here at work, and here's what we found so far.

1. Xen version and dom0 linux kernel version don't matter.
2. DomU kernel is >= Linux 4.13.

When using live migrate to another dom0, this often happens:

[   37.511305] Freezing user space processes ... (elapsed 0.001 seconds)
done.
[   37.513316] OOM killer disabled.
[   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001
seconds) done.
[   37.514837] suspending xenstore...
[   37.515142] xen:grant_table: Grant tables using version 1 layout
[18446744002.593711] OOM killer enabled.
[18446744002.593726] Restarting tasks ... done.
[18446744002.604527] Setting capacity to 6291456

As a side effect, all open TCP connections stall, because the timestamp
counters of packets sent to the outside world are affected:

https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png

"The problem seems to occur after the domU is resumed. The first packet
(#90) has a wrong timestamp value (far in the past), marked red in the
image. Green is the normal sequence of timestamps from the server
(domU), acknowledged by the client. Once the client receives the packet
from the past, it starts re-sending everything from the start. As the
timestamp never reaches a normal value, the client keeps thinking that
the server has not received anything, and the loop goes on. They just
exist in different ages."
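
To illustrate why such a "packet from the past" stalls the connection,
here is a simplified sketch of the RFC 7323 PAWS-style timestamp check
(an illustration of the idea only, not the actual Linux implementation):
the receiver remembers the last timestamp it accepted and, using 32-bit
serial arithmetic, treats a segment whose TSval looks "older" as stale
and drops it, so nothing gets through until the sender's timestamps
catch up again.

/* Illustration only, not kernel code: RFC 7323 PAWS-style staleness check. */
#include <stdint.h>
#include <stdio.h>

static int tsval_is_stale(uint32_t tsval, uint32_t ts_recent)
{
    /* serial-number comparison: a negative difference means "older" */
    return (int32_t)(tsval - ts_recent) < 0;
}

int main(void)
{
    uint32_t ts_recent = 2000000;  /* last TSval accepted from the peer (ms) */

    /* after the sender's clock jumps back, its TSvals look ancient: */
    printf("stale: %d\n", tsval_is_stale(500, ts_recent));            /* 1 */
    /* a normally progressing timestamp is fine: */
    printf("stale: %d\n", tsval_is_stale(ts_recent + 10, ts_recent)); /* 0 */
    return 0;
}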

----------- >8 -----------

Ad 1. We reproduced this on different kinds of HP DL360 G7/G8/G9 gear,
both with Xen 4.11 / a Linux 4.19.9 dom0 kernel and with Xen 4.4 / a
Linux 3.16 dom0 kernel.

Ad 2. This was narrowed down by simply grabbing old Debian kernel images
from https://snapshot.debian.org/binary/?cat=l and trying them:

OK   linux-image-4.12.0-2-amd64_4.12.13-1_amd64.deb
FAIL linux-image-4.13.0-rc5-amd64_4.13~rc5-1~exp1_amd64.deb
FAIL linux-image-4.13.0-trunk-amd64_4.13.1-1~exp1_amd64.deb
FAIL linux-image-4.13.0-1-amd64_4.13.4-1_amd64.deb
FAIL linux-image-4.13.0-1-amd64_4.13.13-1_amd64.deb
FAIL linux-image-4.14.0-3-amd64_4.14.17-1_amd64.deb
FAIL linux-image-4.15.0-3-amd64_4.15.17-1_amd64.deb
FAIL linux-image-4.16.0-2-amd64_4.16.16-2_amd64.deb
FAIL ... everything up to 4.19.9 here

So, there seems to be a change introduced in 4.13 that makes this
behaviour appear. We haven't started compiling old kernels yet to be
able to bisect it further.

----------- >8 -----------

For the rest of the info, I'm focusing on a test environment for
reproduction, which consists of four identical HP DL360 G7 servers,
named sirius, gamma, omega and flopsy.

It's running the 4.11 packages from Debian, rebuilt for Stretch:
4.11.1~pre.20180911.5acdd26fdc+dfsg-5~bpo9+1
https://salsa.debian.org/xen-team/debian-xen/commits/stretch-backports

Dom0 kernel is 4.19.9 from Debian, rebuilt for Stretch:
https://salsa.debian.org/knorrie-guest/linux/commits/debian/4.19.9-1_mxbp9+1

xen_commandline : placeholder dom0_max_vcpus=1-4 dom0_mem=4G,max:4G
com2=115200,8n1 console=com2,vga noreboot xpti=no-dom0,domu smt=off

vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           X5675  @ 3.07GHz
stepping	: 2
microcode	: 0x1f
cpu MHz		: 3066.727

----------- >8 -----------

There are some interesting additional patterns:

1. Consistent success/failure paths.

After rebooting all 4 physical servers, starting a domU with a 4.19
kernel and then live migrating it, the first migration might fail or it
might succeed. However, from the first time a specific direction of
movement fails, it keeps failing every single time that combination is
used. The same goes for successful live migrations. E.g.:

sirius -> flopsy OK
sirius -> gamma OK
flopsy -> gamma OK
flopsy -> sirius OK
gamma -> flopsy FAIL
gamma -> sirius FAIL
omega -> flopsy FAIL

After rebooting all of the servers again and restarting the whole test
procedure, the combinations and results change, but they are again
consistent as soon as we start live migrating and seeing results.

2. TCP connections only hang when they were opened during a "timestamp
value in dmesg is low" period that is followed by a "time is 18
gazillion" situation. When a TCP connection to the domU is opened while
it's already at 18 gazillion seconds of uptime, the connection keeps
working across all subsequent live migrations, even when the time jumps
up and down, following the OK and FAIL paths.

3. Since this is related to time and clocks, the last thing today we
tried was, instead of using default settings, put "clocksource=tsc
tsc=stable:socket" on the xen command line and "clocksource=tsc" on the
domU linux kernel line. What we observed after doing this, is that the
failure happens less often, but still happens. Everything else applies.

----------- >8 -----------

Additional question:

It's 2018, should we have these "clocksource=tsc tsc=stable:socket" on
Xen and "clocksource=tsc" anyways now, for Xen 4.11 and Linux 4.19
domUs? All our hardware has 'TscInvariant = true'.

Related: https://news.ycombinator.com/item?id=13813079

----------- >8 -----------

I realize this problem might not be caused by Xen itself, but this list
is the most logical place to start asking for help.

Reproducing this in other environments should be pretty easy. 9 out of
10 times it already happens on the first live migration after the domU
is started.

We're available to test other stuff or provide more info if needed.

Thanks,
Hans

* Re: Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
  2018-12-21 17:54 Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls Hans van Kranenburg
@ 2018-12-24  0:32 ` Hans van Kranenburg
  2018-12-27 21:12   ` Hans van Kranenburg
  0 siblings, 1 reply; 8+ messages in thread
From: Hans van Kranenburg @ 2018-12-24  0:32 UTC (permalink / raw)
  To: Hans van Kranenburg, xen-devel; +Cc: Igor Yurchenko

Hi,

On 12/21/18 6:54 PM, Hans van Kranenburg wrote:
> Hi,
> 
> We've been tracking down a live migration bug during the last three days
> here at work, and here's what we found so far.
> 
> 1. Xen version and dom0 linux kernel version don't matter.
> 2. DomU kernel is >= Linux 4.13.
> 
> When using live migrate to another dom0, this often happens:
> 
> [   37.511305] Freezing user space processes ... (elapsed 0.001 seconds)
> done.
> [   37.513316] OOM killer disabled.
> [   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001
> seconds) done.
> [   37.514837] suspending xenstore...
> [   37.515142] xen:grant_table: Grant tables using version 1 layout
> [18446744002.593711] OOM killer enabled.
> [18446744002.593726] Restarting tasks ... done.
> [18446744002.604527] Setting capacity to 6291456

Tonight, I've been through 29 bisect steps to figure out a bit more. A
make defconfig with enabling Xen PV for domU reproduces the problem
already, so a complete cycle with compiling and testing had only to take
about 7 minutes.

So, it appears that this 18 gazillion seconds of uptime is a thing that
started happening earlier than the TCP situation already. All of the
test scenarios resulted in these huge uptime numbers in dmesg. Not all
of them result in TCP connections hanging.

> As a side effect, all open TCP connections stall, because the timestamp
> counters of packets sent to the outside world are affected:
> 
> https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png

This is happening since:

commit 9a568de4818dea9a05af141046bd3e589245ab83
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue May 16 14:00:14 2017 -0700

    tcp: switch TCP TS option (RFC 7323) to 1ms clock
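
If I understand the change correctly, the TSval sent in the timestamp
option is now derived from a high-resolution nanosecond clock divided
down to 1 ms resolution, instead of from jiffies, so a jump in that
underlying clock shows up directly in the timestamps the domU sends.
A simplified sketch of the idea (not the actual kernel code):

/* Simplified sketch, not kernel code: a 1 ms TCP timestamp derived from a
 * 64-bit nanosecond clock, truncated to 32 bits. A backwards jump in the
 * nanosecond clock makes the next TSval look far "older" than the last one. */
#include <stdint.h>
#include <stdio.h>

static uint32_t tsval_from_ns(uint64_t ns)
{
    return (uint32_t)(ns / 1000000);   /* 1 ms resolution, low 32 bits */
}

int main(void)
{
    uint64_t before = 1380ULL * 1000000000ULL;           /* ~23 min of clock */
    uint64_t after  = before - 1200ULL * 1000000000ULL;  /* clock jumped back */

    printf("TSval before jump: %u\n", tsval_from_ns(before)); /* 1380000 */
    printf("TSval after jump:  %u\n", tsval_from_ns(after));  /* 180000 */
    return 0;
}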

To find that commit, the first 13 bisect steps were needed just to
figure out that live migration was totally broken between...

commit bf22ff45bed664aefb5c4e43029057a199b7070c
Author: Jeffy Chen <jeffy.chen@rock-chips.com>
Date:   Mon Jun 26 19:33:34 2017 +0800

    genirq: Avoid unnecessary low level irq function calls

...and...

commit bb68cfe2f5a7f43058aed299fdbb73eb281734ed
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Mon Jul 31 22:07:09 2017 +0200

    x86/hpet: Cure interface abuse in the resume path

In between there are 12k+ commits. So, I restarted the bisect and, at
every step, either reverted the first commit or cherry-picked the fix,
so that I had a working test case every single time.

http://paste.debian.net/plainh/be91aabd

> [...]
> 
> 3. Since this is related to time and clocks, the last thing today we
> tried was, instead of using default settings, put "clocksource=tsc
> tsc=stable:socket" on the xen command line and "clocksource=tsc" on the
> domU linux kernel line. What we observed after doing this, is that the
> failure happens less often, but still happens. Everything else applies.

Actually, it seems that the important thing is that uptime of the dom0s
is not very close to each other. After rebooting all four back without
tsc options, and then a few hours later rebooting one of them again, I
could easily reproduce again when live migrating to the later rebooted
server.

> Additional question:
> 
> It's 2018, should we have these "clocksource=tsc tsc=stable:socket" on
> Xen and "clocksource=tsc" anyways now, for Xen 4.11 and Linux 4.19
> domUs? All our hardware has 'TscInvariant = true'.
> 
> Related: https://news.ycombinator.com/item?id=13813079

This is still interesting.

---- >8 ----

Now, the next question is... is 9a568de481 bad, or shouldn't there be 18
gazillion whatever uptime already... In Linux 4.9, this doesn't happen,
so next task will be to find out where that started.

to be continued...

Hans


* Re: Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
  2018-12-24  0:32 ` Hans van Kranenburg
@ 2018-12-27 21:12   ` Hans van Kranenburg
  2018-12-28 10:15     ` Juergen Gross
  0 siblings, 1 reply; 8+ messages in thread
From: Hans van Kranenburg @ 2018-12-27 21:12 UTC (permalink / raw)
  To: Hans van Kranenburg, xen-devel; +Cc: Igor Yurchenko

So,

On 12/24/18 1:32 AM, Hans van Kranenburg wrote:
> 
> On 12/21/18 6:54 PM, Hans van Kranenburg wrote:
>>
>> We've been tracking down a live migration bug during the last three days
>> here at work, and here's what we found so far.
>>
>> 1. Xen version and dom0 linux kernel version don't matter.
>> 2. DomU kernel is >= Linux 4.13.
>>
>> When using live migrate to another dom0, this often happens:
>>
>> [   37.511305] Freezing user space processes ... (elapsed 0.001 seconds)
>> done.
>> [   37.513316] OOM killer disabled.
>> [   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001
>> seconds) done.
>> [   37.514837] suspending xenstore...
>> [   37.515142] xen:grant_table: Grant tables using version 1 layout
>> [18446744002.593711] OOM killer enabled.
>> [18446744002.593726] Restarting tasks ... done.
>> [18446744002.604527] Setting capacity to 6291456
> 
> Tonight, I've been through 29 bisect steps to figure out a bit more. A
> make defconfig with enabling Xen PV for domU reproduces the problem
> already, so a complete cycle with compiling and testing had only to take
> about 7 minutes.
> 
> So, it appears that this 18 gazillion seconds of uptime is a thing that
> started happening earlier than the TCP situation already. All of the
> test scenarios resulted in these huge uptime numbers in dmesg. Not all
> of them result in TCP connections hanging.
> 
>> As a side effect, all open TCP connections stall, because the timestamp
>> counters of packets sent to the outside world are affected:
>>
>> https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png
> 
> This is happening since:
> 
> commit 9a568de4818dea9a05af141046bd3e589245ab83
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Tue May 16 14:00:14 2017 -0700
> 
>     tcp: switch TCP TS option (RFC 7323) to 1ms clock
> 
> [...]
> 
>> [...]
>>
>> 3. Since this is related to time and clocks, the last thing today we
>> tried was, instead of using default settings, put "clocksource=tsc
>> tsc=stable:socket" on the xen command line and "clocksource=tsc" on the
>> domU linux kernel line. What we observed after doing this, is that the
>> failure happens less often, but still happens. Everything else applies.
> 
> Actually, it seems that the important thing is that uptime of the dom0s
> is not very close to each other. After rebooting all four back without
> tsc options, and then a few hours later rebooting one of them again, I
> could easily reproduce again when live migrating to the later rebooted
> server.
> 
>> Additional question:
>>
>> It's 2018, should we have these "clocksource=tsc tsc=stable:socket" on
>> Xen and "clocksource=tsc" anyways now, for Xen 4.11 and Linux 4.19
>> domUs? All our hardware has 'TscInvariant = true'.
>>
>> Related: https://news.ycombinator.com/item?id=13813079
> 
> This is still interesting.
> 
> ---- >8 ----
> 
> Now, the next question is... is 9a568de481 bad, or shouldn't there be 18
> gazillion whatever uptime already... In Linux 4.9, this doesn't happen,
> so next task will be to find out where that started.

And that's...

commit f94c8d116997597fc00f0812b0ab9256e7b0c58f
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Mar 1 15:53:38 2017 +0100

    sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface

a.k.a. v4.11-rc2~30^2

Before this commit, time listed in dmesg seems to follow uptime of the
domU, and after it, time in dmesg seems to jump around up and down when
live migrating to different dom0s, with the occasional/frequent jump to
a number above 18000000000 which then also shows the TCP timestamp
breakage since 9a568de4.

So, next question is... what now? Any ideas appreciated.

Can anyone else reproduce this? I have super-common HP DL360 hardware
and mostly default settings, so it shouldn't be that hard.

Should I mail some other mailinglist with a question? Which one? Does
any of you Xen developers have more experience with time keeping code?

Regards,
Hans


* Re: Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
  2018-12-27 21:12   ` Hans van Kranenburg
@ 2018-12-28 10:15     ` Juergen Gross
  2018-12-28 14:41       ` Hans van Kranenburg
  0 siblings, 1 reply; 8+ messages in thread
From: Juergen Gross @ 2018-12-28 10:15 UTC (permalink / raw)
  To: Hans van Kranenburg, Hans van Kranenburg, xen-devel; +Cc: Igor Yurchenko

On 27/12/2018 22:12, Hans van Kranenburg wrote:
> So,
> 
> On 12/24/18 1:32 AM, Hans van Kranenburg wrote:
>>
>> On 12/21/18 6:54 PM, Hans van Kranenburg wrote:
>>>
>>> We've been tracking down a live migration bug during the last three days
>>> here at work, and here's what we found so far.
>>>
>>> 1. Xen version and dom0 linux kernel version don't matter.
>>> 2. DomU kernel is >= Linux 4.13.
>>>
>>> When using live migrate to another dom0, this often happens:
>>>
>>> [   37.511305] Freezing user space processes ... (elapsed 0.001 seconds)
>>> done.
>>> [   37.513316] OOM killer disabled.
>>> [   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001
>>> seconds) done.
>>> [   37.514837] suspending xenstore...
>>> [   37.515142] xen:grant_table: Grant tables using version 1 layout
>>> [18446744002.593711] OOM killer enabled.
>>> [18446744002.593726] Restarting tasks ... done.
>>> [18446744002.604527] Setting capacity to 6291456
>>
>> Tonight, I've been through 29 bisect steps to figure out a bit more. A
>> make defconfig with enabling Xen PV for domU reproduces the problem
>> already, so a complete cycle with compiling and testing had only to take
>> about 7 minutes.
>>
>> So, it appears that this 18 gazillion seconds of uptime is a thing that
>> started happening earlier than the TCP situation already. All of the
>> test scenarios resulted in these huge uptime numbers in dmesg. Not all
>> of them result in TCP connections hanging.
>>
>>> As a side effect, all open TCP connections stall, because the timestamp
>>> counters of packets sent to the outside world are affected:
>>>
>>> https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png
>>
>> This is happening since:
>>
>> commit 9a568de4818dea9a05af141046bd3e589245ab83
>> Author: Eric Dumazet <edumazet@google.com>
>> Date:   Tue May 16 14:00:14 2017 -0700
>>
>>     tcp: switch TCP TS option (RFC 7323) to 1ms clock
>>
>> [...]
>>
>>> [...]
>>>
>>> 3. Since this is related to time and clocks, the last thing today we
>>> tried was, instead of using default settings, put "clocksource=tsc
>>> tsc=stable:socket" on the xen command line and "clocksource=tsc" on the
>>> domU linux kernel line. What we observed after doing this, is that the
>>> failure happens less often, but still happens. Everything else applies.
>>
>> Actually, it seems that the important thing is that uptime of the dom0s
>> is not very close to each other. After rebooting all four back without
>> tsc options, and then a few hours later rebooting one of them again, I
>> could easily reproduce again when live migrating to the later rebooted
>> server.
>>
>>> Additional question:
>>>
>>> It's 2018, should we have these "clocksource=tsc tsc=stable:socket" on
>>> Xen and "clocksource=tsc" anyways now, for Xen 4.11 and Linux 4.19
>>> domUs? All our hardware has 'TscInvariant = true'.
>>>
>>> Related: https://news.ycombinator.com/item?id=13813079
>>
>> This is still interesting.
>>
>> ---- >8 ----
>>
>> Now, the next question is... is 9a568de481 bad, or shouldn't there be 18
>> gazillion whatever uptime already... In Linux 4.9, this doesn't happen,
>> so next task will be to find out where that started.
> 
> And that's...
> 
> commit f94c8d116997597fc00f0812b0ab9256e7b0c58f
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Wed Mar 1 15:53:38 2017 +0100
> 
>     sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface
> 
> a.k.a. v4.11-rc2~30^2
> 
> Before this commit, time listed in dmesg seems to follow uptime of the
> domU, and after it, time in dmesg seems to jump around up and down when
> live migrating to different dom0s, with the occasional/frequent jump to
> a number above 18000000000 which then also shows the TCP timestamp
> breakage since 9a568de4.
> 
> So, next question is... what now? Any ideas appreciated.
> 
> Can anyone else reproduce this? I have super-common HP DL360 hardware
> and mostly default settings, so it shouldn't be that hard.
> 
> Should I mail some other mailinglist with a question? Which one? Does
> any of you Xen developers have more experience with time keeping code?

My gut feeling tells me that above patch was neglecting Xen by setting
a non-native TSC clock too often to "stable" (the "only call
clear_sched_clock_stable() when we mark TSC unstable when we use
native_sched_clock()" part of the commit message).
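
To illustrate what I mean, here is a rough sketch, paraphrased from that
commit message rather than from the real diff (the helper and flag names
below are made up): before f94c8d11, finding the TSC unstable always
cleared the sched_clock "stable" flag; afterwards this only happens when
the native TSC sched_clock is actually in use, so a paravirtualized
sched_clock like the Xen one can stay flagged "stable" even though its
value jumps across a live migration.

/* Rough sketch only; names are made up and this is not the real diff. */
#include <stdbool.h>
#include <stdio.h>

static bool sched_clock_stable_flag = true;
static bool using_native_tsc_sched_clock = false;   /* false on Xen PV */

static void clear_sched_clock_stable(void) { sched_clock_stable_flag = false; }

static void mark_tsc_unstable_before_f94c8d11(void)
{
    clear_sched_clock_stable();                      /* unconditional */
}

static void mark_tsc_unstable_after_f94c8d11(void)
{
    if (using_native_tsc_sched_clock)                /* skipped on Xen PV */
        clear_sched_clock_stable();
}

int main(void)
{
    mark_tsc_unstable_after_f94c8d11();
    printf("still 'stable' on Xen PV: %d\n", sched_clock_stable_flag);  /* 1 */
    mark_tsc_unstable_before_f94c8d11();
    printf("old behaviour cleared it: %d\n", sched_clock_stable_flag);  /* 0 */
    return 0;
}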

I can have a more thorough look after Jan. 7th.


Juergen


* Re: Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
  2018-12-28 10:15     ` Juergen Gross
@ 2018-12-28 14:41       ` Hans van Kranenburg
  2019-01-07 12:04         ` Juergen Gross
  0 siblings, 1 reply; 8+ messages in thread
From: Hans van Kranenburg @ 2018-12-28 14:41 UTC (permalink / raw)
  To: Juergen Gross, Hans van Kranenburg, xen-devel; +Cc: Igor Yurchenko

On 12/28/18 11:15 AM, Juergen Gross wrote:
> On 27/12/2018 22:12, Hans van Kranenburg wrote:
>> So,
>>
>> On 12/24/18 1:32 AM, Hans van Kranenburg wrote:
>>>
>>> On 12/21/18 6:54 PM, Hans van Kranenburg wrote:
>>>>
>>>> We've been tracking down a live migration bug during the last three days
>>>> here at work, and here's what we found so far.
>>>>
>>>> 1. Xen version and dom0 linux kernel version don't matter.
>>>> 2. DomU kernel is >= Linux 4.13.
>>>>
>>>> When using live migrate to another dom0, this often happens:
>>>>
>>>> [   37.511305] Freezing user space processes ... (elapsed 0.001 seconds)
>>>> done.
>>>> [   37.513316] OOM killer disabled.
>>>> [   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001
>>>> seconds) done.
>>>> [   37.514837] suspending xenstore...
>>>> [   37.515142] xen:grant_table: Grant tables using version 1 layout
>>>> [18446744002.593711] OOM killer enabled.
>>>> [18446744002.593726] Restarting tasks ... done.
>>>> [18446744002.604527] Setting capacity to 6291456
>>>
>>> Tonight, I've been through 29 bisect steps to figure out a bit more. A
>>> make defconfig with enabling Xen PV for domU reproduces the problem
>>> already, so a complete cycle with compiling and testing had only to take
>>> about 7 minutes.
>>>
>>> So, it appears that this 18 gazillion seconds of uptime is a thing that
>>> started happening earlier than the TCP situation already. All of the
>>> test scenarios resulted in these huge uptime numbers in dmesg. Not all
>>> of them result in TCP connections hanging.
>>>
>>>> As a side effect, all open TCP connections stall, because the timestamp
>>>> counters of packets sent to the outside world are affected:
>>>>
>>>> https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png
>>>
>>> This is happening since:
>>>
>>> commit 9a568de4818dea9a05af141046bd3e589245ab83
>>> Author: Eric Dumazet <edumazet@google.com>
>>> Date:   Tue May 16 14:00:14 2017 -0700
>>>
>>>     tcp: switch TCP TS option (RFC 7323) to 1ms clock
>>>
>>> [...]
>>>
>>>> [...]
>>>>
>>>> 3. Since this is related to time and clocks, the last thing today we
>>>> tried was, instead of using default settings, put "clocksource=tsc
>>>> tsc=stable:socket" on the xen command line and "clocksource=tsc" on the
>>>> domU linux kernel line. What we observed after doing this, is that the
>>>> failure happens less often, but still happens. Everything else applies.
>>>
>>> Actually, it seems that the important thing is that uptime of the dom0s
>>> is not very close to each other. After rebooting all four back without
>>> tsc options, and then a few hours later rebooting one of them again, I
>>> could easily reproduce again when live migrating to the later rebooted
>>> server.
>>>
>>>> Additional question:
>>>>
>>>> It's 2018, should we have these "clocksource=tsc tsc=stable:socket" on
>>>> Xen and "clocksource=tsc" anyways now, for Xen 4.11 and Linux 4.19
>>>> domUs? All our hardware has 'TscInvariant = true'.
>>>>
>>>> Related: https://news.ycombinator.com/item?id=13813079
>>>
>>> This is still interesting.
>>>
>>> ---- >8 ----
>>>
>>> Now, the next question is... is 9a568de481 bad, or shouldn't there be 18
>>> gazillion whatever uptime already... In Linux 4.9, this doesn't happen,
>>> so next task will be to find out where that started.
>>
>> And that's...
>>
>> commit f94c8d116997597fc00f0812b0ab9256e7b0c58f
>> Author: Peter Zijlstra <peterz@infradead.org>
>> Date:   Wed Mar 1 15:53:38 2017 +0100
>>
>>     sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface
>>
>> a.k.a. v4.11-rc2~30^2
>>
>> Before this commit, time listed in dmesg seems to follow uptime of the
>> domU, and after it, time in dmesg seems to jump around up and down when
>> live migrating to different dom0s, with the occasional/frequent jump to
>> a number above 18000000000 which then also shows the TCP timestamp
>> breakage since 9a568de4.
>>
>> So, next question is... what now? Any ideas appreciated.
>>
>> Can anyone else reproduce this? I have super-common HP DL360 hardware
>> and mostly default settings, so it shouldn't be that hard.
>>
>> Should I mail some other mailinglist with a question? Which one? Does
>> any of you Xen developers have more experience with time keeping code?
> 
> My gut feeling tells me that above patch was neglecting Xen by setting
> a non-native TSC clock too often to "stable" (the "only call
> clear_sched_clock_stable() when we mark TSC unstable when we use
> native_sched_clock()" part of the commit message).
> 
> I can have a more thorough look after Jan. 7th.

Thanks in advance!

Some additional info:

I've just left a domU running after the initial live migrate:

[  171.727462] Freezing user space processes ... (elapsed 0.002 seconds)
done.
[  171.729825] OOM killer disabled.
[  171.729832] Freezing remaining freezable tasks ... (elapsed 0.001
seconds) done.
[  171.731439] suspending xenstore...
[  171.731672] xen:grant_table: Grant tables using version 1 layout
[18446742891.874140] OOM killer enabled.
[18446742891.874152] Restarting tasks ... done.
[18446742891.914103] Setting capacity to 6291456
[18446742934.549790] 14:13:50 up 3 min, 2 users, load average: 0.07,
0.02, 0.00
[18446742935.561404] 14:13:51 up 3 min, 2 users, load average: 0.07,
0.02, 0.00
[18446742936.572761] 14:13:52 up 3 min, 2 users, load average: 0.06,
0.02, 0.00
[18446742937.583537] 14:13:53 up 3 min, 2 users, load average: 0.06,
0.02, 0.00

I'm simply doing this:
while true; do echo $(uptime) > /dev/kmsg; sleep 10; done

Now, after a while, this happens:

[18446744050.202985] 14:32:26 up 22 min, 2 users, load average: 0.00,
0.00, 0.00
[18446744060.214576] 14:32:36 up 22 min, 2 users, load average: 0.00,
0.00, 0.00
[18446744070.225909] 14:32:46 up 22 min, 2 users, load average: 0.00,
0.00, 0.00
[    6.527718] 14:32:56 up 22 min, 2 users, load average: 0.00, 0.00, 0.00
[   16.539315] 14:33:06 up 22 min, 2 users, load average: 0.00, 0.00, 0.00
[   26.550511] 14:33:16 up 23 min, 2 users, load average: 0.00, 0.00, 0.00

The 23 minutes difference is exactly the difference in uptime between
the two dom0s involved for live migration:

source dom0: up 4 days, 19:23
destination dom0: up 4 days, 19:00

So that explains the 18446742891.874140 number, which just corresponds
to something near to 'minus 23 minutes'.
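
For reference, the arithmetic behind that number (a standalone
illustration, not kernel code): a nanosecond counter that has gone
slightly negative, reinterpreted as an unsigned 64-bit value and printed
dmesg-style as seconds.microseconds, gives exactly this kind of
18.4-billion-second figure. With exactly minus 23 minutes:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t ns_signed = -23LL * 60 * 1000000000;  /* -1380 s, in nanoseconds */
    uint64_t ns = (uint64_t)ns_signed;            /* wraps around 2^64 */

    /* dmesg-style [seconds.microseconds] */
    printf("[%" PRIu64 ".%06" PRIu64 "]\n",
           ns / 1000000000, (ns % 1000000000) / 1000);
    /* prints [18446742693.709551]: 2^64 ns minus 23 minutes, in seconds */
    return 0;
}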

Happy holidays,

Hans


* Re: Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
  2018-12-28 14:41       ` Hans van Kranenburg
@ 2019-01-07 12:04         ` Juergen Gross
  2019-01-07 12:56           ` Hans van Kranenburg
  0 siblings, 1 reply; 8+ messages in thread
From: Juergen Gross @ 2019-01-07 12:04 UTC (permalink / raw)
  To: Hans van Kranenburg, Hans van Kranenburg, xen-devel; +Cc: Igor Yurchenko

On 28/12/2018 15:41, Hans van Kranenburg wrote:
> On 12/28/18 11:15 AM, Juergen Gross wrote:
>> On 27/12/2018 22:12, Hans van Kranenburg wrote:
>>> So,
>>>
>>> On 12/24/18 1:32 AM, Hans van Kranenburg wrote:
>>>>
>>>> On 12/21/18 6:54 PM, Hans van Kranenburg wrote:
>>>>>
>>>>> We've been tracking down a live migration bug during the last three days
>>>>> here at work, and here's what we found so far.
>>>>>
>>>>> 1. Xen version and dom0 linux kernel version don't matter.
>>>>> 2. DomU kernel is >= Linux 4.13.
>>>>>
>>>>> When using live migrate to another dom0, this often happens:
>>>>>
>>>>> [   37.511305] Freezing user space processes ... (elapsed 0.001 seconds)
>>>>> done.
>>>>> [   37.513316] OOM killer disabled.
>>>>> [   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001
>>>>> seconds) done.
>>>>> [   37.514837] suspending xenstore...
>>>>> [   37.515142] xen:grant_table: Grant tables using version 1 layout
>>>>> [18446744002.593711] OOM killer enabled.
>>>>> [18446744002.593726] Restarting tasks ... done.
>>>>> [18446744002.604527] Setting capacity to 6291456
>>>>
>>>> Tonight, I've been through 29 bisect steps to figure out a bit more. A
>>>> make defconfig with enabling Xen PV for domU reproduces the problem
>>>> already, so a complete cycle with compiling and testing had only to take
>>>> about 7 minutes.
>>>>
>>>> So, it appears that this 18 gazillion seconds of uptime is a thing that
>>>> started happening earlier than the TCP situation already. All of the
>>>> test scenarios resulted in these huge uptime numbers in dmesg. Not all
>>>> of them result in TCP connections hanging.
>>>>
>>>>> As a side effect, all open TCP connections stall, because the timestamp
>>>>> counters of packets sent to the outside world are affected:
>>>>>
>>>>> https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png
>>>>
>>>> This is happening since:
>>>>
>>>> commit 9a568de4818dea9a05af141046bd3e589245ab83
>>>> Author: Eric Dumazet <edumazet@google.com>
>>>> Date:   Tue May 16 14:00:14 2017 -0700
>>>>
>>>>     tcp: switch TCP TS option (RFC 7323) to 1ms clock
>>>>
>>>> [...]
>>>>
>>>>> [...]
>>>>>
>>>>> 3. Since this is related to time and clocks, the last thing today we
>>>>> tried was, instead of using default settings, put "clocksource=tsc
>>>>> tsc=stable:socket" on the xen command line and "clocksource=tsc" on the
>>>>> domU linux kernel line. What we observed after doing this, is that the
>>>>> failure happens less often, but still happens. Everything else applies.
>>>>
>>>> Actually, it seems that the important thing is that uptime of the dom0s
>>>> is not very close to each other. After rebooting all four back without
>>>> tsc options, and then a few hours later rebooting one of them again, I
>>>> could easily reproduce again when live migrating to the later rebooted
>>>> server.
>>>>
>>>>> Additional question:
>>>>>
>>>>> It's 2018, should we have these "clocksource=tsc tsc=stable:socket" on
>>>>> Xen and "clocksource=tsc" anyways now, for Xen 4.11 and Linux 4.19
>>>>> domUs? All our hardware has 'TscInvariant = true'.
>>>>>
>>>>> Related: https://news.ycombinator.com/item?id=13813079
>>>>
>>>> This is still interesting.
>>>>
>>>> ---- >8 ----
>>>>
>>>> Now, the next question is... is 9a568de481 bad, or shouldn't there be 18
>>>> gazillion whatever uptime already... In Linux 4.9, this doesn't happen,
>>>> so next task will be to find out where that started.
>>>
>>> And that's...
>>>
>>> commit f94c8d116997597fc00f0812b0ab9256e7b0c58f
>>> Author: Peter Zijlstra <peterz@infradead.org>
>>> Date:   Wed Mar 1 15:53:38 2017 +0100
>>>
>>>     sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface
>>>
>>> a.k.a. v4.11-rc2~30^2
>>>
>>> Before this commit, time listed in dmesg seems to follow uptime of the
>>> domU, and after it, time in dmesg seems to jump around up and down when
>>> live migrating to different dom0s, with the occasional/frequent jump to
>>> a number above 18000000000 which then also shows the TCP timestamp
>>> breakage since 9a568de4.
>>>
>>> So, next question is... what now? Any ideas appreciated.
>>>
>>> Can anyone else reproduce this? I have super-common HP DL360 hardware
>>> and mostly default settings, so it shouldn't be that hard.
>>>
>>> Should I mail some other mailinglist with a question? Which one? Does
>>> any of you Xen developers have more experience with time keeping code?
>>
>> My gut feeling tells me that above patch was neglecting Xen by setting
>> a non-native TSC clock too often to "stable" (the "only call
>> clear_sched_clock_stable() when we mark TSC unstable when we use
>> native_sched_clock()" part of the commit message).
>>
>> I can have a more thorough look after Jan. 7th.
> 
> Thanks in advance!
> 
> Some additional info:
> 
> I've just left a domU running after the initial live migrate:
> 
> [  171.727462] Freezing user space processes ... (elapsed 0.002 seconds)
> done.
> [  171.729825] OOM killer disabled.
> [  171.729832] Freezing remaining freezable tasks ... (elapsed 0.001
> seconds) done.
> [  171.731439] suspending xenstore...
> [  171.731672] xen:grant_table: Grant tables using version 1 layout
> [18446742891.874140] OOM killer enabled.
> [18446742891.874152] Restarting tasks ... done.
> [18446742891.914103] Setting capacity to 6291456
> [18446742934.549790] 14:13:50 up 3 min, 2 users, load average: 0.07,
> 0.02, 0.00
> [18446742935.561404] 14:13:51 up 3 min, 2 users, load average: 0.07,
> 0.02, 0.00
> [18446742936.572761] 14:13:52 up 3 min, 2 users, load average: 0.06,
> 0.02, 0.00
> [18446742937.583537] 14:13:53 up 3 min, 2 users, load average: 0.06,
> 0.02, 0.00
> 
> I'm simply doing this:
> while true; do echo $(uptime) > /dev/kmsg; sleep 10; done
> 
> Now, after a while, this happens:
> 
> [18446744050.202985] 14:32:26 up 22 min, 2 users, load average: 0.00,
> 0.00, 0.00
> [18446744060.214576] 14:32:36 up 22 min, 2 users, load average: 0.00,
> 0.00, 0.00
> [18446744070.225909] 14:32:46 up 22 min, 2 users, load average: 0.00,
> 0.00, 0.00
> [    6.527718] 14:32:56 up 22 min, 2 users, load average: 0.00, 0.00, 0.00
> [   16.539315] 14:33:06 up 22 min, 2 users, load average: 0.00, 0.00, 0.00
> [   26.550511] 14:33:16 up 23 min, 2 users, load average: 0.00, 0.00, 0.00
> 
> The 23 minutes difference is exactly the difference in uptime between
> the two dom0s involved for live migration:
> 
> source dom0: up 4 days, 19:23
> destination dom0: up 4 days, 19:00
> 
> So that explains the 18446742891.874140 number, which just corresponds
> to something near to 'minus 23 minutes'.

I have a local reproducer for the issue now: instead of using live
migration I'm just doing an "xl save" after the guest has been running
for some minutes. Then I reboot the host and do an "xl restore" as soon
as possible.

Another note: HVM domains (and probably PVH, too) show the huge time
info ("[18446742937.583537] ..."), while PV domains seem to show just
a small jump backwards in time:

[  224.719316] Freezing user space processes ... (elapsed 0.001 seconds)
done.
[  224.720443] OOM killer disabled.
[  224.720448] Freezing remaining freezable tasks ... (elapsed 0.001
seconds) done.
[  224.721678] PM: freeze of devices complete after 0.107 msecs
[  224.721687] suspending xenstore...
[  224.721726] PM: late freeze of devices complete after 0.037 msecs
[  224.736062] PM: noirq freeze of devices complete after 14.325 msecs
[  224.736155] xen:grant_table: Grant tables using version 1 layout
[    4.404026] Suspended for 187.219 seconds

I'm now looking for a way to repair the issue.


Juergen


* Re: Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
  2019-01-07 12:04         ` Juergen Gross
@ 2019-01-07 12:56           ` Hans van Kranenburg
  2019-01-07 13:05             ` Juergen Gross
  0 siblings, 1 reply; 8+ messages in thread
From: Hans van Kranenburg @ 2019-01-07 12:56 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: Igor Yurchenko

On 1/7/19 1:04 PM, Juergen Gross wrote:
> On 28/12/2018 15:41, Hans van Kranenburg wrote:
>> On 12/28/18 11:15 AM, Juergen Gross wrote:
>>
>> [...]
>> So that explains the 18446742891.874140 number, which just corresponds
>> to something near to 'minus 23 minutes'.
> 
> I have a local reproducer for the issue now: instead of using live
> migration I'm just doing an "xl save" after the guest has been running
> for some minutes. Then I reboot the host and do an "xl restore" as soon
> as possible.
> 
> Another note: HVM domains (and probably PVH, too) show the huge time
> info ("[18446742937.583537] ..."), while PV domains seem to show just
> a small jump backwards in time:
> 
> [  224.719316] Freezing user space processes ... (elapsed 0.001 seconds)
> done.
> [  224.720443] OOM killer disabled.
> [  224.720448] Freezing remaining freezable tasks ... (elapsed 0.001
> seconds) done.
> [  224.721678] PM: freeze of devices complete after 0.107 msecs
> [  224.721687] suspending xenstore...
> [  224.721726] PM: late freeze of devices complete after 0.037 msecs
> [  224.736062] PM: noirq freeze of devices complete after 14.325 msecs
> [  224.736155] xen:grant_table: Grant tables using version 1 layout
> [    4.404026] Suspended for 187.219 seconds

And if you cause a time difference that lets it go down below zero?

I can just as easily reproduce this with PV, and I don't see much
difference in behavior with PVH. Actually, all the bisect steps to find
it were done using PV.

I haven't tried HVM, since I'm not using that at all.

Hans

* Re: Live migrate with Linux >= 4.13 domU causes kernel time jumps and TCP connection stalls.
  2019-01-07 12:56           ` Hans van Kranenburg
@ 2019-01-07 13:05             ` Juergen Gross
  0 siblings, 0 replies; 8+ messages in thread
From: Juergen Gross @ 2019-01-07 13:05 UTC (permalink / raw)
  To: Hans van Kranenburg, xen-devel; +Cc: Igor Yurchenko

On 07/01/2019 13:56, Hans van Kranenburg wrote:
> On 1/7/19 1:04 PM, Juergen Gross wrote:
>> On 28/12/2018 15:41, Hans van Kranenburg wrote:
>>> On 12/28/18 11:15 AM, Juergen Gross wrote:
>>>
>>> [...]
>>> So that explains the 18446742891.874140 number, which just corresponds
>>> to something near to 'minus 23 minutes'.
>>
>> I have a local reproducer for the issue now: instead of using live
>> migration I'm just doing an "xl save" after the guest has been running
>> for some minutes. Then I reboot the host and do an "xl restore" as soon
>> as possible.
>>
>> Another note: HVM domains (and probably PVH, too) show the huge time
>> info ("[18446742937.583537] ..."), while PV domains seem to show just
>> a small jump backwards in time:
>>
>> [  224.719316] Freezing user space processes ... (elapsed 0.001 seconds)
>> done.
>> [  224.720443] OOM killer disabled.
>> [  224.720448] Freezing remaining freezable tasks ... (elapsed 0.001
>> seconds) done.
>> [  224.721678] PM: freeze of devices complete after 0.107 msecs
>> [  224.721687] suspending xenstore...
>> [  224.721726] PM: late freeze of devices complete after 0.037 msecs
>> [  224.736062] PM: noirq freeze of devices complete after 14.325 msecs
>> [  224.736155] xen:grant_table: Grant tables using version 1 layout
>> [    4.404026] Suspended for 187.219 seconds
> 
> And if you cause a time difference that lets it go down below zero?

Oh yes, of course it will show the same symptoms. Silly me!


Juergen

