* Re: 4.10.1 Xen crash and reboot
@ 2019-01-07 11:12 Patrick Beckmann
  2019-02-11 10:41 ` Patrick Beckmann
  0 siblings, 1 reply; 11+ messages in thread
From: Patrick Beckmann @ 2019-01-07 11:12 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 854 bytes --]

Hi,

I just joined this list and am referring to
  https://lists.xenproject.org/archives/html/xen-devel/2018-12/msg00938.html

We have experienced several crashes of a recent Debian 9 Dom0 on new
hardware with Xen version "4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10".
After reporting it as Debian bug #912975
(https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=912975) it was
pointed out that this appears to be the same error, and I was asked to
join the discussion here.

Unfortunately, we are unable to reliably reproduce the behaviour.
Currently we are running the Dom0 with Xen version
"4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9" to test the assumption that a
bug was introduced between these versions. We have not yet tried setting
pcid=0. Please let me know if you think that would be a more reasonable
test for now.
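
For reference, on a Debian dom0 the option would normally go onto the
Xen command line via GRUB. A minimal sketch, assuming a stock Debian
GRUB layout (file names and existing options may differ locally):

  # e.g. in /etc/default/grub (or /etc/default/grub.d/xen.cfg), append
  # pcid=0 to the hypervisor options:
  #   GRUB_CMDLINE_XEN_DEFAULT="... pcid=0"
  update-grub          # regenerate the boot configuration
  # after the next reboot, confirm the option is active:
  xl info | grep xen_commandline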

Best Regards,
Patrick Beckmann

[-- Attachment #2: crash1.txt --]
[-- Type: text/plain, Size: 7312 bytes --]

[SOL Session operational.  Use ~? for help]
[   99.992731] xen-blkback: backend/vbd/19/51712: prepare for reconnect
[  101.634684] xen-blkback: backend/vbd/20/51712: prepare for reconnect
[  103.653671] xen-blkback: backend/vbd/19/51712: using 4 queues, protocol 1 (x86_64-abi) persistent grants
[  103.827314] vif vif-19-0 vif19.0: Guest Rx ready
[  103.827427] IPv6: ADDRCONF(NETDEV_CHANGE): vif19.0: link becomes ready
[  103.827534] br02: port 15(vif19.0) entered blocking state
[  103.827541] br02: port 15(vif19.0) entered forwarding state
[  104.476998] xen-blkback: backend/vbd/20/51712: using 4 queues, protocol 1 (x86_64-abi) persistent grants
[  104.660889] vif vif-20-0 vif20.0: Guest Rx ready
[  104.661018] IPv6: ADDRCONF(NETDEV_CHANGE): vif20.0: link becomes ready
[  104.661168] br026: port 2(vif20.0) entered blocking state
[  104.661184] br026: port 2(vif20.0) entered forwarding state
(XEN) d8 L1TF-vulnerable L1e 0000000001a23320 - Shadowing
(XEN) d8 L1TF-vulnerable L1e 0000000001a23320 - Shadowing
(XEN) d8 L1TF-vulnerable L1e 0000000001a23320 - Shadowing
(XEN) d11 L1TF-vulnerable L1e 00000000020c3320 - Shadowing
(XEN) d13 L1TF-vulnerable L1e 0000000001a3b320 - Shadowing
(XEN) d15 L1TF-vulnerable L1e 0000000001a23320 - Shadowing

Debian GNU/Linux 9 caribou hvc0

caribou login: 
Debian GNU/Linux 9 caribou hvc0

caribou login: [ 4676.600094] br02: port 14(vif17.0) entered disabled state
[ 4676.744064] br02: port 14(vif17.0) entered disabled state
[ 4676.745573] device vif17.0 left promiscuous mode
[ 4676.745618] br02: port 14(vif17.0) entered disabled state
[ 4683.146619] br02: port 14(vif21.0) entered blocking state
[ 4683.146678] br02: port 14(vif21.0) entered disabled state
[ 4683.146921] device vif21.0 entered promiscuous mode
[ 4683.153997] IPv6: ADDRCONF(NETDEV_UP): vif21.0: link is not ready
[ 4683.639331] xen-blkback: backend/vbd/21/51712: using 1 queues, protocol 1 (x86_64-abi) 
[ 4684.544484] xen-blkback: backend/vbd/21/51712: prepare for reconnect
[ 4684.938636] xen-blkback: backend/vbd/21/51712: using 1 queues, protocol 1 (x86_64-abi) 
[ 4692.235692] xen-blkback: backend/vbd/21/51712: prepare for reconnect
[ 4694.917436] vif vif-21-0 vif21.0: Guest Rx ready
[ 4694.917800] IPv6: ADDRCONF(NETDEV_CHANGE): vif21.0: link becomes ready
[ 4694.917918] br02: port 14(vif21.0) entered blocking state
[ 4694.917926] br02: port 14(vif21.0) entered forwarding state
[ 4694.921344] xen-blkback: backend/vbd/21/51712: using 2 queues, protocol 1 (x86_64-abi) persistent grants

Debian GNU/Linux 9 caribou hvc0

caribou login: (XEN) ----[ Xen-4.8.5-pre  x86_64  debug=n   Not tainted ]----
(XEN) CPU:    32
(XEN) RIP:    e008:[<ffff82d08023116d>] guest_4.o#sh_page_fault__guest_4+0x75d/0x1e30
(XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d8v0)
(XEN) rax: 00007fb5797e6580   rbx: ffff8310f4372000   rcx: ffff81c0e0600000
(XEN) rdx: 0000000000000000   rsi: ffff8310f4372000   rdi: 000000000001fed5
(XEN) rbp: ffff8310f4372000   rsp: ffff8340250e7c78   r8:  000000000001fed5
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: ffff81c0e06ff6a8   r13: 000000000407fad6   r14: ffff830078da7000
(XEN) r15: ffff8340250e7ef8   cr0: 0000000080050033   cr4: 0000000000372660
(XEN) cr3: 000000407ec02001   cr2: ffff81c0e06ff6a8
(XEN) fsb: 00007fb58fc26700   gsb: 0000000000000000   gss: ffff8801fea00000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen code around <ffff82d08023116d> (guest_4.o#sh_page_fault__guest_4+0x75d/0x1e30):
(XEN)  ff ff 03 00 4e 8d 24 c1 <49> 8b 0c 24 f6 c1 01 0f 84 b6 06 00 00 48 c1 e1
(XEN) Xen stack trace from rsp=ffff8340250e7c78:
(XEN)    00007fb5797e6580 00000000027372df ffff82d080323600 ffff8310f4372648
(XEN)    ffff8310f43726a8 00000000027372df ffff8340250e7d50 ffff8340250e7d98
(XEN)    00000007fb5797e6 0000000000000090 ffff82d080323618 00000000000007f8
(XEN)    00000000000006a8 0000000000000e58 0000000000000f30 ffff82d000000000
(XEN)    000000000000000d 0000005100000002 00000000000001e6 ffff8340250e7d20
(XEN)    00000000000000e0 0000000000000000 000000000277f512 ffff830078da7000
(XEN)    0000000000000001 ffff830078da7bc0 00000000020dd93d 00007fb5797e6580
(XEN)    0000002700075067 000000280ae61067 000000280ca6f067 00000027372df967
(XEN)    000000000267c9a0 0000000002700075 000000000280ae61 000000000280ca6f
(XEN)    000000407faf7067 ffff830078da7000 ffff8310f4372000 ffff8340250e7ef8
(XEN)    ffff82d08023a910 0000000000000000 000000005c2d2f4a ffff82d08023a780
(XEN)    ffff8310f4372000 ffff8340250e7fff ffff830078da7000 ffff82d08023aa0f
(XEN)    ffff82d08023f913 ffff82d08023f907 ffff82d08023f913 ffff82d08023f907
(XEN)    ffff82d08023f913 ffff82d08023f907 ffff82d08023f913 ffff82d08023f907
(XEN)    ffff82d08023f913 ffff82d08023f907 ffff82d08023f913 ffff82d08023f907
(XEN)    ffff82d08023f913 ffff82d08023f907 ffff82d08023f913 ffff8340250e7ef8
(XEN)    00007fb5797e6580 ffff830078da7000 0000000000000014 ffff8310f4372000
(XEN)    0000000000000000 ffff82d08019f5a2 ffff82d08023f913 ffff82d08023f907
(XEN)    ffff82d08023f913 ffff830078da7000 0000000000000000 0000000000000000
(XEN)    0000000000000000 ffff8340250e7fff 0000000000000000 ffff82d08023f9d9
(XEN) Xen call trace:
(XEN)    [<ffff82d08023116d>] guest_4.o#sh_page_fault__guest_4+0x75d/0x1e30
(XEN)    [<ffff82d08023a910>] do_iret+0/0x1c0
(XEN)    [<ffff82d08023a780>] toggle_guest_pt+0x30/0x160
(XEN)    [<ffff82d08023aa0f>] do_iret+0xff/0x1c0
(XEN)    [<ffff82d08023f913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d08023f907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d08023f913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d08023f907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d08023f913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d08023f907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d08023f913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d08023f907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d08023f913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d08023f907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d08023f913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d08023f907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d08023f913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d08023f907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d08023f913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d08019f5a2>] do_page_fault+0x1f2/0x4c0
(XEN)    [<ffff82d08023f913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d08023f907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d08023f913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d08023f9d9>] entry.o#handle_exception_saved+0x68/0x94
(XEN) 
(XEN) Pagetable walk from ffff81c0e06ff6a8:
(XEN)  L4[0x103] = 000000407ec02063 ffffffffffffffff
(XEN)  L3[0x103] = 000000407ec02063 ffffffffffffffff
(XEN)  L2[0x103] = 000000407ec02063 ffffffffffffffff 
(XEN)  L1[0x0ff] = 0000000000000000 ffffffffffffffff
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 32:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: ffff81c0e06ff6a8
(XEN) ****************************************
(XEN) 
(XEN) Manual reset required ('noreboot' specified)

* Re: 4.10.1 Xen crash and reboot
  2019-01-07 11:12 4.10.1 Xen crash and reboot Patrick Beckmann
@ 2019-02-11 10:41 ` Patrick Beckmann
  0 siblings, 0 replies; 11+ messages in thread
From: Patrick Beckmann @ 2019-02-11 10:41 UTC (permalink / raw)
  To: xen-devel

Hi,

Am 07.01.2019 um 12:12 schrieb Patrick Beckmann:
> I just joined this list and am referring to
>   https://lists.xenproject.org/archives/html/xen-devel/2018-12/msg00938.html
> 
> We have experienced several crashes of a recent Debian 9 Dom0 on new
> hardware with Xen version "4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10".
> After reporting it as Debian bug #912975
> (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=912975) it was
> pointed out that this appears to be the same error, and I was asked to
> join the discussion here.
> 
> Unfortunately, we are unable to reliably reproduce the behaviour.
> Currently we are running the Dom0 with Xen version
> "4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9" to test the assumption that a
> bug was introduced between these versions.
For more than five weeks now the Dom0 has been running stably with the
Debian 9 distributed Xen version
  4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9
while with version
  4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
it was crashing anywhere from once every two weeks to twice a day.

We have several other servers with similar hardware that were kept on
the 4.8.3 Xen version and have never crashed so far. Another server that
was crashing with Xen 4.8.4 has been running it with "pcid=0" for 31
days now. However, we migrated several guests away from that one, so it
now has a very low load and little diversity among its guests, which
might skew the results.

I hope this helps a bit to find the cause of the crashes. If you need
any further information, please let me know!

Thanks and Best Regards,
Patrick Beckmann

* Re: 4.10.1 Xen crash and reboot
  2019-01-01 19:46   ` Andy Smith
  2019-01-04 10:16     ` Jan Beulich
@ 2019-01-30 18:53     ` Andy Smith
  1 sibling, 0 replies; 11+ messages in thread
From: Andy Smith @ 2019-01-30 18:53 UTC (permalink / raw)
  To: xen-devel

Hi,

On Tue, Jan 01, 2019 at 07:46:57PM +0000, Andy Smith wrote:
> The test host has slightly different hardware from the others: a Xeon
> E5-1680 v4 as opposed to the Xeon D-1540 previously.
> 
> The test host is now running with pcid=0 to see if that helps. The
> longest this guest has been able to run so far without crashing the
> host is 14 days.

Just to note, it has now been 28 days and this host, running with
pcid=0 on the command line, has not crashed again.

I still have no way to reproduce the problem on demand, but if anyone
wants me to do any further debugging with pcid=1 I can do that.

Thanks,
Andy

* Re: 4.10.1 Xen crash and reboot
  2019-01-04 10:16     ` Jan Beulich
@ 2019-01-04 12:28       ` Andy Smith
  0 siblings, 0 replies; 11+ messages in thread
From: Andy Smith @ 2019-01-04 12:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

Hello,

On Fri, Jan 04, 2019 at 03:16:32AM -0700, Jan Beulich wrote:
> >>> On 01.01.19 at 20:46, <andy@strugglers.net> wrote:
> > I did move the suspect guest to a test host that does not have
> > pcid=0 and 10 days later it crashed too:
> 
> Thanks for trying this. It is now pretty clear that we need a means
> to repro (and debug).

Also interesting to me is that the guest that seems to trigger this is
a 64-bit Debian stretch guest with an older kernel that does not have
the L1TF mitigations. I have asked its owner not to upgrade their
kernel yet, because I suspect that once the guest kernel behaves
correctly it will stop crashing the host and we'll learn nothing.

Sadly I cannot reproduce this on demand.

Cheers,
Andy

* Re: 4.10.1 Xen crash and reboot
  2019-01-01 19:46   ` Andy Smith
@ 2019-01-04 10:16     ` Jan Beulich
  2019-01-04 12:28       ` Andy Smith
  2019-01-30 18:53     ` Andy Smith
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2019-01-04 10:16 UTC (permalink / raw)
  To: Andy Smith; +Cc: xen-devel

>>> On 01.01.19 at 20:46, <andy@strugglers.net> wrote:
> On Fri, Dec 21, 2018 at 06:55:38PM +0000, Andy Smith wrote:
>> Is it worth me moving this guest to a test host without pcid=0 to
>> see if it crashes it, meanwhile keeping production hosts with
>> pcid=0? And then putting pcid=0 on the test host to see if it
>> survives longer?
> 
> I did move the suspect guest to a test host that does not have
> pcid=0 and 10 days later it crashed too:

Thanks for trying this. It is now pretty clear that we need a means
to repro (and debug).

Jan



* Re: 4.10.1 Xen crash and reboot
  2018-12-21 18:55 ` Andy Smith
@ 2019-01-01 19:46   ` Andy Smith
  2019-01-04 10:16     ` Jan Beulich
  2019-01-30 18:53     ` Andy Smith
  0 siblings, 2 replies; 11+ messages in thread
From: Andy Smith @ 2019-01-01 19:46 UTC (permalink / raw)
  To: xen-devel

Hello,

On Fri, Dec 21, 2018 at 06:55:38PM +0000, Andy Smith wrote:
> Is it worth me moving this guest to a test host without pcid=0 to
> see if it crashes it, meanwhile keeping production hosts with
> pcid=0? And then putting pcid=0 on the test host to see if it
> survives longer?

I did move the suspect guest to a test host that does not have
pcid=0 and 10 days later it crashed too:

(XEN) ----[ Xen-4.10.3-pre  x86_64  debug=n   Not tainted ]----
(XEN) CPU:    15
(XEN) RIP:    e008:[<ffff82d08033d5b5>] guest_4.o#shadow_set_l1e+0x75/0x6a0
(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor (d7v0)
(XEN) rax: ffff82e07b2e69c0   rbx: 8000003d9734e027   rcx: 0000000000000000
(XEN) rdx: ffff82e000000000   rsi: ffff81c4003dfa70   rdi: 00000000ffffffff
(XEN) rbp: 0000000003d9734e   rsp: ffff83400e2afbd8   r8:  0000000003d93187
(XEN) r9:  0000000000000000   r10: ffff8300789f2000   r11: 0000000000000000
(XEN) r12: 8000003d9734e027   r13: ffff833f5be74000   r14: 0000000003d9734e
(XEN) r15: ffff81c4003dfa70   cr0: 0000000080050033   cr4: 0000000000372660
(XEN) cr3: 0000003f56c31000   cr2: ffff81c4003dfa70
(XEN) fsb: 00007f9de67fc700   gsb: ffff88007f200000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen code around <ffff82d08033d5b5> (guest_4.o#shadow_set_l1e+0x75/0x6a0):
(XEN)  0f 20 0f 85 23 01 00 00 <4d> 8b 37 4c 39 f3 0f 84 97 01 00 00 49 89 da 89
(XEN) Xen stack trace from rsp=ffff83400e2afbd8:
(XEN)    0000003d9734e000 0000000003d93187 0000000000000000 ffff833f00000002
(XEN)    ffff8300789f2000 ffff833f5be74000 ffff81c4003dfa70 ffff83400e2afef8
(XEN)    0000000003d93187 0000000003d9734e ffff8300789f2000 ffff82d08033f6f2
(XEN)    ffff833deb418e08 ffff88007bf4e4d8 ffff833f5be74600 0000000003d9734e
(XEN)    0000000003d9734e 0000000003d9734e ffff83400e2afd70 ffff83400e2afd20
(XEN)    000ffff88007bf4e 0000000000000078 ffff82d0805802c0 000000028033c294
(XEN)    0000000000000880 0000000000000008 0000000000000ef8 ffff82d0805802c0
(XEN)    0000000003d93187 ffff88007bf4e4d8 0000000000000a70 000000000000014e
(XEN)    ffff81c0e2001ef8 01ff82d000000000 8000003d9734e027 ffff82d000000000
(XEN)    ffff833f00000001 00000001789f2000 ffff83400e2affff ffff83400e2afd20
(XEN)    000000000000006f ffff88007bf4e4d8 0000003e11814067 0000003e11706067
(XEN)    0000003d9341f067 8010003d9734e067 0000000003e1310f 0000000003e11814
(XEN)    0000000003e11706 0000000003d9341f 0000000000000005 ffff82d0803265b4
(XEN)    ffff82e07b25e140 ffff833f5be74000 ffff82e07ead8620 0000000100000001
(XEN)    0dff834003e1310f 0000000d00000010 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 ffff82d0802845d3
(XEN)    000000000000000d ffff82d08032a359 ffff833f5be74000 000000010000000d
(XEN)    ffff82d080354913 ffff82d080354907 ffff82d080354913 ffff82d080354907
(XEN)    ffff82d080354913 ffff82d080354907 ffff82d080354913 ffff82d080354907
(XEN)    ffff82d080354913 ffff82d080354907 ffff82d080354913 ffff82d080354907
(XEN) Xen call trace:
(XEN)    [<ffff82d08033d5b5>] guest_4.o#shadow_set_l1e+0x75/0x6a0
(XEN)    [<ffff82d08033f6f2>] guest_4.o#sh_page_fault__guest_4+0x8f2/0x2060
(XEN)    [<ffff82d0803265b4>] shadow_alloc+0x1d4/0x380
(XEN)    [<ffff82d0802845d3>] get_page+0x13/0xe0
(XEN)    [<ffff82d08032a359>] sh_resync_all+0xb9/0x2b0
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d0802a1842>] do_page_fault+0x1a2/0x4e0
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d0803549d9>] x86_64/entry.S#handle_exception_saved+0x68/0x94
(XEN) 
(XEN) Pagetable walk from ffff81c4003dfa70:
(XEN)  L4[0x103] = 8000003f56c31063 ffffffffffffffff
(XEN)  L3[0x110] = 0000000000000000 ffffffffffffffff
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 15:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: ffff81c4003dfa70
(XEN) ****************************************
(XEN) 
(XEN) Reboot in five seconds...

The test host has slightly different hardware from the others: a Xeon
E5-1680 v4 as opposed to the Xeon D-1540 previously.

The test host is now running with pcid=0 to see if that helps. The
longest this guest has been able to run so far without crashing the
host is 14 days.

Cheers,
Andy

* Re: 4.10.1 Xen crash and reboot
  2018-12-10 15:58 Andy Smith
  2018-12-10 16:29 ` Jan Beulich
@ 2018-12-21 18:55 ` Andy Smith
  2019-01-01 19:46   ` Andy Smith
  1 sibling, 1 reply; 11+ messages in thread
From: Andy Smith @ 2018-12-21 18:55 UTC (permalink / raw)
  To: xen-devel

Hello,

And again today:

(XEN) ----[ Xen-4.10.3-pre  x86_64  debug=n   Not tainted ]----
(XEN) CPU:    4
(XEN) RIP:    e008:[<ffff82d08033f50b>] guest_4.o#sh_page_fault__guest_4+0x70b/0x2060
(XEN) RFLAGS: 0000000000010203   CONTEXT: hypervisor (d61v1)
(XEN) rax: 000000c422641dd0   rbx: ffff832005c49000   rcx: ffff81c0e0600000
(XEN) rdx: 0000000000000000   rsi: ffff832005c49000   rdi: 000000c422641dd0
(XEN) rbp: ffff81c0e0601880   rsp: ffff83207e607c38   r8:  0000000000000310
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: ffff83207e607ef8   r13: 0000000000f9cea7   r14: 0000000000000000
(XEN) r15: ffff830079592000   cr0: 0000000080050033   cr4: 0000000000372660
(XEN) cr3: 0000001ffab1a001   cr2: ffff81c0e0601880
(XEN) fsb: 00007f89c67fc700   gsb: 0000000000000000   gss: ffff88007f300000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen code around <ffff82d08033f50b> (guest_4.o#sh_page_fault__guest_4+0x70b/0x2060):
(XEN)  49 c1 e8 1e 4a 8d 2c c1 <48> 8b 4d 00 f6 c1 01 0f 84 f8 06 00 00 48 c1 e1
(XEN) Xen stack trace from rsp=ffff83207e607c38:
(XEN)    ffff830f68748208 000000c422641dd0 ffff832005c49600 0000000000f9cea7
(XEN)    ffff832005c49660 ffff832005c49000 ffff83207e607d70 ffff83207e607d20
(XEN)    000000000c422641 0000000000000090 ffff82d0805802c0 0000000205c49000
(XEN)    0000000000000008 0000000000000880 0000000000000898 ffff82d0805802c0
(XEN)    0000000001fd58a1 0000000001ffab1a 0000000000000208 0000000000000041
(XEN)    8000000f9cea7825 01ff82d000000000 000000000000000d ffff82d000000000
(XEN)    ffff832005c49000 000000010000000d ffff83207e607fff ffff83207e607d20
(XEN)    00000000000000a1 000000c422641dd0 0000000f86569067 0000000f86544067
(XEN)    0000000f68748067 8000000f9cea7925 0000000000f87171 0000000000f86569
(XEN)    0000000000f86544 0000000000f68748 0000000000000005 ffffffffffffffff
(XEN)    ffff82e03ff56340 ffff832005c49000 0000000500007ff0 0000000000000000
(XEN)    ffff83207e607e18 ffff830079592000 ffff832005c49000 ffff83207e607ef8
(XEN)    000000c422641dd0 ffff82d08034e4b0 0000000000000000 ffff82d080349e20
(XEN)    0000000000000000 ffff83207e607fff ffff830079592000 ffff82d08034e5ae
(XEN)    ffff82d080354913 ffff82d080354907 ffff82d080354913 ffff82d080354907
(XEN)    ffff82d080354913 ffff82d080354907 ffff82d080354913 ffff82d080354907
(XEN)    ffff82d080354913 ffff82d080354907 ffff82d080354913 ffff82d080354907
(XEN)    ffff82d080354913 ffff83207e607ef8 ffff830079592000 000000c422641dd0
(XEN)    ffff832005c49000 0000000000000004 0000000000000000 ffff82d0802a1842
(XEN)    ffff82d080354913 ffff82d080354907 ffff82d080354913 ffff82d080354907
(XEN) Xen call trace:
(XEN)    [<ffff82d08033f50b>] guest_4.o#sh_page_fault__guest_4+0x70b/0x2060
(XEN)    [<ffff82d08034e4b0>] do_iret+0/0x1a0
(XEN)    [<ffff82d080349e20>] toggle_guest_pt+0x30/0x160
(XEN)    [<ffff82d08034e5ae>] do_iret+0xfe/0x1a0
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d0802a1842>] do_page_fault+0x1a2/0x4e0
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d0803549d9>] x86_64/entry.S#handle_exception_saved+0x68/0x94
(XEN)
(XEN) Pagetable walk from ffff81c0e0601880:
(XEN)  L4[0x103] = 8000001ffab1a063 ffffffffffffffff
(XEN)  L3[0x103] = 8000001ffab1a063 ffffffffffffffff
(XEN)  L2[0x103] = 8000001ffab1a063 ffffffffffffffff
(XEN)  L1[0x001] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 4:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: ffff81c0e0601880
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.

The host has now been rebooted into a hypervisor with pcid=0 on the
command line.

I note that:

(XEN) RFLAGS: 0000000000010203   CONTEXT: hypervisor (d61v1)

and the previous incident (below):

(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor (d31v1)

These are the same guest.

Is it worth me moving this guest to a test host without pcid=0 to
see if it crashes it, meanwhile keeping production hosts with
pcid=0? And then putting pcid=0 on the test host to see if it
survives longer?

This will take quite a long time to gain confidence in, since the
incidents are about two weeks apart each time.

Thanks,
Andy

On Mon, Dec 10, 2018 at 03:58:41PM +0000, Andy Smith wrote:
> Hi,
> 
> Up front information:
> 
> Today one of my Xen hosts crashed with this logging on the serial:
> 
> (XEN) ----[ Xen-4.10.1  x86_64  debug=n   Not tainted ]----
> (XEN) CPU:    15
> (XEN) RIP:    e008:[<ffff82d08033db45>] guest_4.o#shadow_set_l1e+0x75/0x6a0
> (XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor (d31v1)
> (XEN) rax: ffff82e01ecfae80   rbx: 0000000f67d74025   rcx: 0000000000000000
> (XEN) rdx: ffff82e000000000   rsi: ffff81bfd79f12d8   rdi: 00000000ffffffff
> (XEN) rbp: 0000000000f67d74   rsp: ffff83202628fbd8   r8:  00000000010175c6
> (XEN) r9:  0000000000000000   r10: ffff830079592000   r11: 0000000000000000
> (XEN) r12: 0000000f67d74025   r13: ffff832020549000   r14: 0000000000f67d74
> (XEN) r15: ffff81bfd79f12d8   cr0: 0000000080050033   cr4: 0000000000372660
> (XEN) cr3: 0000001fd5b8d001   cr2: ffff81bfd79f12d8
> (XEN) fsb: 00007faf3e71f700   gsb: 0000000000000000   gss: ffff88007f300000
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> (XEN) Xen code around <ffff82d08033db45> (guest_4.o#shadow_set_l1e+0x75/0x6a0):
> (XEN)  0f 20 0f 85 23 01 00 00 <4d> 8b 37 4c 39 f3 0f 84 97 01 00 00 49 89 da 89
> (XEN) Xen stack trace from rsp=ffff83202628fbd8:
> (XEN)    0000000f67d74000 00000000010175c6 0000000000000000 ffff832000000002
> (XEN)    ffff830079592000 ffff832020549000 ffff81bfd79f12d8 ffff83202628fef8
> (XEN)    00000000010175c6 0000000000f67d74 ffff830079592000 ffff82d08033fc82
> (XEN)    8000000fad0dc125 00007faf3e25bba0 ffff832020549600 0000000000f67d74
> (XEN)    0000000000f67d74 0000000000f67d74 ffff83202628fd70 ffff83202628fd20
> (XEN)    00000007faf3e25b 00000000000000c0 ffff82d0805802c0 0000000220549000
> (XEN)    00000000000007f8 00000000000005e0 0000000000000f88 ffff82d0805802c0
> (XEN)    00000000010175c6 00007faf3e25bba0 00000000000002d8 000000000000005b
> (XEN)    ffff81c0dfebcf88 01ff82d000000000 0000000f67d74025 ffff82d000000000
> (XEN)    ffff832020549000 000000010000000d ffff83202628ffff ffff83202628fd20
> (XEN)    00000000000000e9 00007faf3e25bba0 0000000f472df067 0000000f49296067
> (XEN)    0000000f499f1067 0000000f67d74125 0000000000f498cf 0000000000f472df
> (XEN)    0000000000f49296 0000000000f499f1 0000000000000015 ffffffffffffffff
> (XEN)    ffff82e03fab71a0 ffff830079593000 ffff82d0803557eb ffff82d08020bf4a
> (XEN)    0000000000000000 ffff830079592000 ffff832020549000 ffff83202628fef8
> (XEN)    0000000000000002 ffff82d08034e9b0 0000000000633400 ffff82d08034a330
> (XEN)    ffff830079592000 ffff83202628ffff ffff830079592000 ffff82d08034eaae
> (XEN)    ffff82d080355913 ffff82d080355907 ffff82d080355913 ffff82d080355907
> (XEN)    ffff82d080355913 ffff82d080355907 ffff82d080355913 ffff82d080355907
> (XEN)    ffff82d080355913 ffff82d080355907 ffff82d080355913 ffff82d080355907
> (XEN) Xen call trace:
> (XEN)    [<ffff82d08033db45>] guest_4.o#shadow_set_l1e+0x75/0x6a0
> (XEN)    [<ffff82d08033fc82>] guest_4.o#sh_page_fault__guest_4+0x8f2/0x2060
> (XEN)    [<ffff82d0803557eb>] common_interrupt+0x9b/0x120
> (XEN)    [<ffff82d08020bf4a>] evtchn_check_pollers+0x1a/0xb0
> (XEN)    [<ffff82d08034e9b0>] do_iret+0/0x1a0
> (XEN)    [<ffff82d08034a330>] toggle_guest_pt+0x30/0x160
> (XEN)    [<ffff82d08034eaae>] do_iret+0xfe/0x1a0
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d0802a16b2>] do_page_fault+0x1a2/0x4e0
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d0803559d9>] x86_64/entry.S#handle_exception_saved+0x68/0x94
> (XEN) 
> (XEN) Pagetable walk from ffff81bfd79f12d8:
> (XEN)  L4[0x103] = 8000001fd5b8d063 ffffffffffffffff
> (XEN)  L3[0x0ff] = 0000000000000000 ffffffffffffffff
> (XEN) 
> (XEN) Reboot in five seconds...
> (XEN) Resetting with ACPI MEMORY or I/O RESET_REG.
> 
> The same host also crashed about two weeks ago, but I had nothing in
> place to record the serial console so I have no logs. There has also
> been one crash on a different host, but again no information was
> collected.
> 
> Longer background:
> 
> Around the weekend of 18 November I deployed a hypervisor built from
> staging-4.10 plus the outstanding XSA patches including XSA-273
> which I had up until then held off on.
> 
> As described in:
> 
>     https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg02811.html
> 
> within a few days I began noticing sporadic memory corruption issues
> in some guests. We established that there was a bug in the L1TF fixes,
> and I was able to avoid the problem in affected guests by making sure
> to upgrade their kernels so they have Linux's L1TF fixes.
> 
> During the first reboot into that hypervisor one of my hosts crashed
> and rebooted, but it went by too fast for me to get any information
> and there wasn't enough scrollback on the serial console.
> 
> Since then, a different host has crashed and rebooted twice. The log
> above is from the first time I have managed to capture it.
> 
> I don't think it's a hardware fault, or at least if it is, it is only
> being tickled by something added recently. I have absolutely no
> evidence for this, but I can't help feeling it's going to be related
> to L1TF again.
> 
> Do my logs above help at all?
> 
> Is it worth me trying to work out what d31 was at the time and
> taking a closer look at that?
> 
> Production system, problem that occurs weeks apart… could be a bit
> tricky to get to the bottom of.
> 
> The host is a Debian jessie dom0 running kernel version
> linux-image-3.16.0-7-amd64 3.16.59-1. The hardware is a single
> socket Xeon D-1540. The xl info is:
> 
> host                   : hobgoblin
> release                : 3.16.0-7-amd64
> version                : #1 SMP Debian 3.16.59-1 (2018-10-03)
> machine                : x86_64
> nr_cpus                : 16
> max_cpu_id             : 15
> nr_nodes               : 1
> cores_per_socket       : 8
> threads_per_core       : 2
> cpu_mhz                : 2000
> hw_caps                : bfebfbff:77fef3ff:2c100800:00000121:00000001:001cbfbb:00000000:00000100
> virt_caps              : hvm hvm_directio
> total_memory           : 130969
> free_memory            : 4646
> sharing_freed_memory   : 0
> sharing_used_memory    : 0
> outstanding_claims     : 0
> free_cpus              : 0
> xen_major              : 4
> xen_minor              : 10
> xen_extra              : .1
> xen_version            : 4.10.1
> xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 
> xen_scheduler          : credit
> xen_pagesize           : 4096
> platform_params        : virt_start=0xffff800000000000
> xen_changeset          : fe50b33b07fd447949-x86: write to correct variable in parse_pv_l
> xen_commandline        : placeholder dom0_mem=2048M dom0_max_vcpus=2 com1=115200,8n1,0x2f8,10 console=com1,vga ucode=scan serial_tx_buffer=256k
> cc_compiler            : gcc (Debian 4.9.2-10+deb8u1) 4.9.2
> cc_compile_by          : andy
> cc_compile_domain      : prymar56.org
> cc_compile_date        : Wed Nov  7 16:52:19 UTC 2018
> build_id               : 091f7ab43ab0b6ef9208a2e593c35496517fbe91
> xend_config_format     : 4
> 
> Are there any other hypervisor command line options that would be
> beneficial to set for next time? Unfortunately, unless we are very
> sure it will get us somewhere, or I can isolate the guest that is
> triggering this and put it on test hardware, I don't really want to
> keep rebooting this system. But I can set something now so that it
> takes effect at the next boot.
> 
> Thanks,
> Andy

* Re: 4.10.1 Xen crash and reboot
  2018-12-10 16:44   ` Andy Smith
@ 2018-12-10 17:12     ` Jan Beulich
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Beulich @ 2018-12-10 17:12 UTC (permalink / raw)
  To: Andy Smith; +Cc: xen-devel

>>> On 10.12.18 at 17:44, <andy@strugglers.net> wrote:
> Does setting pcid=0 leave me increasingly vulnerable to Meltdown
> and/or negatively impact performance?

I don't think there's any vulnerability concern with disabling use
of PCID. On hardware without the feature we consider ourselves
sufficiently mitigated after all. As to a performance effect, I can't
exclude it, but as with any such question part of the answer is
also "It'll depend on the workload."

Jan



* Re: 4.10.1 Xen crash and reboot
  2018-12-10 16:29 ` Jan Beulich
@ 2018-12-10 16:44   ` Andy Smith
  2018-12-10 17:12     ` Jan Beulich
  0 siblings, 1 reply; 11+ messages in thread
From: Andy Smith @ 2018-12-10 16:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

Hi Jan,

On Mon, Dec 10, 2018 at 09:29:34AM -0700, Jan Beulich wrote:
> >>> On 10.12.18 at 16:58, <andy@strugglers.net> wrote:
> > Are there any other hypervisor command line options that would be
> > beneficial to set for next time?
> 
> Well, just like for your report from a couple of weeks ago - if this is
> on PCID/INVPCID capable hardware, have you tried disabling use
> of PCID?

Aside from a quick test at Andrew's suggestion I have not, because I
thought there were negative repercussions to this, and up until this
point it seemed like the problems were restricted to guests and could
be avoided by a guest kernel upgrade.

The previous issue with the memory corruption in guests was avoided
by booting with pcid=0.

Does setting pcid=0 leave me increasingly vulnerable to Meltdown
and/or negatively impact performance?

Thanks,
Andy

* Re: 4.10.1 Xen crash and reboot
  2018-12-10 15:58 Andy Smith
@ 2018-12-10 16:29 ` Jan Beulich
  2018-12-10 16:44   ` Andy Smith
  2018-12-21 18:55 ` Andy Smith
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2018-12-10 16:29 UTC (permalink / raw)
  To: Andy Smith; +Cc: xen-devel

>>> On 10.12.18 at 16:58, <andy@strugglers.net> wrote:
> Are there any other hypervisor command line options that would be
> beneficial to set for next time?

Well, just like for your report from a couple of weeks ago - if this is
on PCID/INVPCID capable hardware, have you tried disabling use
of PCID?
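
For what it's worth, a quick way to check whether the hardware
advertises PCID/INVPCID, assuming a Linux dom0 (a minimal sketch):

  # the dom0 kernel exports the CPU feature flags; look for pcid/invpcid
  grep -o -w -e pcid -e invpcid /proc/cpuinfo | sort | uniq -c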

Jan



* 4.10.1 Xen crash and reboot
@ 2018-12-10 15:58 Andy Smith
  2018-12-10 16:29 ` Jan Beulich
  2018-12-21 18:55 ` Andy Smith
  0 siblings, 2 replies; 11+ messages in thread
From: Andy Smith @ 2018-12-10 15:58 UTC (permalink / raw)
  To: xen-devel

Hi,

Up front information:

Today one of my Xen hosts crashed with this logging on the serial:

(XEN) ----[ Xen-4.10.1  x86_64  debug=n   Not tainted ]----
(XEN) CPU:    15
(XEN) RIP:    e008:[<ffff82d08033db45>] guest_4.o#shadow_set_l1e+0x75/0x6a0
(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor (d31v1)
(XEN) rax: ffff82e01ecfae80   rbx: 0000000f67d74025   rcx: 0000000000000000
(XEN) rdx: ffff82e000000000   rsi: ffff81bfd79f12d8   rdi: 00000000ffffffff
(XEN) rbp: 0000000000f67d74   rsp: ffff83202628fbd8   r8:  00000000010175c6
(XEN) r9:  0000000000000000   r10: ffff830079592000   r11: 0000000000000000
(XEN) r12: 0000000f67d74025   r13: ffff832020549000   r14: 0000000000f67d74
(XEN) r15: ffff81bfd79f12d8   cr0: 0000000080050033   cr4: 0000000000372660
(XEN) cr3: 0000001fd5b8d001   cr2: ffff81bfd79f12d8
(XEN) fsb: 00007faf3e71f700   gsb: 0000000000000000   gss: ffff88007f300000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen code around <ffff82d08033db45> (guest_4.o#shadow_set_l1e+0x75/0x6a0):
(XEN)  0f 20 0f 85 23 01 00 00 <4d> 8b 37 4c 39 f3 0f 84 97 01 00 00 49 89 da 89
(XEN) Xen stack trace from rsp=ffff83202628fbd8:
(XEN)    0000000f67d74000 00000000010175c6 0000000000000000 ffff832000000002
(XEN)    ffff830079592000 ffff832020549000 ffff81bfd79f12d8 ffff83202628fef8
(XEN)    00000000010175c6 0000000000f67d74 ffff830079592000 ffff82d08033fc82
(XEN)    8000000fad0dc125 00007faf3e25bba0 ffff832020549600 0000000000f67d74
(XEN)    0000000000f67d74 0000000000f67d74 ffff83202628fd70 ffff83202628fd20
(XEN)    00000007faf3e25b 00000000000000c0 ffff82d0805802c0 0000000220549000
(XEN)    00000000000007f8 00000000000005e0 0000000000000f88 ffff82d0805802c0
(XEN)    00000000010175c6 00007faf3e25bba0 00000000000002d8 000000000000005b
(XEN)    ffff81c0dfebcf88 01ff82d000000000 0000000f67d74025 ffff82d000000000
(XEN)    ffff832020549000 000000010000000d ffff83202628ffff ffff83202628fd20
(XEN)    00000000000000e9 00007faf3e25bba0 0000000f472df067 0000000f49296067
(XEN)    0000000f499f1067 0000000f67d74125 0000000000f498cf 0000000000f472df
(XEN)    0000000000f49296 0000000000f499f1 0000000000000015 ffffffffffffffff
(XEN)    ffff82e03fab71a0 ffff830079593000 ffff82d0803557eb ffff82d08020bf4a
(XEN)    0000000000000000 ffff830079592000 ffff832020549000 ffff83202628fef8
(XEN)    0000000000000002 ffff82d08034e9b0 0000000000633400 ffff82d08034a330
(XEN)    ffff830079592000 ffff83202628ffff ffff830079592000 ffff82d08034eaae
(XEN)    ffff82d080355913 ffff82d080355907 ffff82d080355913 ffff82d080355907
(XEN)    ffff82d080355913 ffff82d080355907 ffff82d080355913 ffff82d080355907
(XEN)    ffff82d080355913 ffff82d080355907 ffff82d080355913 ffff82d080355907
(XEN) Xen call trace:
(XEN)    [<ffff82d08033db45>] guest_4.o#shadow_set_l1e+0x75/0x6a0
(XEN)    [<ffff82d08033fc82>] guest_4.o#sh_page_fault__guest_4+0x8f2/0x2060
(XEN)    [<ffff82d0803557eb>] common_interrupt+0x9b/0x120
(XEN)    [<ffff82d08020bf4a>] evtchn_check_pollers+0x1a/0xb0
(XEN)    [<ffff82d08034e9b0>] do_iret+0/0x1a0
(XEN)    [<ffff82d08034a330>] toggle_guest_pt+0x30/0x160
(XEN)    [<ffff82d08034eaae>] do_iret+0xfe/0x1a0
(XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d0802a16b2>] do_page_fault+0x1a2/0x4e0
(XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d0803559d9>] x86_64/entry.S#handle_exception_saved+0x68/0x94
(XEN) 
(XEN) Pagetable walk from ffff81bfd79f12d8:
(XEN)  L4[0x103] = 8000001fd5b8d063 ffffffffffffffff
(XEN)  L3[0x0ff] = 0000000000000000 ffffffffffffffff
(XEN) 
(XEN) Reboot in five seconds...
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.

The same host also crashed about two weeks ago, but I had nothing in
place to record the serial console so I have no logs. There has also
been one crash on a different host, but again no information was
collected.
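
One low-effort way to keep a record for next time is simply to log the
serial-over-LAN session to a file. A minimal sketch, assuming an
ipmitool-reachable BMC (host, user and log path are placeholders):

  # keep the SOL console attached and append everything it prints to a file
  ipmitool -I lanplus -H bmc.example.net -U admin -E sol activate \
    | tee -a /var/log/hobgoblin-serial.log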

Longer background:

Around the weekend of 18 November I deployed a hypervisor built from
staging-4.10 plus the outstanding XSA patches including XSA-273
which I had up until then held off on.

As described in:

    https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg02811.html

within a few days I began noticing sporadic memory corruption issues
in some guests. We established that there was a bug in the L1TF fixes,
and I was able to avoid the problem in affected guests by making sure
to upgrade their kernels so they have Linux's L1TF fixes.
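
As an aside, whether a given guest kernel already carries the Linux
L1TF fixes can be checked from inside the guest, assuming the kernel is
new enough to expose the sysfs vulnerability reporting at all (a
minimal sketch):

  # inside the guest; kernels without the L1TF work simply lack this file
  cat /sys/devices/system/cpu/vulnerabilities/l1tf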

During the first reboot into that hypervisor one of my hosts crashed
and rebooted, but it went by too fast for me to get any information and
there wasn't enough scrollback on the serial console.

Since then, a different host has crashed and rebooted twice. The log
above is from the first time I have managed to capture it.

I don't think it's a hardware fault, or at least if it is, it is only
being tickled by something added recently. I have absolutely no
evidence for this, but I can't help feeling it's going to be related to
L1TF again.

Do my logs above help at all?

Is it worth me trying to work out what d31 was at the time and
taking a closer look at that?
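
If the host is still up when the domain ID is noted, mapping it back to
a guest is straightforward (a minimal sketch using xl; domain IDs
change whenever a guest restarts, so this only helps before the
crash/reboot):

  xl domname 31              # print the name of domain ID 31
  xl list | awk '$2 == 31'   # or show its full xl list entry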

Production system, problem that occurs weeks apart… could be a bit
tricky to get to the bottom of.

The host is a Debian jessie dom0 running kernel version
linux-image-3.16.0-7-amd64 3.16.59-1. The hardware is a single
socket Xeon D-1540. The xl info is:

host                   : hobgoblin
release                : 3.16.0-7-amd64
version                : #1 SMP Debian 3.16.59-1 (2018-10-03)
machine                : x86_64
nr_cpus                : 16
max_cpu_id             : 15
nr_nodes               : 1
cores_per_socket       : 8
threads_per_core       : 2
cpu_mhz                : 2000
hw_caps                : bfebfbff:77fef3ff:2c100800:00000121:00000001:001cbfbb:00000000:00000100
virt_caps              : hvm hvm_directio
total_memory           : 130969
free_memory            : 4646
sharing_freed_memory   : 0
sharing_used_memory    : 0
outstanding_claims     : 0
free_cpus              : 0
xen_major              : 4
xen_minor              : 10
xen_extra              : .1
xen_version            : 4.10.1
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : fe50b33b07fd447949-x86: write to correct variable in parse_pv_l
xen_commandline        : placeholder dom0_mem=2048M dom0_max_vcpus=2 com1=115200,8n1,0x2f8,10 console=com1,vga ucode=scan serial_tx_buffer=256k
cc_compiler            : gcc (Debian 4.9.2-10+deb8u1) 4.9.2
cc_compile_by          : andy
cc_compile_domain      : prymar56.org
cc_compile_date        : Wed Nov  7 16:52:19 UTC 2018
build_id               : 091f7ab43ab0b6ef9208a2e593c35496517fbe91
xend_config_format     : 4

Are there any other hypervisor command line options that would be
beneficial to set for next time? Unfortunately, unless we are very
sure it will get us somewhere, or I can isolate the guest that is
triggering this and put it on test hardware, I don't really want to
keep rebooting this system. But I can set something now so that it
takes effect at the next boot.

Thanks,
Andy
