linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* pstore does not work under xen
@ 2019-09-19 10:26 James Dingwall
  2019-09-19 15:51 ` Luck, Tony
  0 siblings, 1 reply; 9+ messages in thread
From: James Dingwall @ 2019-09-19 10:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, Anton Vorontsov, Colin Cross, Tony Luck,
	Boris Ostrovsky, Juergen Gross

Hi,

I have been investigating a regression in our environment where pstore 
(efi-pstore specifically but I suspect this would affect all 
implementations) no longer works after upgrading from a 4.4 to 5.0 
kernel when running under xen.  (This is an Ubuntu kernel but I don't 
think there are patches which affect this area.)

In kernel/panic.c the flow of panic() is roughly:

dump_stack();

atomic_notifier_call_chain(&panic_notifier_list, 0, buf);

kmsg_dump(KMSG_DUMP_PANIC);


pstore registers a kdump callback which would normally be invoked by the 
call to kmsg_dump() however in Xen there is a panic_notifier registered 
which never returns preventing the pstore record being generated.  The 
implementation of xen_panic_event() has changed since v4.4 but I don't 
understand how this may have changed the behaviour.

Adding a couple of printks in arch/x86/xen/enlighten.c:

@@ -277,8 +278,10 @@ void xen_emergency_restart(void)
 static int
 xen_panic_event(struct notifier_block *this, unsigned long event, void *ptr)
 {
+       printk(KERN_WARNING "enter xen_panic_event()\n");
        if (!kexec_crash_loaded())
                xen_reboot(SHUTDOWN_crash);
+       printk(KERN_WARNING "exit xen_panic_event()\n");
        return NOTIFY_DONE;
 }

Only the first is printed when triggering a crash (echo c > 
/proc/sysrq-trigger)

[ 1185.458761] sysrq: SysRq : Trigger a crash
[ 1185.476937] Kernel panic - not syncing: sysrq triggered crash
[ 1185.495747] CPU: 1 PID: 19241 Comm: bash Tainted: P           OE     5.0.0-27-generic #4
[ 1185.513387] Hardware name: HP ProLiant EC200a/ProLiant EC200a, BIOS U26 05/21/2018
[ 1185.530705] Call Trace:
[ 1185.548683]  dump_stack+0x63/0x85
[ 1185.566553]  panic+0xfe/0x2b4
[ 1185.583634]  sysrq_handle_crash+0x15/0x20
[ 1185.600594]  __handle_sysrq+0x9f/0x170
[ 1185.617613]  write_sysrq_trigger+0x34/0x40
[ 1185.634271]  proc_reg_write+0x3e/0x60
[ 1185.651407]  __vfs_write+0x1b/0x40
[ 1185.668140]  vfs_write+0xb1/0x1a0
[ 1185.685087]  ksys_write+0x5c/0xe0
[ 1185.701190]  __x64_sys_write+0x1a/0x20
[ 1185.717719]  do_syscall_64+0x5a/0x120
[ 1185.733839]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1185.749739] RIP: 0033:0x7fe42e4f5154
[ 1185.764726] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 b1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
[ 1185.797937] RSP: 002b:00007fff092d0358 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 1185.814701] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe42e4f5154
[ 1185.831241] RDX: 0000000000000002 RSI: 000055c1c224bf10 RDI: 0000000000000001
[ 1185.848405] RBP: 000055c1c224bf10 R08: 000000000000000a R09: 0000000000000001
[ 1185.869816] R10: 000000000000000a R11: 0000000000000246 R12: 00007fe42e7d1760
[ 1185.888686] R13: 0000000000000002 R14: 00007fe42e7cd2a0 R15: 00007fe42e7cc760
[ 1185.905369] Kernel Offset: disabled
[ 1185.920295] enter xen_panic_event()
(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
(d1) Checking store ...xenstored: Checking store complete.xenstored: Checking store ...xenstored: Checking store complete.xenstored: Checking store ...xenstored: Checking store complete.xenstored: Checking store ...

Sorry for the long Cc list, I'm not sure who's court this falls in.

Thanks,
James

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: pstore does not work under xen
  2019-09-19 10:26 pstore does not work under xen James Dingwall
@ 2019-09-19 15:51 ` Luck, Tony
  2019-09-19 16:14   ` James Dingwall
  0 siblings, 1 reply; 9+ messages in thread
From: Luck, Tony @ 2019-09-19 15:51 UTC (permalink / raw)
  To: James Dingwall, linux-kernel
  Cc: Kees Cook, Anton Vorontsov, Colin Cross, Boris Ostrovsky, Juergen Gross

> I have been investigating a regression in our environment where pstore 
> (efi-pstore specifically but I suspect this would affect all 
> implementations) no longer works after upgrading from a 4.4 to 5.0 
> kernel when running under xen.  (This is an Ubuntu kernel but I don't 
> think there are patches which affect this area.)

I don't have any answer for this ... but want to throw out the idea that
VMM systems could provide some hypercalls to guests to save/return
some blob of memory (perhaps the "save" triggers automagically if the
guest crashes?).

That would provide a much better pstore back end than relying on emulation
of EFI persistent variables (which have severe contraints on size, and don't
support some pstore modes because you can't dynamically update EFI variables
hundreds of times per second).

-Tony

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: pstore does not work under xen
  2019-09-19 15:51 ` Luck, Tony
@ 2019-09-19 16:14   ` James Dingwall
  2019-09-19 16:37     ` Boris Ostrovsky
  0 siblings, 1 reply; 9+ messages in thread
From: James Dingwall @ 2019-09-19 16:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, Anton Vorontsov, Colin Cross, Boris Ostrovsky,
	Juergen Gross, Luck, Tony

On Thu, Sep 19, 2019 at 03:51:33PM +0000, Luck, Tony wrote:
> > I have been investigating a regression in our environment where pstore 
> > (efi-pstore specifically but I suspect this would affect all 
> > implementations) no longer works after upgrading from a 4.4 to 5.0 
> > kernel when running under xen.  (This is an Ubuntu kernel but I don't 
> > think there are patches which affect this area.)
> 
> I don't have any answer for this ... but want to throw out the idea that
> VMM systems could provide some hypercalls to guests to save/return
> some blob of memory (perhaps the "save" triggers automagically if the
> guest crashes?).
> 
> That would provide a much better pstore back end than relying on emulation
> of EFI persistent variables (which have severe contraints on size, and don't
> support some pstore modes because you can't dynamically update EFI variables
> hundreds of times per second).
> 

For clarification this is a dom0 crash rather than an HVM guest with EFI.  I
should probably have also mentioned the xen verion has changed from 4.8.4 to
4.11.2 in case its behaviour on detection of crashed domain has changed.

(For capturing guest crashes we have enabled xenconsole logging so the
hvc0 log is available in dom0.)

Thanks,
James

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: pstore does not work under xen
  2019-09-19 16:14   ` James Dingwall
@ 2019-09-19 16:37     ` Boris Ostrovsky
  2019-09-23 15:42       ` James Dingwall
  0 siblings, 1 reply; 9+ messages in thread
From: Boris Ostrovsky @ 2019-09-19 16:37 UTC (permalink / raw)
  To: James Dingwall, linux-kernel
  Cc: Kees Cook, Anton Vorontsov, Colin Cross, Juergen Gross, Luck, Tony

On 9/19/19 12:14 PM, James Dingwall wrote:
> On Thu, Sep 19, 2019 at 03:51:33PM +0000, Luck, Tony wrote:
>>> I have been investigating a regression in our environment where pstore 
>>> (efi-pstore specifically but I suspect this would affect all 
>>> implementations) no longer works after upgrading from a 4.4 to 5.0 
>>> kernel when running under xen.  (This is an Ubuntu kernel but I don't 
>>> think there are patches which affect this area.)
>> I don't have any answer for this ... but want to throw out the idea that
>> VMM systems could provide some hypercalls to guests to save/return
>> some blob of memory (perhaps the "save" triggers automagically if the
>> guest crashes?).
>>
>> That would provide a much better pstore back end than relying on emulation
>> of EFI persistent variables (which have severe contraints on size, and don't
>> support some pstore modes because you can't dynamically update EFI variables
>> hundreds of times per second).
>>
> For clarification this is a dom0 crash rather than an HVM guest with EFI.  I
> should probably have also mentioned the xen verion has changed from 4.8.4 to
> 4.11.2 in case its behaviour on detection of crashed domain has changed.
>
> (For capturing guest crashes we have enabled xenconsole logging so the
> hvc0 log is available in dom0.)


Do you only see this difference between 4.4 and 5.0 when you crash via
sysrq?

Because that's where things changed. On 4.4 we seem to be forcing an
oops, which eventually calls kmsg_dump() and then panic. On 5.0 we call
panic() directly from sysrq handler. And because Xen's panic notifier
doesn't return we never get a chance to call kmsg_dump().

-boris

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: pstore does not work under xen
  2019-09-19 16:37     ` Boris Ostrovsky
@ 2019-09-23 15:42       ` James Dingwall
  2019-09-23 22:59         ` Kees Cook
  0 siblings, 1 reply; 9+ messages in thread
From: James Dingwall @ 2019-09-23 15:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, Anton Vorontsov, Colin Cross, Juergen Gross, Luck,
	Tony, Boris Ostrovsky

On Thu, Sep 19, 2019 at 12:37:40PM -0400, Boris Ostrovsky wrote:
> On 9/19/19 12:14 PM, James Dingwall wrote:
> > On Thu, Sep 19, 2019 at 03:51:33PM +0000, Luck, Tony wrote:
> >>> I have been investigating a regression in our environment where pstore 
> >>> (efi-pstore specifically but I suspect this would affect all 
> >>> implementations) no longer works after upgrading from a 4.4 to 5.0 
> >>> kernel when running under xen.  (This is an Ubuntu kernel but I don't 
> >>> think there are patches which affect this area.)
> >> I don't have any answer for this ... but want to throw out the idea that
> >> VMM systems could provide some hypercalls to guests to save/return
> >> some blob of memory (perhaps the "save" triggers automagically if the
> >> guest crashes?).
> >>
> >> That would provide a much better pstore back end than relying on emulation
> >> of EFI persistent variables (which have severe contraints on size, and don't
> >> support some pstore modes because you can't dynamically update EFI variables
> >> hundreds of times per second).
> >>
> > For clarification this is a dom0 crash rather than an HVM guest with EFI.  I
> > should probably have also mentioned the xen verion has changed from 4.8.4 to
> > 4.11.2 in case its behaviour on detection of crashed domain has changed.
> >
> > (For capturing guest crashes we have enabled xenconsole logging so the
> > hvc0 log is available in dom0.)
> 
> 
> Do you only see this difference between 4.4 and 5.0 when you crash via
> sysrq?
> 
> Because that's where things changed. On 4.4 we seem to be forcing an
> oops, which eventually calls kmsg_dump() and then panic. On 5.0 we call
> panic() directly from sysrq handler. And because Xen's panic notifier
> doesn't return we never get a chance to call kmsg_dump().
> 

Ok, I see that change in 8341f2f222d729688014ce8306727fdb9798d37e.  I 
hadn't tested it any other way before.  Using the null pointer 
de-reference module code at [1] a pstore record is generated as expected 
when the module is loaded (panic_on_oops=1).

I have also tested swapping the kmsg_dump() / 
atomic_notifier_call_chain() around in panic.c and this also results in 
a pstore record being created with sysrq-c.  I don't know if that would 
be an acceptable solution though since it may break behaviour that other 
things depend on.

James

[1] http://ubuntu.5.x6.nabble.com/How-To-Cause-An-Oops-td3681145.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: pstore does not work under xen
  2019-09-23 15:42       ` James Dingwall
@ 2019-09-23 22:59         ` Kees Cook
  2019-09-24  0:41           ` Boris Ostrovsky
  0 siblings, 1 reply; 9+ messages in thread
From: Kees Cook @ 2019-09-23 22:59 UTC (permalink / raw)
  To: James Dingwall
  Cc: linux-kernel, Anton Vorontsov, Colin Cross, Juergen Gross, Luck,
	Tony, Boris Ostrovsky, Matthias Kaehlcke, Greg Kroah-Hartman

On Mon, Sep 23, 2019 at 03:42:27PM +0000, James Dingwall wrote:
> On Thu, Sep 19, 2019 at 12:37:40PM -0400, Boris Ostrovsky wrote:
> > On 9/19/19 12:14 PM, James Dingwall wrote:
> > > On Thu, Sep 19, 2019 at 03:51:33PM +0000, Luck, Tony wrote:
> > >>> I have been investigating a regression in our environment where pstore 
> > >>> (efi-pstore specifically but I suspect this would affect all 
> > >>> implementations) no longer works after upgrading from a 4.4 to 5.0 
> > >>> kernel when running under xen.  (This is an Ubuntu kernel but I don't 
> > >>> think there are patches which affect this area.)
> > >> I don't have any answer for this ... but want to throw out the idea that
> > >> VMM systems could provide some hypercalls to guests to save/return
> > >> some blob of memory (perhaps the "save" triggers automagically if the
> > >> guest crashes?).
> > >>
> > >> That would provide a much better pstore back end than relying on emulation
> > >> of EFI persistent variables (which have severe contraints on size, and don't
> > >> support some pstore modes because you can't dynamically update EFI variables
> > >> hundreds of times per second).
> > >>
> > > For clarification this is a dom0 crash rather than an HVM guest with EFI.  I
> > > should probably have also mentioned the xen verion has changed from 4.8.4 to
> > > 4.11.2 in case its behaviour on detection of crashed domain has changed.
> > >
> > > (For capturing guest crashes we have enabled xenconsole logging so the
> > > hvc0 log is available in dom0.)
> > 
> > 
> > Do you only see this difference between 4.4 and 5.0 when you crash via
> > sysrq?
> > 
> > Because that's where things changed. On 4.4 we seem to be forcing an
> > oops, which eventually calls kmsg_dump() and then panic. On 5.0 we call
> > panic() directly from sysrq handler. And because Xen's panic notifier
> > doesn't return we never get a chance to call kmsg_dump().
> > 
> 
> Ok, I see that change in 8341f2f222d729688014ce8306727fdb9798d37e.  I 
> hadn't tested it any other way before.  Using the null pointer 
> de-reference module code at [1] a pstore record is generated as expected 
> when the module is loaded (panic_on_oops=1).

This change looks correct -- it just gets us directly to the panic()
state instead of exercising the various exception handlers.

> I have also tested swapping the kmsg_dump() / 
> atomic_notifier_call_chain() around in panic.c and this also results in 
> a pstore record being created with sysrq-c.  I don't know if that would 
> be an acceptable solution though since it may break behaviour that other 
> things depend on.

I don't think reordering these is a good idea: as the comments say,
there might be work done in the notifier chain that kmsg_dump() will
want to capture (e.g. the KASLR base offset).

The situation seems to be that notifier callbacks must return -- I think
Xen needs fixing here.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: pstore does not work under xen
  2019-09-23 22:59         ` Kees Cook
@ 2019-09-24  0:41           ` Boris Ostrovsky
  2019-09-25 11:01             ` James Dingwall
  0 siblings, 1 reply; 9+ messages in thread
From: Boris Ostrovsky @ 2019-09-24  0:41 UTC (permalink / raw)
  To: Kees Cook, James Dingwall
  Cc: linux-kernel, Anton Vorontsov, Colin Cross, Juergen Gross, Luck,
	Tony, Matthias Kaehlcke, Greg Kroah-Hartman

On 9/23/19 6:59 PM, Kees Cook wrote:
> On Mon, Sep 23, 2019 at 03:42:27PM +0000, James Dingwall wrote:
>> On Thu, Sep 19, 2019 at 12:37:40PM -0400, Boris Ostrovsky wrote:
>>> On 9/19/19 12:14 PM, James Dingwall wrote:
>>>> On Thu, Sep 19, 2019 at 03:51:33PM +0000, Luck, Tony wrote:
>>>>>> I have been investigating a regression in our environment where pstore 
>>>>>> (efi-pstore specifically but I suspect this would affect all 
>>>>>> implementations) no longer works after upgrading from a 4.4 to 5.0 
>>>>>> kernel when running under xen.  (This is an Ubuntu kernel but I don't 
>>>>>> think there are patches which affect this area.)
>>>>> I don't have any answer for this ... but want to throw out the idea that
>>>>> VMM systems could provide some hypercalls to guests to save/return
>>>>> some blob of memory (perhaps the "save" triggers automagically if the
>>>>> guest crashes?).
>>>>>
>>>>> That would provide a much better pstore back end than relying on emulation
>>>>> of EFI persistent variables (which have severe contraints on size, and don't
>>>>> support some pstore modes because you can't dynamically update EFI variables
>>>>> hundreds of times per second).
>>>>>
>>>> For clarification this is a dom0 crash rather than an HVM guest with EFI.  I
>>>> should probably have also mentioned the xen verion has changed from 4.8.4 to
>>>> 4.11.2 in case its behaviour on detection of crashed domain has changed.
>>>>
>>>> (For capturing guest crashes we have enabled xenconsole logging so the
>>>> hvc0 log is available in dom0.)
>>>
>>> Do you only see this difference between 4.4 and 5.0 when you crash via
>>> sysrq?
>>>
>>> Because that's where things changed. On 4.4 we seem to be forcing an
>>> oops, which eventually calls kmsg_dump() and then panic. On 5.0 we call
>>> panic() directly from sysrq handler. And because Xen's panic notifier
>>> doesn't return we never get a chance to call kmsg_dump().
>>>
>> Ok, I see that change in 8341f2f222d729688014ce8306727fdb9798d37e.  I 
>> hadn't tested it any other way before.  Using the null pointer 
>> de-reference module code at [1] a pstore record is generated as expected 
>> when the module is loaded (panic_on_oops=1).
> This change looks correct -- it just gets us directly to the panic()
> state instead of exercising the various exception handlers.
>
>> I have also tested swapping the kmsg_dump() / 
>> atomic_notifier_call_chain() around in panic.c and this also results in 
>> a pstore record being created with sysrq-c.  I don't know if that would 
>> be an acceptable solution though since it may break behaviour that other 
>> things depend on.
> I don't think reordering these is a good idea: as the comments say,
> there might be work done in the notifier chain that kmsg_dump() will
> want to capture (e.g. the KASLR base offset).
>
> The situation seems to be that notifier callbacks must return -- I think
> Xen needs fixing here.
>

I only had one quick sanity test with a PV guest so this needs more
testing. James, can you give it a try?


diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 750f46ad018a..d88f118028b4 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -269,16 +269,17 @@ void xen_reboot(int reason)
                BUG();
 }
 
+static int reboot_reason = SHUTDOWN_reboot;
 void xen_emergency_restart(void)
 {
-       xen_reboot(SHUTDOWN_reboot);
+       xen_reboot(reboot_reason);
 }
 
 static int
 xen_panic_event(struct notifier_block *this, unsigned long event, void
*ptr)
 {
        if (!kexec_crash_loaded())
-               xen_reboot(SHUTDOWN_crash);
+               reboot_reason = SHUTDOWN_crash;
        return NOTIFY_DONE;
 }
 
@@ -290,6 +291,8 @@ static struct notifier_block xen_panic_block = {
 int xen_panic_handler_init(void)
 {
        atomic_notifier_chain_register(&panic_notifier_list,
&xen_panic_block);
+       if (panic_timeout == 0)
+               set_arch_panic_timeout(-1, CONFIG_PANIC_TIMEOUT);
        return 0;
 }



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: pstore does not work under xen
  2019-09-24  0:41           ` Boris Ostrovsky
@ 2019-09-25 11:01             ` James Dingwall
  2019-09-25 15:38               ` Boris Ostrovsky
  0 siblings, 1 reply; 9+ messages in thread
From: James Dingwall @ 2019-09-25 11:01 UTC (permalink / raw)
  To: Boris Ostrovsky, Kees Cook, linux-kernel
  Cc: Anton Vorontsov, Colin Cross, Juergen Gross, Luck, Tony,
	Matthias Kaehlcke, Greg Kroah-Hartman

On Mon, Sep 23, 2019 at 08:41:05PM -0400, Boris Ostrovsky wrote:
> On 9/23/19 6:59 PM, Kees Cook wrote:
> > On Mon, Sep 23, 2019 at 03:42:27PM +0000, James Dingwall wrote:
> >> On Thu, Sep 19, 2019 at 12:37:40PM -0400, Boris Ostrovsky wrote:
> >>> On 9/19/19 12:14 PM, James Dingwall wrote:
> >>>> On Thu, Sep 19, 2019 at 03:51:33PM +0000, Luck, Tony wrote:
> >>>>>> I have been investigating a regression in our environment where pstore 
> >>>>>> (efi-pstore specifically but I suspect this would affect all 
> >>>>>> implementations) no longer works after upgrading from a 4.4 to 5.0 
> >>>>>> kernel when running under xen.  (This is an Ubuntu kernel but I don't 
> >>>>>> think there are patches which affect this area.)
> >>>>> I don't have any answer for this ... but want to throw out the idea that
> >>>>> VMM systems could provide some hypercalls to guests to save/return
> >>>>> some blob of memory (perhaps the "save" triggers automagically if the
> >>>>> guest crashes?).
> >>>>>
> >>>>> That would provide a much better pstore back end than relying on emulation
> >>>>> of EFI persistent variables (which have severe contraints on size, and don't
> >>>>> support some pstore modes because you can't dynamically update EFI variables
> >>>>> hundreds of times per second).
> >>>>>
> >>>> For clarification this is a dom0 crash rather than an HVM guest with EFI.  I
> >>>> should probably have also mentioned the xen verion has changed from 4.8.4 to
> >>>> 4.11.2 in case its behaviour on detection of crashed domain has changed.
> >>>>
> >>>> (For capturing guest crashes we have enabled xenconsole logging so the
> >>>> hvc0 log is available in dom0.)
> >>>
> >>> Do you only see this difference between 4.4 and 5.0 when you crash via
> >>> sysrq?
> >>>
> >>> Because that's where things changed. On 4.4 we seem to be forcing an
> >>> oops, which eventually calls kmsg_dump() and then panic. On 5.0 we call
> >>> panic() directly from sysrq handler. And because Xen's panic notifier
> >>> doesn't return we never get a chance to call kmsg_dump().
> >>>
> >> Ok, I see that change in 8341f2f222d729688014ce8306727fdb9798d37e.  I 
> >> hadn't tested it any other way before.  Using the null pointer 
> >> de-reference module code at [1] a pstore record is generated as expected 
> >> when the module is loaded (panic_on_oops=1).
> > This change looks correct -- it just gets us directly to the panic()
> > state instead of exercising the various exception handlers.
> >
> >> I have also tested swapping the kmsg_dump() / 
> >> atomic_notifier_call_chain() around in panic.c and this also results in 
> >> a pstore record being created with sysrq-c.  I don't know if that would 
> >> be an acceptable solution though since it may break behaviour that other 
> >> things depend on.
> > I don't think reordering these is a good idea: as the comments say,
> > there might be work done in the notifier chain that kmsg_dump() will
> > want to capture (e.g. the KASLR base offset).
> >
> > The situation seems to be that notifier callbacks must return -- I think
> > Xen needs fixing here.
> >
> 
> I only had one quick sanity test with a PV guest so this needs more
> testing. James, can you give it a try?

I have tested the patch in a pv domU and in dom0.  The kernel was built 
with a default Ubuntu .config which sets CONFIG_PANIC_TIMEOUT=0. uname 
-r = 5.0.0-27-generic.

In the domU (no custom kernel parameters):

- sysrq-c: I saw my debug printk messages about kmsg_dump() being 
invoked after the traceback and the domain became listed as crashed in 
the xl status output.

- halt -p: clean shutdown

- shutdown -r now: clean reboot


In the domU with panic=15 in kernel parameters:

- sysrq-c: as without panic=15 then final message "Rebooting in 15 
seconds.." printed but domain never rebooted.  Without this patch the 
domain doesn't reboot or print the Rebooting message presumably because 
of the non-returning panic handler.  If that message is reached then I 
think I would expect a reboot.  (In our Linux 4.4 / Xen 4.8.4 
environment no value of panic resulted in reboot.)

- halt -p: clean shutdown

- shutdown -r now: clean reboot


In dom0 with oops=panic panic=15 in the kernel parameters:

- sysrq-c: kmsg_dump() debug messages printed, last linux message 
"Rebooting in 15 seconds..", after 15s "(XEN) Hardware Dom0 crashed: 
rebooting machine in 5 seconds.", after 5s system rebooted.  On the 
next start a pstore record is present as expected.

- halt -p: clean shutdown, no pstore record present on next start.

- shutdown -r now: clean reboot, no pstore record present on next 
start.


In dom0 with panic=0:

- sysrq-c: dom0 crashes, no reboot messages printed from Xen or kernel 
and system hangs.  A pstore record is present on next start.

- halt -p: clean shutdown, no pstore record present on next start.

- shutdown -r now: clean reboot, no pstore record present on next 
start.


In my opinion this patch:
 - fixes the original issue of no pstore record being generated for a 
dom0 panic.
 - respects the dom0 panic=... value.  This is a change in behaviour in 
how xen handles the crashed dom0 from always rebooting to only rebooting 
if panic > 0.
 - causes a Rebooting message to be printed in a crashed pv domU when 
panic > 0 but the domain does not reboot when it should.

James

> 
> 
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index 750f46ad018a..d88f118028b4 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -269,16 +269,17 @@ void xen_reboot(int reason)
>                 BUG();
>  }
>  
> +static int reboot_reason = SHUTDOWN_reboot;
>  void xen_emergency_restart(void)
>  {
> -       xen_reboot(SHUTDOWN_reboot);
> +       xen_reboot(reboot_reason);
>  }
>  
>  static int
>  xen_panic_event(struct notifier_block *this, unsigned long event, void
> *ptr)
>  {
>         if (!kexec_crash_loaded())
> -               xen_reboot(SHUTDOWN_crash);
> +               reboot_reason = SHUTDOWN_crash;
>         return NOTIFY_DONE;
>  }
>  
> @@ -290,6 +291,8 @@ static struct notifier_block xen_panic_block = {
>  int xen_panic_handler_init(void)
>  {
>         atomic_notifier_chain_register(&panic_notifier_list,
> &xen_panic_block);
> +       if (panic_timeout == 0)
> +               set_arch_panic_timeout(-1, CONFIG_PANIC_TIMEOUT);
>         return 0;
>  }
> 
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: pstore does not work under xen
  2019-09-25 11:01             ` James Dingwall
@ 2019-09-25 15:38               ` Boris Ostrovsky
  0 siblings, 0 replies; 9+ messages in thread
From: Boris Ostrovsky @ 2019-09-25 15:38 UTC (permalink / raw)
  To: James Dingwall, Kees Cook, linux-kernel
  Cc: Anton Vorontsov, Colin Cross, Juergen Gross, Luck, Tony,
	Matthias Kaehlcke, Greg Kroah-Hartman, xen-devel

On 9/25/19 7:01 AM, James Dingwall wrote:
> On Mon, Sep 23, 2019 at 08:41:05PM -0400, Boris Ostrovsky wrote:
>> On 9/23/19 6:59 PM, Kees Cook wrote:
>>> On Mon, Sep 23, 2019 at 03:42:27PM +0000, James Dingwall wrote:
>>>> On Thu, Sep 19, 2019 at 12:37:40PM -0400, Boris Ostrovsky wrote:
>>>>> On 9/19/19 12:14 PM, James Dingwall wrote:
>>>>>> On Thu, Sep 19, 2019 at 03:51:33PM +0000, Luck, Tony wrote:
>>>>>>>> I have been investigating a regression in our environment where pstore 
>>>>>>>> (efi-pstore specifically but I suspect this would affect all 
>>>>>>>> implementations) no longer works after upgrading from a 4.4 to 5.0 
>>>>>>>> kernel when running under xen.  (This is an Ubuntu kernel but I don't 
>>>>>>>> think there are patches which affect this area.)
>>>>>>> I don't have any answer for this ... but want to throw out the idea that
>>>>>>> VMM systems could provide some hypercalls to guests to save/return
>>>>>>> some blob of memory (perhaps the "save" triggers automagically if the
>>>>>>> guest crashes?).
>>>>>>>
>>>>>>> That would provide a much better pstore back end than relying on emulation
>>>>>>> of EFI persistent variables (which have severe contraints on size, and don't
>>>>>>> support some pstore modes because you can't dynamically update EFI variables
>>>>>>> hundreds of times per second).
>>>>>>>
>>>>>> For clarification this is a dom0 crash rather than an HVM guest with EFI.  I
>>>>>> should probably have also mentioned the xen verion has changed from 4.8.4 to
>>>>>> 4.11.2 in case its behaviour on detection of crashed domain has changed.
>>>>>>
>>>>>> (For capturing guest crashes we have enabled xenconsole logging so the
>>>>>> hvc0 log is available in dom0.)
>>>>> Do you only see this difference between 4.4 and 5.0 when you crash via
>>>>> sysrq?
>>>>>
>>>>> Because that's where things changed. On 4.4 we seem to be forcing an
>>>>> oops, which eventually calls kmsg_dump() and then panic. On 5.0 we call
>>>>> panic() directly from sysrq handler. And because Xen's panic notifier
>>>>> doesn't return we never get a chance to call kmsg_dump().
>>>>>
>>>> Ok, I see that change in 8341f2f222d729688014ce8306727fdb9798d37e.  I 
>>>> hadn't tested it any other way before.  Using the null pointer 
>>>> de-reference module code at [1] a pstore record is generated as expected 
>>>> when the module is loaded (panic_on_oops=1).
>>> This change looks correct -- it just gets us directly to the panic()
>>> state instead of exercising the various exception handlers.
>>>
>>>> I have also tested swapping the kmsg_dump() / 
>>>> atomic_notifier_call_chain() around in panic.c and this also results in 
>>>> a pstore record being created with sysrq-c.  I don't know if that would 
>>>> be an acceptable solution though since it may break behaviour that other 
>>>> things depend on.
>>> I don't think reordering these is a good idea: as the comments say,
>>> there might be work done in the notifier chain that kmsg_dump() will
>>> want to capture (e.g. the KASLR base offset).
>>>
>>> The situation seems to be that notifier callbacks must return -- I think
>>> Xen needs fixing here.
>>>
>> I only had one quick sanity test with a PV guest so this needs more
>> testing. James, can you give it a try?
> I have tested the patch in a pv domU and in dom0.  The kernel was built 
> with a default Ubuntu .config which sets CONFIG_PANIC_TIMEOUT=0. uname 
> -r = 5.0.0-27-generic.
>
> In the domU (no custom kernel parameters):
>
> - sysrq-c: I saw my debug printk messages about kmsg_dump() being 
> invoked after the traceback and the domain became listed as crashed in 
> the xl status output.
>
> - halt -p: clean shutdown
>
> - shutdown -r now: clean reboot
>
>
> In the domU with panic=15 in kernel parameters:
>
> - sysrq-c: as without panic=15 then final message "Rebooting in 15 
> seconds.." printed but domain never rebooted.  Without this patch the 
> domain doesn't reboot or print the Rebooting message presumably because 
> of the non-returning panic handler.  If that message is reached then I 
> think I would expect a reboot.  (In our Linux 4.4 / Xen 4.8.4 
> environment no value of panic resulted in reboot.)

(+xen-devel which I just noticed has not been copied on this thread)

Whether a guest is rebooted is controlled by on_crash parameter in the
config file. By default it's 'destroy'.


>
> - halt -p: clean shutdown
>
> - shutdown -r now: clean reboot
>
>
> In dom0 with oops=panic panic=15 in the kernel parameters:
>
> - sysrq-c: kmsg_dump() debug messages printed, last linux message 
> "Rebooting in 15 seconds..", after 15s "(XEN) Hardware Dom0 crashed: 
> rebooting machine in 5 seconds.", after 5s system rebooted.  On the 
> next start a pstore record is present as expected.
>
> - halt -p: clean shutdown, no pstore record present on next start.
>
> - shutdown -r now: clean reboot, no pstore record present on next 
> start.
>
>
> In dom0 with panic=0:
>
> - sysrq-c: dom0 crashes, no reboot messages printed from Xen or kernel 
> and system hangs.  A pstore record is present on next start.
>
> - halt -p: clean shutdown, no pstore record present on next start.
>
> - shutdown -r now: clean reboot, no pstore record present on next 
> start.
>
>
> In my opinion this patch:
>  - fixes the original issue of no pstore record being generated for a 
> dom0 panic.
>  - respects the dom0 panic=... value.  This is a change in behaviour in 
> how xen handles the crashed dom0 from always rebooting to only rebooting 
> if panic > 0.

Even thought this is now correct behavior it's a change from what we've
always had. And I think we should preserve that.

Meaning that we have to ignore if panic_timeout is set to zero either on
boot line or via sysfs.


-boris

>  - causes a Rebooting message to be printed in a crashed pv domU when 
> panic > 0 but the domain does not reboot when it should.
>
> James
>
>>
>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> index 750f46ad018a..d88f118028b4 100644
>> --- a/arch/x86/xen/enlighten.c
>> +++ b/arch/x86/xen/enlighten.c
>> @@ -269,16 +269,17 @@ void xen_reboot(int reason)
>>                 BUG();
>>  }
>>  
>> +static int reboot_reason = SHUTDOWN_reboot;
>>  void xen_emergency_restart(void)
>>  {
>> -       xen_reboot(SHUTDOWN_reboot);
>> +       xen_reboot(reboot_reason);
>>  }
>>  
>>  static int
>>  xen_panic_event(struct notifier_block *this, unsigned long event, void
>> *ptr)
>>  {
>>         if (!kexec_crash_loaded())
>> -               xen_reboot(SHUTDOWN_crash);
>> +               reboot_reason = SHUTDOWN_crash;
>>         return NOTIFY_DONE;
>>  }
>>  
>> @@ -290,6 +291,8 @@ static struct notifier_block xen_panic_block = {
>>  int xen_panic_handler_init(void)
>>  {
>>         atomic_notifier_chain_register(&panic_notifier_list,
>> &xen_panic_block);
>> +       if (panic_timeout == 0)
>> +               set_arch_panic_timeout(-1, CONFIG_PANIC_TIMEOUT);
>>         return 0;
>>  }
>>
>>


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-09-25 15:39 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-19 10:26 pstore does not work under xen James Dingwall
2019-09-19 15:51 ` Luck, Tony
2019-09-19 16:14   ` James Dingwall
2019-09-19 16:37     ` Boris Ostrovsky
2019-09-23 15:42       ` James Dingwall
2019-09-23 22:59         ` Kees Cook
2019-09-24  0:41           ` Boris Ostrovsky
2019-09-25 11:01             ` James Dingwall
2019-09-25 15:38               ` Boris Ostrovsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).