linux-kernel.vger.kernel.org archive mirror
* mgag200 fails kdump kernel booting
@ 2019-06-26  8:15 Baoquan He
  2019-06-26  8:29 ` Baoquan He
  2019-07-02  2:21 ` Dave Young
  0 siblings, 2 replies; 10+ messages in thread
From: Baoquan He @ 2019-06-26  8:15 UTC (permalink / raw)
  To: airlied; +Cc: kexec, x86, linux-kernel, dyoung

Hi Dave,

We hit a kdump kernel boot failure on a Lenovo system. The kdump kernel
fails to boot; the machine simply resets to firmware and reboots, and
nothing is printed.

The machine is a large server with 6T of memory and many CPUs; its
graphics driver module is mgag200.

With 'earlyprintk=ttyS0' added to the kernel command line, only one
line was printed to the console during kdump kernel boot:
     KASLR disabled: 'nokaslr' on cmdline.

Then the machine reset to firmware and rebooted.

Further debugging showed that the failure happens in
arch/x86/boot/compressed/misc.c, during the kernel decompression stage,
and that it is triggered by the VGA printing. As you can see in
__putstr() in that file, the code checks whether earlyprintk= is
specified and, if so, prints to that target. Regardless of whether
earlyprintk= is given, it also prints to VGA, and printing to VGA is
what caused the reset to firmware. That's why we see nothing when
earlyprintk= is not specified, but see only the single 'KASLR disabled'
line.
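
A minimal userspace sketch of the control flow described above, under a
simplified model: serial output happens only when it was configured,
while the VGA write happens whenever a text screen is believed present.
All names here (putstr_model, serial_enabled, vga_present and the two
log buffers) are hypothetical stand-ins for illustration, not the
kernel's actual code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the decompressor's state; not kernel code. */
int serial_enabled;        /* nonzero when earlyprintk=ttyS0 was parsed */
int vga_present = 1;       /* models "a text-mode screen was detected" */
char serial_log[256];      /* what would have reached the serial port */
char vga_log[256];         /* what would have been written to VGA memory */

/* Append s to a NUL-terminated buffer of capacity cap, truncating if full. */
static void append(char *dst, size_t cap, const char *s)
{
	size_t used;

	assert(dst && s);
	used = strlen(dst);
	if (used + 1 < cap)
		strncat(dst, s, cap - used - 1);
}

/* Serial output is conditional; the VGA write is not gated on earlyprintk=. */
void putstr_model(const char *s)
{
	if (serial_enabled)
		append(serial_log, sizeof(serial_log), s);

	if (!vga_present)
		return;

	/* On the affected machines, this is the step that led to the reset. */
	append(vga_log, sizeof(vga_log), s);
}
```

In this model, calling putstr_model() with serial_enabled set to 0
leaves serial_log empty while vga_log still receives the text, which
matches the behaviour reported here: the VGA path runs either way.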

To confirm that VGA printing is the cause, I blacklisted mgag200 by
adding it to /etc/modprobe.d/blacklist.conf, and the kdump kernel then
booted successfully. Adding 'nomodeset' also makes it work. So it's
certain that something is wrong in the mgag200 driver or related code
when the boot code tries to re-init the device.

This is the only such case we have seen, so I tend toward pursuing a fix
on the mgag200 driver side. Any ideas or suggestions? We have two
machines that reproduce it reliably.

Thanks
Baoquan


* Re: mgag200 fails kdump kernel booting
  2019-06-26  8:15 mgag200 fails kdump kernel booting Baoquan He
@ 2019-06-26  8:29 ` Baoquan He
  2019-07-01 20:51   ` David Airlie
  2019-07-02  2:21 ` Dave Young
  1 sibling, 1 reply; 10+ messages in thread
From: Baoquan He @ 2019-06-26  8:29 UTC (permalink / raw)
  To: airlied; +Cc: kexec, x86, linux-kernel, dyoung

On 06/26/19 at 04:15pm, Baoquan He wrote:
> Hi Dave,
> 
> We met an kdump kernel boot failure on a lenovo system. Kdump kernel
> failed to boot, but just reset to firmware to reboot system. And nothing
> is printed out.
> 
> The machine is a big server, with 6T memory and many cpu, its graphic
> driver module is mgag200.
> 
> When added 'earlyprintk=ttyS0' into kernel command line, it printed
> out only one line to console during kdump kernel booting:
>      KASLR disabled: 'nokaslr' on cmdline.
> 
> Then reset to firmware to reboot system.
> 
> By further code debugging, the failure happened in
> arch/x86/boot/compressed/misc.c, during kernel decompressing stage. It's
> triggered by the vga printing. As you can see, in __putstr() of
> arch/x86/boot/compressed/misc.c, the code checks if earlyprintk= is
> specified, and print out to the target. And no matter if earlyprintk= is
> added or not, it will print to VGA. And printing to VGA caused it to
> reset to firmware. That's why we see nothing when didn't specify
> earlyprintk=, but see only one line of printing about the 'KASLR
> disabled'.

What I mean here is:
That's why we see nothing when earlyprintk= is not specified, but see
only the single 'KASLR disabled' line when earlyprintk=ttyS0 is added.

> 
> To confirm it's caused by VGA printing, I blacklist the mgag200 by
> writting it into /etc/modprobe.d/blacklist.conf. The kdump kernel can
> boot up successfully. And add 'nomodeset' can also make it work. So it's
> for sure mgag driver or related code have something wrong when booting
> code tries to re-init it.
> 
> This is the only case we ever see, tend to pursuit fix in mgag200 driver
> side. Any idea or suggestion? We have two machines to be able to
> reproduce it stablly.
> 
> Thanks
> Baoquan


* Re: mgag200 fails kdump kernel booting
  2019-06-26  8:29 ` Baoquan He
@ 2019-07-01 20:51   ` David Airlie
  2019-07-02  1:41     ` Baoquan He
  2020-02-05  7:31     ` Baoquan He
  0 siblings, 2 replies; 10+ messages in thread
From: David Airlie @ 2019-07-01 20:51 UTC (permalink / raw)
  To: Baoquan He; +Cc: kexec, x86, linux-kernel, dyoung, Lyude Paul

On Wed, Jun 26, 2019 at 6:29 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 06/26/19 at 04:15pm, Baoquan He wrote:
> > Hi Dave,
> >
> > We met an kdump kernel boot failure on a lenovo system. Kdump kernel
> > failed to boot, but just reset to firmware to reboot system. And nothing
> > is printed out.
> >
> > The machine is a big server, with 6T memory and many cpu, its graphic
> > driver module is mgag200.
> >
> > When added 'earlyprintk=ttyS0' into kernel command line, it printed
> > out only one line to console during kdump kernel booting:
> >      KASLR disabled: 'nokaslr' on cmdline.
> >
> > Then reset to firmware to reboot system.
> >
> > By further code debugging, the failure happened in
> > arch/x86/boot/compressed/misc.c, during kernel decompressing stage. It's
> > triggered by the vga printing. As you can see, in __putstr() of
> > arch/x86/boot/compressed/misc.c, the code checks if earlyprintk= is
> > specified, and print out to the target. And no matter if earlyprintk= is
> > added or not, it will print to VGA. And printing to VGA caused it to
> > reset to firmware. That's why we see nothing when didn't specify
> > earlyprintk=, but see only one line of printing about the 'KASLR
> > disabled'.
>
> Here I mean:
> That's why we see nothing when didn't specify earlyprintk=, but see only
> one line of printing about the 'KASLR disabled' message when
> earlyprintk=ttyS0 added.

Just to clarify: if the original kernel is booted with mgag200 turned
off, then kexec works, but if the original kernel loads mgag200, the
kexec kernel resets hard when VGA is used to write stuff out.

This *might* be fixable in the controlled kexec case, by having an
mgag200 shutdown path that tries to put the GPU back into a state
where VGA doesn't die, but for the uncontrolled kexec it'll still be a
problem, since once the GPU is up and running and VGA is disabled, it
doesn't expect to see any more VGA transactions.

Dave.
>
> >
> > To confirm it's caused by VGA printing, I blacklist the mgag200 by
> > writting it into /etc/modprobe.d/blacklist.conf. The kdump kernel can
> > boot up successfully. And add 'nomodeset' can also make it work. So it's
> > for sure mgag driver or related code have something wrong when booting
> > code tries to re-init it.
> >
> > This is the only case we ever see, tend to pursuit fix in mgag200 driver
> > side. Any idea or suggestion? We have two machines to be able to
> > reproduce it stablly.
> >
> > Thanks
> > Baoquan


* Re: mgag200 fails kdump kernel booting
  2019-07-01 20:51   ` David Airlie
@ 2019-07-02  1:41     ` Baoquan He
  2019-07-02  3:17       ` Dave Young
  2020-02-05  7:31     ` Baoquan He
  1 sibling, 1 reply; 10+ messages in thread
From: Baoquan He @ 2019-07-02  1:41 UTC (permalink / raw)
  To: David Airlie; +Cc: kexec, x86, linux-kernel, dyoung, Lyude Paul

On 07/02/19 at 06:51am, David Airlie wrote:
> On Wed, Jun 26, 2019 at 6:29 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 06/26/19 at 04:15pm, Baoquan He wrote:
> > > Hi Dave,
> > >
> > > We met an kdump kernel boot failure on a lenovo system. Kdump kernel
> > > failed to boot, but just reset to firmware to reboot system. And nothing
> > > is printed out.
> > >
> > > The machine is a big server, with 6T memory and many cpu, its graphic
> > > driver module is mgag200.
> > >
> > > When added 'earlyprintk=ttyS0' into kernel command line, it printed
> > > out only one line to console during kdump kernel booting:
> > >      KASLR disabled: 'nokaslr' on cmdline.
> > >
> > > Then reset to firmware to reboot system.
> > >
> > > By further code debugging, the failure happened in
> > > arch/x86/boot/compressed/misc.c, during kernel decompressing stage. It's
> > > triggered by the vga printing. As you can see, in __putstr() of
> > > arch/x86/boot/compressed/misc.c, the code checks if earlyprintk= is
> > > specified, and print out to the target. And no matter if earlyprintk= is
> > > added or not, it will print to VGA. And printing to VGA caused it to
> > > reset to firmware. That's why we see nothing when didn't specify
> > > earlyprintk=, but see only one line of printing about the 'KASLR
> > > disabled'.
> >
> > Here I mean:
> > That's why we see nothing when didn't specify earlyprintk=, but see only
> > one line of printing about the 'KASLR disabled' message when
> > earlyprintk=ttyS0 added.
> 
> Just to clarify, the original kernel is booted with mgag200 turned
> off, then kexec works, but if the original kernel loads mgag200, the
> kexec kernels resets hard when the VGA is used to write stuff out.

Thanks for looking into this, Dave.

Yeah, in fact the issue was found in the kdump kernel. I haven't checked
the kexec jump. A kexec jump calls device_shutdown() to attempt to shut
down all devices before jumping to the 2nd kernel, but a kdump jump
does not.

> 
> This *might* be fixable in the controlled kexec case, but having an
> mgag200 shutdown path that tries to put the gpu back into a state
> where VGA doesn't die, but for the uncontrolled kexec it'll still be a
> problem, since once the gpu is up and running and VGA is disabled, it
> doesn't expect to see anymore VGA transactions.

Yes, I see. In the kexec case it should have been shut down by
device_shutdown(). By the uncontrolled case, I guess you mean the kdump
case. In the kdump case we don't call device_shutdown() before jumping,
because the 1st kernel is already in a crashed state and we just want to
switch to the kdump kernel ASAP. So I wonder how other GPU/VGA devices
and drivers behave; so far we haven't received reports about them.
Probably mgag200 is very new, or we just haven't hit them. This issue
was seen on a newly bought server.
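
The difference between the two jump paths can be sketched as a small
userspace model. This is an illustration only: struct gpu_model,
gpu_shutdown(), kexec_jump() and kdump_jump() are hypothetical names,
and the assumption that a driver shutdown hook could restore a VGA-safe
state is exactly the part that is unproven for mgag200.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a display device; not kernel code. */
struct gpu_model {
	bool vga_usable;	/* false once a native driver has taken over */
};

/*
 * Stand-in for a driver shutdown callback that puts the device back
 * into a state where legacy VGA accesses are safe (assumed possible).
 */
void gpu_shutdown(struct gpu_model *gpu)
{
	assert(gpu);
	gpu->vga_usable = true;
}

/* Controlled kexec: devices are shut down before the jump. */
bool kexec_jump(struct gpu_model *gpu)
{
	gpu_shutdown(gpu);		/* models device_shutdown() */
	return gpu->vga_usable;		/* state the 2nd kernel inherits */
}

/* kdump: the crashed kernel jumps directly; no shutdown hooks run. */
bool kdump_jump(struct gpu_model *gpu)
{
	assert(gpu);
	return gpu->vga_usable;
}
```

Starting from vga_usable == false, kexec_jump() hands over a usable VGA
state in this model while kdump_jump() does not, which is why a
shutdown-path fix alone would not cover the kdump case.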

Thanks
Baoquan


* Re: mgag200 fails kdump kernel booting
  2019-06-26  8:15 mgag200 fails kdump kernel booting Baoquan He
  2019-06-26  8:29 ` Baoquan He
@ 2019-07-02  2:21 ` Dave Young
  2019-07-02  2:47   ` Baoquan He
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Young @ 2019-07-02  2:21 UTC (permalink / raw)
  To: Baoquan He; +Cc: airlied, kexec, x86, linux-kernel

On 06/26/19 at 04:15pm, Baoquan He wrote:
> Hi Dave,
> 
> We met an kdump kernel boot failure on a lenovo system. Kdump kernel
> failed to boot, but just reset to firmware to reboot system. And nothing
> is printed out.
> 
> The machine is a big server, with 6T memory and many cpu, its graphic
> driver module is mgag200.
> 
> When added 'earlyprintk=ttyS0' into kernel command line, it printed
> out only one line to console during kdump kernel booting:
>      KASLR disabled: 'nokaslr' on cmdline.
> 
> Then reset to firmware to reboot system.
> 
> By further code debugging, the failure happened in
> arch/x86/boot/compressed/misc.c, during kernel decompressing stage. It's
> triggered by the vga printing. As you can see, in __putstr() of
> arch/x86/boot/compressed/misc.c, the code checks if earlyprintk= is
> specified, and print out to the target. And no matter if earlyprintk= is
> added or not, it will print to VGA. And printing to VGA caused it to
> reset to firmware. That's why we see nothing when didn't specify
> earlyprintk=, but see only one line of printing about the 'KASLR
> disabled'.
> 
> To confirm it's caused by VGA printing, I blacklist the mgag200 by
> writting it into /etc/modprobe.d/blacklist.conf. The kdump kernel can
> boot up successfully. And add 'nomodeset' can also make it work. So it's
> for sure mgag driver or related code have something wrong when booting
> code tries to re-init it.
> 
> This is the only case we ever see, tend to pursuit fix in mgag200 driver
> side. Any idea or suggestion? We have two machines to be able to
> reproduce it stablly.

Personally I think early code should not blindly write to VGA; there
are cases where that does not work:
1. EFI-booted machine: there is just no output.
2. kdump kernel boot: writing to VGA leads to an undefined state; for
example, in your case it caused a system reset.

So I suggest writing to VGA only when we see earlyprintk=vga on the
kernel cmdline.
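
A rough userspace sketch of such a check, assuming the option is a
space-separated token whose value is a comma-separated list. The helper
name earlyprintk_wants_vga() is hypothetical, and this is not how the
decompressor parses its command line today; it only illustrates the
gating idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if cmdline contains an earlyprintk= option whose
 * comma-separated value list includes "vga". Hypothetical helper.
 */
bool earlyprintk_wants_vga(const char *cmdline)
{
	static const char key[] = "earlyprintk=";
	const size_t keylen = sizeof(key) - 1;
	const char *p = cmdline;

	assert(cmdline != NULL);

	while ((p = strstr(p, key)) != NULL) {
		/* Accept the key only at the start of a token. */
		bool at_token_start = (p == cmdline) || (p[-1] == ' ');

		p += keylen;
		if (!at_token_start)
			continue;

		/* Walk the comma-separated values up to the next space. */
		while (*p && *p != ' ') {
			size_t len = strcspn(p, ", ");

			if (len == 3 && strncmp(p, "vga", 3) == 0)
				return true;
			p += len;
			if (*p == ',')
				p++;
		}
	}
	return false;
}
```

With this helper, "earlyprintk=ttyS0" would leave VGA untouched and
only "earlyprintk=vga" (alone or in a list) would enable it.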

Thanks
Dave


* Re: mgag200 fails kdump kernel booting
  2019-07-02  2:21 ` Dave Young
@ 2019-07-02  2:47   ` Baoquan He
  0 siblings, 0 replies; 10+ messages in thread
From: Baoquan He @ 2019-07-02  2:47 UTC (permalink / raw)
  To: Dave Young; +Cc: airlied, kexec, x86, linux-kernel

On 07/02/19 at 10:21am, Dave Young wrote:
> On 06/26/19 at 04:15pm, Baoquan He wrote:
> > Hi Dave,
> > 
> > We met an kdump kernel boot failure on a lenovo system. Kdump kernel
> > failed to boot, but just reset to firmware to reboot system. And nothing
> > is printed out.
> > 
> > The machine is a big server, with 6T memory and many cpu, its graphic
> > driver module is mgag200.
> > 
> > When added 'earlyprintk=ttyS0' into kernel command line, it printed
> > out only one line to console during kdump kernel booting:
> >      KASLR disabled: 'nokaslr' on cmdline.
> > 
> > Then reset to firmware to reboot system.
> > 
> > By further code debugging, the failure happened in
> > arch/x86/boot/compressed/misc.c, during kernel decompressing stage. It's
> > triggered by the vga printing. As you can see, in __putstr() of
> > arch/x86/boot/compressed/misc.c, the code checks if earlyprintk= is
> > specified, and print out to the target. And no matter if earlyprintk= is
> > added or not, it will print to VGA. And printing to VGA caused it to
> > reset to firmware. That's why we see nothing when didn't specify
> > earlyprintk=, but see only one line of printing about the 'KASLR
> > disabled'.
> > 
> > To confirm it's caused by VGA printing, I blacklist the mgag200 by
> > writting it into /etc/modprobe.d/blacklist.conf. The kdump kernel can
> > boot up successfully. And add 'nomodeset' can also make it work. So it's
> > for sure mgag driver or related code have something wrong when booting
> > code tries to re-init it.
> > 
> > This is the only case we ever see, tend to pursuit fix in mgag200 driver
> > side. Any idea or suggestion? We have two machines to be able to
> > reproduce it stablly.
> 
> Personally I think early code should not blindly do vga writing, there
> are cases that does not work:
> 1. efi booted machine,  just no output
> 2. kdump kernel booted,  writing to vga caused undefined state, for
> example in your case it caused a system reset.
> 
> So I suggest only write to vga when we see earlyprintk=vga in kernel
> cmdline.

I remember a customer once attached a photo, taken from the monitor, of
a kernel hanging during boot. I had planned to disable VGA output when
it's not specified, but changed my mind because not all machines are
servers without a monitor. Many people still use laptops and PCs; they
have VGA output and possibly no serial console. When a crash happens,
maybe randomly, the VGA output could be the only witness. Of the cases
listed above, case 1 produces no output; it seems EFI needs to be fixed,
but I can't see why it matters here. For case 2, do you have a specific
example other than this one? Printing to VGA has been done for a very
long time; if it really does cause trouble, then we need to mute it
now.

Thanks
Baoquan


* Re: mgag200 fails kdump kernel booting
  2019-07-02  1:41     ` Baoquan He
@ 2019-07-02  3:17       ` Dave Young
  2019-07-02  5:34         ` Baoquan He
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Young @ 2019-07-02  3:17 UTC (permalink / raw)
  To: Baoquan He; +Cc: David Airlie, kexec, x86, linux-kernel, Lyude Paul

On 07/02/19 at 09:41am, Baoquan He wrote:
> On 07/02/19 at 06:51am, David Airlie wrote:
> > On Wed, Jun 26, 2019 at 6:29 PM Baoquan He <bhe@redhat.com> wrote:
> > >
> > > On 06/26/19 at 04:15pm, Baoquan He wrote:
> > > > Hi Dave,
> > > >
> > > > We met an kdump kernel boot failure on a lenovo system. Kdump kernel
> > > > failed to boot, but just reset to firmware to reboot system. And nothing
> > > > is printed out.
> > > >
> > > > The machine is a big server, with 6T memory and many cpu, its graphic
> > > > driver module is mgag200.
> > > >
> > > > When added 'earlyprintk=ttyS0' into kernel command line, it printed
> > > > out only one line to console during kdump kernel booting:
> > > >      KASLR disabled: 'nokaslr' on cmdline.
> > > >
> > > > Then reset to firmware to reboot system.
> > > >
> > > > By further code debugging, the failure happened in
> > > > arch/x86/boot/compressed/misc.c, during kernel decompressing stage. It's
> > > > triggered by the vga printing. As you can see, in __putstr() of
> > > > arch/x86/boot/compressed/misc.c, the code checks if earlyprintk= is
> > > > specified, and print out to the target. And no matter if earlyprintk= is
> > > > added or not, it will print to VGA. And printing to VGA caused it to
> > > > reset to firmware. That's why we see nothing when didn't specify
> > > > earlyprintk=, but see only one line of printing about the 'KASLR
> > > > disabled'.
> > >
> > > Here I mean:
> > > That's why we see nothing when didn't specify earlyprintk=, but see only
> > > one line of printing about the 'KASLR disabled' message when
> > > earlyprintk=ttyS0 added.
> > 
> > Just to clarify, the original kernel is booted with mgag200 turned
> > off, then kexec works, but if the original kernel loads mgag200, the
> > kexec kernels resets hard when the VGA is used to write stuff out.
> 
> Thanks for looking into this, Dave.
> 
> Yeah, in fact the issue was found in kdump kernel. I haven't checked the
> kexec jumping. Kexec jumping will call device_shutdown() to attempt to
> shutdown all devices before jumping to the 2nd kernel. But kdump jumping
> won't.
> 
> > 
> > This *might* be fixable in the controlled kexec case, but having an
> > mgag200 shutdown path that tries to put the gpu back into a state
> > where VGA doesn't die, but for the uncontrolled kexec it'll still be a
> > problem, since once the gpu is up and running and VGA is disabled, it
> > doesn't expect to see anymore VGA transactions.
> 
> Yes, I see. It should have been shutdown by device_shutdown() in kexec
> case. The uncontrolled case, I guess you mean the kdump case. In
> kdump case, we don't call device_shutdown() before jumping because the
> 1st kernel has been in crashed state, we just want to switch to kdump
> kernel asap. So wondering how other GPU/VGA device/driver bebahve,
> currently haven't got report about them. Probably mgag200 is very new,
> or we may not meet them. This issue was met on a new bought server.

I assumed the VGA writing only takes effect when earlyprintk is
provided, e.g. with earlyprintk=ttyS0 the x86 early decompression code
writes to both VGA and ttyS0. So if one does not use earlyprintk, they
still get nothing. But if one provides earlyprintk, then they should
provide the correct param they want, instead of blindly assuming the
kernel will write to VGA even when ttyS0 is specified.

Thanks
Dave


* Re: mgag200 fails kdump kernel booting
  2019-07-02  3:17       ` Dave Young
@ 2019-07-02  5:34         ` Baoquan He
  2019-07-02  7:42           ` Dave Young
  0 siblings, 1 reply; 10+ messages in thread
From: Baoquan He @ 2019-07-02  5:34 UTC (permalink / raw)
  To: Dave Young; +Cc: David Airlie, kexec, x86, linux-kernel, Lyude Paul

On 07/02/19 at 11:17am, Dave Young wrote:
> On 07/02/19 at 09:41am, Baoquan He wrote:
> > On 07/02/19 at 06:51am, David Airlie wrote:
> > > On Wed, Jun 26, 2019 at 6:29 PM Baoquan He <bhe@redhat.com> wrote:
> > > >
> > > > On 06/26/19 at 04:15pm, Baoquan He wrote:
> > > > > Hi Dave,
> > > > >
> > > > > We met an kdump kernel boot failure on a lenovo system. Kdump kernel
> > > > > failed to boot, but just reset to firmware to reboot system. And nothing
> > > > > is printed out.
> > > > >
> > > > > The machine is a big server, with 6T memory and many cpu, its graphic
> > > > > driver module is mgag200.
> > > > >
> > > > > When added 'earlyprintk=ttyS0' into kernel command line, it printed
> > > > > out only one line to console during kdump kernel booting:
> > > > >      KASLR disabled: 'nokaslr' on cmdline.
> > > > >
> > > > > Then reset to firmware to reboot system.
> > > > >
> > > > > By further code debugging, the failure happened in
> > > > > arch/x86/boot/compressed/misc.c, during kernel decompressing stage. It's
> > > > > triggered by the vga printing. As you can see, in __putstr() of
> > > > > arch/x86/boot/compressed/misc.c, the code checks if earlyprintk= is
> > > > > specified, and print out to the target. And no matter if earlyprintk= is
> > > > > added or not, it will print to VGA. And printing to VGA caused it to
> > > > > reset to firmware. That's why we see nothing when didn't specify
> > > > > earlyprintk=, but see only one line of printing about the 'KASLR
> > > > > disabled'.
> > > >
> > > > Here I mean:
> > > > That's why we see nothing when didn't specify earlyprintk=, but see only
> > > > one line of printing about the 'KASLR disabled' message when
> > > > earlyprintk=ttyS0 added.
> > > 
> > > Just to clarify, the original kernel is booted with mgag200 turned
> > > off, then kexec works, but if the original kernel loads mgag200, the
> > > kexec kernels resets hard when the VGA is used to write stuff out.
> > 
> > Thanks for looking into this, Dave.
> > 
> > Yeah, in fact the issue was found in kdump kernel. I haven't checked the
> > kexec jumping. Kexec jumping will call device_shutdown() to attempt to
> > shutdown all devices before jumping to the 2nd kernel. But kdump jumping
> > won't.
> > 
> > > 
> > > This *might* be fixable in the controlled kexec case, but having an
> > > mgag200 shutdown path that tries to put the gpu back into a state
> > > where VGA doesn't die, but for the uncontrolled kexec it'll still be a
> > > problem, since once the gpu is up and running and VGA is disabled, it
> > > doesn't expect to see anymore VGA transactions.
> > 
> > Yes, I see. It should have been shutdown by device_shutdown() in kexec
> > case. The uncontrolled case, I guess you mean the kdump case. In
> > kdump case, we don't call device_shutdown() before jumping because the
> > 1st kernel has been in crashed state, we just want to switch to kdump
> > kernel asap. So wondering how other GPU/VGA device/driver bebahve,
> > currently haven't got report about them. Probably mgag200 is very new,
> > or we may not meet them. This issue was met on a new bought server.
> 
> I assumed the vga writing only take effect when earlyprintk is provided.
> eg. earlyprintk=ttyS0, then x86 early decompress code will write to both
> vga and ttyS0.  So if one does not use earlyprintk, he/she still get
> nothing.  But if one provides earlyprintk, then he/she should provide a
> correct param he want, instead of blindly assume kernel will write to
> vga even if he use ttyS0.

No, the VGA printing always takes effect; otherwise warn() and error()
wouldn't work. It takes effect regardless of whether CONFIG_EARLY_PRINTK
is enabled and whether any earlyprintk= is specified.

That's why I prefer to pursue a fix on the driver side. Having the
error/warn messages printed even when nothing is specified makes sense
to me.




* Re: mgag200 fails kdump kernel booting
  2019-07-02  5:34         ` Baoquan He
@ 2019-07-02  7:42           ` Dave Young
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Young @ 2019-07-02  7:42 UTC (permalink / raw)
  To: Baoquan He; +Cc: David Airlie, kexec, x86, linux-kernel, Lyude Paul

On 07/02/19 at 01:34pm, Baoquan He wrote:
> On 07/02/19 at 11:17am, Dave Young wrote:
> > On 07/02/19 at 09:41am, Baoquan He wrote:
> > > On 07/02/19 at 06:51am, David Airlie wrote:
> > > > On Wed, Jun 26, 2019 at 6:29 PM Baoquan He <bhe@redhat.com> wrote:
> > > > >
> > > > > On 06/26/19 at 04:15pm, Baoquan He wrote:
> > > > > > Hi Dave,
> > > > > >
> > > > > > We met an kdump kernel boot failure on a lenovo system. Kdump kernel
> > > > > > failed to boot, but just reset to firmware to reboot system. And nothing
> > > > > > is printed out.
> > > > > >
> > > > > > The machine is a big server, with 6T memory and many cpu, its graphic
> > > > > > driver module is mgag200.
> > > > > >
> > > > > > When added 'earlyprintk=ttyS0' into kernel command line, it printed
> > > > > > out only one line to console during kdump kernel booting:
> > > > > >      KASLR disabled: 'nokaslr' on cmdline.
> > > > > >
> > > > > > Then reset to firmware to reboot system.
> > > > > >
> > > > > > By further code debugging, the failure happened in
> > > > > > arch/x86/boot/compressed/misc.c, during kernel decompressing stage. It's
> > > > > > triggered by the vga printing. As you can see, in __putstr() of
> > > > > > arch/x86/boot/compressed/misc.c, the code checks if earlyprintk= is
> > > > > > specified, and print out to the target. And no matter if earlyprintk= is
> > > > > > added or not, it will print to VGA. And printing to VGA caused it to
> > > > > > reset to firmware. That's why we see nothing when didn't specify
> > > > > > earlyprintk=, but see only one line of printing about the 'KASLR
> > > > > > disabled'.
> > > > >
> > > > > Here I mean:
> > > > > That's why we see nothing when didn't specify earlyprintk=, but see only
> > > > > one line of printing about the 'KASLR disabled' message when
> > > > > earlyprintk=ttyS0 added.
> > > > 
> > > > Just to clarify, the original kernel is booted with mgag200 turned
> > > > off, then kexec works, but if the original kernel loads mgag200, the
> > > > kexec kernels resets hard when the VGA is used to write stuff out.
> > > 
> > > Thanks for looking into this, Dave.
> > > 
> > > Yeah, in fact the issue was found in kdump kernel. I haven't checked the
> > > kexec jumping. Kexec jumping will call device_shutdown() to attempt to
> > > shutdown all devices before jumping to the 2nd kernel. But kdump jumping
> > > won't.
> > > 
> > > > 
> > > > This *might* be fixable in the controlled kexec case, but having an
> > > > mgag200 shutdown path that tries to put the gpu back into a state
> > > > where VGA doesn't die, but for the uncontrolled kexec it'll still be a
> > > > problem, since once the gpu is up and running and VGA is disabled, it
> > > > doesn't expect to see anymore VGA transactions.
> > > 
> > > Yes, I see. It should have been shutdown by device_shutdown() in kexec
> > > case. The uncontrolled case, I guess you mean the kdump case. In
> > > kdump case, we don't call device_shutdown() before jumping because the
> > > 1st kernel has been in crashed state, we just want to switch to kdump
> > > kernel asap. So wondering how other GPU/VGA device/driver bebahve,
> > > currently haven't got report about them. Probably mgag200 is very new,
> > > or we may not meet them. This issue was met on a new bought server.
> > 
> > I assumed the vga writing only take effect when earlyprintk is provided.
> > eg. earlyprintk=ttyS0, then x86 early decompress code will write to both
> > vga and ttyS0.  So if one does not use earlyprintk, he/she still get
> > nothing.  But if one provides earlyprintk, then he/she should provide a
> > correct param he want, instead of blindly assume kernel will write to
> > vga even if he use ttyS0.
> 
> No, the vga printing takes effect always, otherwise those warn() and
> error() won't work. It takes effect no matter if CONFIG_EARLY_PRINTK
> is enabled, and if any earlyprintk= specified.
> 
> That's why I prefer to pursuit fix in driver side. It's making the
> error/warn print out even though nothing specific needed, that's make
> sense to me.

OK, thanks for the explanation. A driver fix is better.


* Re: mgag200 fails kdump kernel booting
  2019-07-01 20:51   ` David Airlie
  2019-07-02  1:41     ` Baoquan He
@ 2020-02-05  7:31     ` Baoquan He
  1 sibling, 0 replies; 10+ messages in thread
From: Baoquan He @ 2020-02-05  7:31 UTC (permalink / raw)
  To: David Airlie, Lyude Paul; +Cc: kexec, x86, linux-kernel, dyoung

Hi Dave, Lyude,

On 07/02/19 at 06:51am, David Airlie wrote:
> On Wed, Jun 26, 2019 at 6:29 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 06/26/19 at 04:15pm, Baoquan He wrote:
> > > Hi Dave,
> > >
> > > We met an kdump kernel boot failure on a lenovo system. Kdump kernel
> > > failed to boot, but just reset to firmware to reboot system. And nothing
> > > is printed out.
> > >
> > > The machine is a big server, with 6T memory and many cpu, its graphic
> > > driver module is mgag200.
> > >
> > > When added 'earlyprintk=ttyS0' into kernel command line, it printed
> > > out only one line to console during kdump kernel booting:
> > >      KASLR disabled: 'nokaslr' on cmdline.
> > >
> > > Then reset to firmware to reboot system.
> > >
> > > By further code debugging, the failure happened in
> > > arch/x86/boot/compressed/misc.c, during kernel decompressing stage. It's
> > > triggered by the vga printing. As you can see, in __putstr() of
> > > arch/x86/boot/compressed/misc.c, the code checks if earlyprintk= is
> > > specified, and print out to the target. And no matter if earlyprintk= is
> > > added or not, it will print to VGA. And printing to VGA caused it to
> > > reset to firmware. That's why we see nothing when didn't specify
> > > earlyprintk=, but see only one line of printing about the 'KASLR
> > > disabled'.
> >
> > Here I mean:
> > That's why we see nothing when didn't specify earlyprintk=, but see only
> > one line of printing about the 'KASLR disabled' message when
> > earlyprintk=ttyS0 added.
> 
> Just to clarify, the original kernel is booted with mgag200 turned
> off, then kexec works, but if the original kernel loads mgag200, the
> kexec kernels resets hard when the VGA is used to write stuff out.
> 
> This *might* be fixable in the controlled kexec case, but having an
> mgag200 shutdown path that tries to put the gpu back into a state
> where VGA doesn't die, but for the uncontrolled kexec it'll still be a
> problem, since once the gpu is up and running and VGA is disabled, it
> doesn't expect to see anymore VGA transactions.

We have now received two more bug reports on different systems, and
after debugging finally figured out that they are the same issue as
this one. Adding 'nomodeset' works around it.

With help from our QA, I tried to get more systems with mgag200. It
seems not all of them have this issue; some of them can jump to the
kdump kernel fine after a panic.

Any suggestions on how to proceed? I can experiment. Or, if you would
like to have a look when convenient, I can get one system to you to
check. Or can we just use 'nomodeset' as a workaround and put this
issue on hold for the time being?

I'd appreciate any suggestions or ideas.

> 
> Dave.
> >
> > >
> > > To confirm it's caused by VGA printing, I blacklist the mgag200 by
> > > writting it into /etc/modprobe.d/blacklist.conf. The kdump kernel can
> > > boot up successfully. And add 'nomodeset' can also make it work. So it's
> > > for sure mgag driver or related code have something wrong when booting
> > > code tries to re-init it.
> > >
> > > This is the only case we ever see, tend to pursuit fix in mgag200 driver
> > > side. Any idea or suggestion? We have two machines to be able to
> > > reproduce it stablly.
> > >
> > > Thanks
> > > Baoquan


