* Bug: Fatal errors result in infinite stream of error messages
From: rui wang @ 2014-12-03  9:11 UTC
  To: linux-kernel; +Cc: tony.luck, bp, aris, rui.y.wang

Hi all,

When a machine check panics while the kdump service isn't loaded (e.g.
due to insufficient disk space), the console shows an endless stream
of error messages like the one below, and the machine never reboots:

[   82.733050] bad: scheduling from the idle thread!
[   82.738304] CPU: 85 PID: 0 Comm: swapper/85 Tainted: G   M        E  3.18.0-rc4-7-default+ #33
[   82.747921] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0056.R01.1409242327 09/24/2014
[   82.759480]  ffff88047f873280 ffff88047f869878 ffffffff81566db9 0000000000000000
[   82.767788]  ffff88047f873280 ffff88047f869898 ffffffff810871ff ffffffff81dbbe02
[   82.776094]  ffff88047f873280 ffff88047f8698c8 ffffffff8107c1fa 0000000100000000
[   82.784399] Call Trace:
[   82.787129]  <#MC>  [<ffffffff81566db9>] dump_stack+0x46/0x58
[   82.793570]  [<ffffffff810871ff>] dequeue_task_idle+0x2f/0x40
[   82.799987]  [<ffffffff8107c1fa>] dequeue_task+0x5a/0x80
[   82.805921]  [<ffffffff810804f3>] deactivate_task+0x23/0x30
[   82.812145]  [<ffffffff81569050>] __schedule+0x580/0x7f0
[   82.818080]  [<ffffffff81569739>] schedule_preempt_disabled+0x29/0x70
[   82.825268]  [<ffffffff8156ac49>] __ww_mutex_lock_slowpath+0xeb/0x192
[   82.832464]  [<ffffffff8156ad43>] __ww_mutex_lock+0x53/0x85
[   82.838704]  [<ffffffffa00b6a5d>] drm_modeset_lock+0x3d/0x110 [drm]
[   82.845717]  [<ffffffffa00b6c2a>] __drm_modeset_lock_all+0x8a/0x120 [drm]
[   82.853311]  [<ffffffffa00b6cd0>] drm_modeset_lock_all+0x10/0x30 [drm]
[   82.860608]  [<ffffffffa01c68bf>] drm_fb_helper_pan_display+0x2f/0xf0 [drm_kms_helper]
[   82.869453]  [<ffffffff8132bd21>] fb_pan_display+0xd1/0x1a0
[   82.875677]  [<ffffffff81326010>] bit_update_start+0x20/0x50
[   82.881997]  [<ffffffff813259f2>] fbcon_switch+0x3a2/0x550
[   82.888125]  [<ffffffff813a01c9>] redraw_screen+0x189/0x240
[   82.894343]  [<ffffffff81322f8a>] fbcon_blank+0x20a/0x2d0
[   82.900366]  [<ffffffff8137d359>] ? erst_writer+0x209/0x330
[   82.906591]  [<ffffffff810ba2f3>] ? internal_add_timer+0x63/0x80
[   82.913292]  [<ffffffff810bc137>] ? mod_timer+0x127/0x1e0
[   82.919323]  [<ffffffff813a0cd8>] do_unblank_screen+0xa8/0x1d0
[   82.925832]  [<ffffffff813a0e10>] unblank_screen+0x10/0x20
[   82.931951]  [<ffffffff812ca0d9>] bust_spinlocks+0x19/0x40
[   82.938070]  [<ffffffff81561ca7>] panic+0x106/0x1f5
[   82.943520]  [<ffffffff810232d0>] mce_panic+0x210/0x230
[   82.949357]  [<ffffffff812c796a>] ? delay_tsc+0x4a/0x80
[   82.955185]  [<ffffffff81024c55>] do_machine_check+0xa95/0xab0
[   82.961701]  [<ffffffff813365d7>] ? intel_idle+0xc7/0x150
[   82.967722]  [<ffffffff8156f13f>] machine_check+0x1f/0x30
[   82.973751]  [<ffffffff813365d7>] ? intel_idle+0xc7/0x150
[   82.979781]  <<EOE>>  [<ffffffff814283d5>] cpuidle_enter_state+0x55/0x170
[   82.987386]  [<ffffffff814285a7>] cpuidle_enter+0x17/0x20
[   82.993419]  [<ffffffff81097b08>] cpu_startup_entry+0x2d8/0x370
[   83.000031]  [<ffffffff8102fc39>] start_secondary+0x159/0x180

The problem is because kdump fails to load a new kernel, and we're
executing past crash_kexec() in panic(). And it calls
bust_spinlocks(0) which calls into the GPU driver trying to unblank
the screen, which eventually calls __schedule() while waiting for a
mutex to be released. But we're still in the machine check context.
The infinite stream of errors is because there's a for(;;) loop in
__mutex_lock_common(), so we enter __schedule() again and again.
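
The retry behaviour described above can be modelled in userspace. This is only a rough sketch, not the kernel's code: toy_mutex, toy_trylock(), toy_schedule() and toy_mutex_lock_slowpath() are hypothetical stand-ins, and the max_iterations bound exists only so that the demo terminates, whereas the real loop is unbounded. It assumes the lock holder never runs again and that the scheduler call returns immediately instead of sleeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a mutex whose holder will never run again. */
struct toy_mutex {
    bool locked;
};

/* Counts how often the toy scheduler was entered. */
static int schedule_calls;

/*
 * Stand-in for __schedule(): the model assumes it cannot really sleep
 * in this context, so it prints the complaint and returns at once.
 */
static void toy_schedule(void)
{
    schedule_calls++;
    fprintf(stderr, "bad: scheduling from the idle thread!\n");
}

/* Take the lock only if it is free; never waits. */
static bool toy_trylock(struct toy_mutex *m)
{
    if (m->locked)
        return false;
    m->locked = true;
    return true;
}

/*
 * Models the retry loop in the mutex slow path. Every failed attempt
 * enters the scheduler again, which produces one more error message.
 * Returns true if the lock was eventually taken.
 */
static bool toy_mutex_lock_slowpath(struct toy_mutex *m, int max_iterations)
{
    for (int i = 0; i < max_iterations; i++) {
        if (toy_trylock(m))
            return true;
        toy_schedule();
    }
    return false;
}
```

With a lock that is never released, each loop iteration produces exactly one message, which matches the repeating pattern seen on the console.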

In the trace, the offset panic+0x106 shows that this is the second
bust_spinlocks() call inside panic(), the one after crash_kexec()
returns. That code should be unreachable if kdump works correctly.

So the bug is that bust_spinlocks(0) isn't safe to call in panic()
(perhaps it used to be safe?). I changed it to bust_spinlocks(1) and
the problem went away. Now when the machine check panics, the console
shows:

[  171.663161] Kernel panic - not syncing: Fatal Machine check
[  171.723760] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[  171.735125] drm_kms_helper: panic occurred, switching back to text console
[  172.144582] Rebooting in 30 seconds..

The machine then reboots correctly after 30 seconds. But this may not
be the desired fix, because we no longer call the GPU driver to
unblank the screen. Here's the workaround:

diff --git a/kernel/panic.c b/kernel/panic.c
index d09dc5c..380d5c0 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -143,7 +143,7 @@ void panic(const char *fmt, ...)
         */
        crash_kexec(NULL);

-       bust_spinlocks(0);
+       bust_spinlocks(1);

        if (!panic_blink)
                panic_blink = no_blink;


There's probably no easy fix if we're to unblank the screen. Does
anyone have an idea how this should be fixed?

Thanks
Rui

* Re: Bug: Fatal errors result in infinite stream of error messages
From: Borislav Petkov @ 2014-12-03 20:52 UTC
  To: rui wang; +Cc: linux-kernel, tony.luck, aris, rui.y.wang

On Wed, Dec 03, 2014 at 05:11:49PM +0800, rui wang wrote:
> The problem is because kdump fails to load a new kernel, and we're
> executing past crash_kexec() in panic(). And it calls
> bust_spinlocks(0) which calls into the GPU driver trying to unblank
> the screen, which eventually calls __schedule() while waiting for a
> mutex to be released. But we're still in the machine check context.
> The infinite stream of errors is because there's a for(;;) loop in
> __mutex_lock_common(), so we enter __schedule() again and again.

Hmm, there's a bust_spinlocks(1) call in mce_panic() for which I have no
idea what it is for? To stop us from scheduling?

If so, why doesn't it stop us...?

There's also this:

void console_unblank(void)
{
        struct console *c;

        /*
         * console_unblank can no longer be called in interrupt context unless
====>    * oops_in_progress is set to 1..
         */
        if (oops_in_progress) {
                if (down_trylock_console_sem() != 0)
                        return;
        } else
                console_lock();


-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: Bug: Fatal errors result in infinite stream of error messages
From: rui wang @ 2014-12-04 12:51 UTC
  To: Borislav Petkov; +Cc: linux-kernel, tony.luck, aris, rui.y.wang

On 12/4/14, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, Dec 03, 2014 at 05:11:49PM +0800, rui wang wrote:
>> The problem is because kdump fails to load a new kernel, and we're
>> executing past crash_kexec() in panic(). And it calls
>> bust_spinlocks(0) which calls into the GPU driver trying to unblank
>> the screen, which eventually calls __schedule() while waiting for a
>> mutex to be released. But we're still in the machine check context.
>> The infinite stream of errors is because there's a for(;;) loop in
>> __mutex_lock_common(), so we enter __schedule() again and again.
>
> Hmm, there's a bust_spinlocks(1) call in mce_panic() for which I have no
> idea what it is for? To stop us from scheduling?
>
> If so, why doesn't it stop us...?
>
> There's also this:
>
> void console_unblank(void)
> {
>         struct console *c;
>
>         /*
>          * console_unblank can no longer be called in interrupt context unless
> ====>    * oops_in_progress is set to 1..
>          */
>         if (oops_in_progress) {
>                 if (down_trylock_console_sem() != 0)
>                         return;
>         } else
>                 console_lock();
>

That points toward a potential fix. There are places under
drivers/gpu/drm/ that should do this kind of check so that they work
during a panic. An existing example:

void drm_warn_on_modeset_not_all_locked(struct drm_device *dev)
{
        struct drm_crtc *crtc;

        /* Locking is currently fubar in the panic handler. */
        if (oops_in_progress) <======
                return;

        list_for_each_entry(crtc, &dev->mode_config.crtc_list, head)
                WARN_ON(!drm_modeset_is_locked(&crtc->mutex));

        WARN_ON(!drm_modeset_is_locked(&dev->mode_config.connection_mutex));
        WARN_ON(!mutex_is_locked(&dev->mode_config.mutex));
}

I'll send a patch later.
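
The kind of check meant here could look roughly like the following userspace model. It is only a sketch of the idea and not the promised patch: toy_oops_in_progress, struct toy_lock and toy_lock_for_panic() are hypothetical names, and the model assumes the only choices on a contended lock are to back off (panic path) or to sleep (normal path), as console_unblank() does with its trylock.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's oops_in_progress flag (hypothetical copy). */
static int toy_oops_in_progress;

/* Hypothetical lock type; a real driver would use a kernel mutex. */
struct toy_lock {
    bool held;
};

enum toy_lock_result {
    TOY_LOCKED,      /* lock acquired */
    TOY_SKIPPED,     /* panic path: lock busy, caller must back off */
    TOY_WOULD_SLEEP  /* normal path: a real mutex would sleep here */
};

/*
 * Try to take the lock. If it is contended, only the normal path may
 * wait for it; during a panic the caller is told to skip the work
 * instead of entering the scheduler.
 */
static enum toy_lock_result toy_lock_for_panic(struct toy_lock *l)
{
    if (!l->held) {
        l->held = true;
        return TOY_LOCKED;
    }
    return toy_oops_in_progress ? TOY_SKIPPED : TOY_WOULD_SLEEP;
}
```

The trade-off is that on a contended lock the screen may stay blanked during a panic, but the panic path never waits on a mutex.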

Thanks
Rui
