linux-hyperv.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596!
@ 2020-12-22  9:13 Dexuan Cui
  2020-12-22 13:39 ` Valentin Schneider
  0 siblings, 1 reply; 3+ messages in thread
From: Dexuan Cui @ 2020-12-22  9:13 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, x86, linux-pm
  Cc: linux-kernel, linux-hyperv, Michael Kelley

Hi,
I'm running a Linux VM with the recent mainline (48342fc07272, 12/20/2020) on Hyper-V.
When I test hibernation, the VM can easily hit the below BUG_ON during the resume
procedure (I estimate this can repro about 1/5 of the time). BTW, my VM has 40 vCPUs.

I can't repro the BUG_ON with v5.9.0, so I suspect something in v5.10.0 may be broken?

In v5.10.0, when the BUG_ON happens, rq->nr_running==2, and rq->nr_pinned==0:

7587 int sched_cpu_dying(unsigned int cpu)
7588 {
7589         struct rq *rq = cpu_rq(cpu);
7590         struct rq_flags rf;
7591
7592         /* Handle pending wakeups and then migrate everything off */
7593         sched_tick_stop(cpu);
7594
7595         rq_lock_irqsave(rq, &rf);
7596         BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
7597         rq_unlock_irqrestore(rq, &rf);
7598
7599         calc_load_migrate(rq);
7600         update_max_interval();
7601         nohz_balance_exit_idle(rq);
7602         hrtick_clear(rq);
7603         return 0;
7604 }

The last commit that touches the BUG_ON line is the commit
3015ef4b98f5 ("sched/core: Make migrate disable and CPU hotplug cooperative")
but the commit looks good to me.

Any idea?

Thanks,
-- Dexuan

[    5.383378] PM: Image signature found, resuming
[    5.388027] PM: hibernation: resume from hibernation
[    5.395794] Freezing user space processes ... (elapsed 0.001 seconds) done.
[    5.397058] OOM killer disabled.
[    5.397059] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[    5.413013] PM: hibernation: Marking nosave pages: [mem 0x00000000-0x00000fff]
[    5.419331] PM: hibernation: Marking nosave pages: [mem 0x0009f000-0x000fffff]
[    5.425502] PM: hibernation: Marking nosave pages: [mem 0xf7ff0000-0xffffffff]
[    5.431706] PM: hibernation: Basic memory bitmaps created
[    5.465205] PM: Using 3 thread(s) for decompression
[    5.469590] PM: Loading and decompressing image data (505553 pages)...
[    5.492790] PM: Image loading progress:   0%
[    6.532641] PM: Image loading progress:  10%
[    6.899683] PM: Image loading progress:  20%
[    7.251672] PM: Image loading progress:  30%
[    7.598808] PM: Image loading progress:  40%
[    8.056153] PM: Image loading progress:  50%
[    8.534077] PM: Image loading progress:  60%
[    9.029886] PM: Image loading progress:  70%
[    9.542015] PM: Image loading progress:  80%
[   10.025326] PM: Image loading progress:  90%
[   10.525804] PM: Image loading progress: 100%
[   10.530241] PM: Image loading done
[   10.533295] PM: hibernation: Read 2022212 kbytes in 5.05 seconds (400.43 MB/s)
[   10.540827] PM: Image successfully loaded
[   10.599243] serial 00:04: disabled
[   10.619935] vmbus 242ff919-07db-4180-9c2e-b86cb68c8c55: parent VMBUS:01 should not be sleeping
[   10.646542] Disabling non-boot CPUs ...
[   10.694380] smpboot: CPU 1 is now offline
[   10.729044] smpboot: CPU 2 is now offline
[   10.756819] smpboot: CPU 3 is now offline
[   10.776784] smpboot: CPU 4 is now offline
[   10.800866] smpboot: CPU 5 is now offline
[   10.828923] smpboot: CPU 6 is now offline
[   10.849013] smpboot: CPU 7 is now offline
[   10.876722] smpboot: CPU 8 is now offline
[   10.909426] smpboot: CPU 9 is now offline
[   10.929360] smpboot: CPU 10 is now offline
[   10.957059] smpboot: CPU 11 is now offline
[   10.976542] smpboot: CPU 12 is now offline
[   11.008470] smpboot: CPU 13 is now offline
[   11.036356] smpboot: CPU 14 is now offline
[   11.072396] smpboot: CPU 15 is now offline
[   11.100229] smpboot: CPU 16 is now offline
[   11.128638] smpboot: CPU 17 is now offline
[   11.148479] smpboot: CPU 18 is now offline
[   11.173537] smpboot: CPU 19 is now offline
[   11.205680] smpboot: CPU 20 is now offline
[   11.231799] smpboot: CPU 21 is now offline
[   11.259562] smpboot: CPU 22 is now offline
[   11.288672] smpboot: CPU 23 is now offline
[   11.319691] smpboot: CPU 24 is now offline
[   11.355523] smpboot: CPU 25 is now offline
[   11.399469] smpboot: CPU 26 is now offline
[   11.435438] smpboot: CPU 27 is now offline
[   11.471402] smpboot: CPU 28 is now offline
[   11.515488] smpboot: CPU 29 is now offline
[   11.551355] smpboot: CPU 30 is now offline
[   11.591326] smpboot: CPU 31 is now offline
[   11.624004] smpboot: CPU 32 is now offline
[   11.659326] smpboot: CPU 33 is now offline
[   11.687478] smpboot: CPU 34 is now offline
[   11.719243] smpboot: CPU 35 is now offline
[   11.747252] smpboot: CPU 36 is now offline
[   11.771224] smpboot: CPU 37 is now offline
[   11.795089] ------------[ cut here ]------------
[   11.798052] kernel BUG at kernel/sched/core.c:7596!
[   11.798052] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[   11.798052] CPU: 38 PID: 202 Comm: migration/38 Tainted: G            E     5.10.0+ #6
[   11.798052] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008  12/07/2018
[   11.798052] Stopper: multi_cpu_stop+0x0/0xf0 <- 0x0
[   11.798052] RIP: 0010:sched_cpu_dying+0xa3/0xc0
[   11.798052] Code: 73 85 05 00 84 c0 75 12 48 83 c4 08 31 c0 5b c3 f0 48 01 05 af f4 7e ...
[   11.798052] RSP: 0000:ffffbb3c820bfde0 EFLAGS: 00010002
[   11.798052] RAX: 0000000000000082 RBX: ffff94f83acaac40 RCX: 0000000000000000
[   11.798052] RDX: 0000000000000001 RSI: 000000000000005a RDI: 0000000000000001
[   11.798052] RBP: 0000000000000026 R08: 0000000000000000 R09: 0000000000000000
[   11.798052] R10: 000000000000000b R11: ffffffff89cbed88 R12: ffffffff88aa7ed0
[   11.798052] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000003
[   11.798052] FS:  0000000000000000(0000) GS:ffff94f83ac80000(0000) knlGS:0000000000000000
[   11.798052] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.798052] CR2: 0000000000000000 CR3: 00000001144fa002 CR4: 00000000003706e0
[   11.798052] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   11.798052] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   11.798052] Call Trace:
[   11.798052]  ? rcutree_dying_cpu+0x12/0x30
[   11.798052]  cpuhp_invoke_callback+0x82/0x4a0
[   11.798052]  take_cpu_down+0x67/0xa0
[   11.798052]  multi_cpu_stop+0x64/0xf0
[   11.798052]  ? stop_machine_yield+0x10/0x10
[   11.798052]  cpu_stopper_thread+0x95/0x130
[   11.798052]  ? sort_range+0x20/0x20
[   11.798052]  smpboot_thread_fn+0x198/0x230
[   11.798052]  kthread+0x13d/0x160
[   11.798052]  ? kthread_create_on_node+0x60/0x60
[   11.798052]  ret_from_fork+0x22/0x30
[   11.798052] Modules linked in: hv_utils(E) cn(E) hid_generic(E) ...
[   11.798052] ---[ end trace 9f1a31ebcf9c45a1 ]---
[   11.798052] RIP: 0010:sched_cpu_dying+0xa3/0xc0
[   11.798052] Code: 73 85 05 00 84 c0 75 12 48 83 c4 08 31 c0 5b c3 f0 48 01 05 af f4 7e ...
[   11.798052] RSP: 0000:ffffbb3c820bfde0 EFLAGS: 00010002
[   11.798052] RAX: 0000000000000082 RBX: ffff94f83acaac40 RCX: 0000000000000000
[   11.798052] RDX: 0000000000000001 RSI: 000000000000005a RDI: 0000000000000001
[   11.798052] RBP: 0000000000000026 R08: 0000000000000000 R09: 0000000000000000
[   11.798052] R10: 000000000000000b R11: ffffffff89cbed88 R12: ffffffff88aa7ed0
[   11.798052] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000003
[   11.798052] FS:  0000000000000000(0000) GS:ffff94f83ac80000(0000) knlGS:0000000000000000
[   11.798052] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.798052] CR2: 0000000000000000 CR3: 00000001144fa002 CR4: 00000000003706e0
[   11.798052] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   11.798052] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   11.798052] note: migration/38[202] exited with preempt_count 2


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596!
  2020-12-22  9:13 v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596! Dexuan Cui
@ 2020-12-22 13:39 ` Valentin Schneider
  2020-12-22 21:44   ` Dexuan Cui
  0 siblings, 1 reply; 3+ messages in thread
From: Valentin Schneider @ 2020-12-22 13:39 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, x86, linux-pm, linux-kernel,
	linux-hyperv, Michael Kelley


Hi,

On 22/12/20 09:13, Dexuan Cui wrote:
> Hi,
> I'm running a Linux VM with the recent mainline (48342fc07272, 12/20/2020) on Hyper-V.
> When I test hibernation, the VM can easily hit the below BUG_ON during the resume
> procedure (I estimate this can repro about 1/5 of the time). BTW, my VM has 40 vCPUs.
>
> I can't repro the BUG_ON with v5.9.0, so I suspect something in v5.10.0 may be broken?
>
> In v5.10.0, when the BUG_ON happens, rq->nr_running==2, and rq->nr_pinned==0:
>
> 7587 int sched_cpu_dying(unsigned int cpu)
> 7588 {
> 7589         struct rq *rq = cpu_rq(cpu);
> 7590         struct rq_flags rf;
> 7591
> 7592         /* Handle pending wakeups and then migrate everything off */
> 7593         sched_tick_stop(cpu);
> 7594
> 7595         rq_lock_irqsave(rq, &rf);
> 7596         BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
> 7597         rq_unlock_irqrestore(rq, &rf);
> 7598
> 7599         calc_load_migrate(rq);
> 7600         update_max_interval();
> 7601         nohz_balance_exit_idle(rq);
> 7602         hrtick_clear(rq);
> 7603         return 0;
> 7604 }
>
> The last commit that touches the BUG_ON line is the commit
> 3015ef4b98f5 ("sched/core: Make migrate disable and CPU hotplug cooperative")
> but the commit looks good to me.
>
> Any idea?
>

I'd wager this extra task is a kworker; could you give this series a try?

  https://lore.kernel.org/r/20201218170919.2950-1-jiangshanlai@gmail.com

> Thanks,
> -- Dexuan

^ permalink raw reply	[flat|nested] 3+ messages in thread

* RE: v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596!
  2020-12-22 13:39 ` Valentin Schneider
@ 2020-12-22 21:44   ` Dexuan Cui
  0 siblings, 0 replies; 3+ messages in thread
From: Dexuan Cui @ 2020-12-22 21:44 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, x86, linux-pm, linux-kernel,
	linux-hyperv, Michael Kelley

> From: Valentin Schneider <valentin.schneider@arm.com>
> Sent: Tuesday, December 22, 2020 5:40 AM
> To: Dexuan Cui <decui@microsoft.com>
> Cc: mingo@redhat.com; peterz@infradead.org; juri.lelli@redhat.com;
> vincent.guittot@linaro.org; dietmar.eggemann@arm.com;
> rostedt@goodmis.org; bsegall@google.com; mgorman@suse.de;
> bristot@redhat.com; x86@kernel.org; linux-pm@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-hyperv@vger.kernel.org; Michael Kelley
> <mikelley@microsoft.com>
> Subject: Re: v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel
> BUG at kernel/sched/core.c:7596!
> 
> 
> Hi,
> 
> On 22/12/20 09:13, Dexuan Cui wrote:
> > Hi,
> > I'm running a Linux VM with the recent mainline (48342fc07272, 12/20/2020)
> on Hyper-V.
> > When I test hibernation, the VM can easily hit the below BUG_ON during the
> resume
> > procedure (I estimate this can repro about 1/5 of the time). BTW, my VM has
> 40 vCPUs.
> >
> > I can't repro the BUG_ON with v5.9.0, so I suspect something in v5.10.0 may
> be broken?
> >
> > In v5.10.0, when the BUG_ON happens, rq->nr_running==2, and
> rq->nr_pinned==0:
> >
> > 7587 int sched_cpu_dying(unsigned int cpu)
> > 7588 {
> > 7589         struct rq *rq = cpu_rq(cpu);
> > 7590         struct rq_flags rf;
> > 7591
> > 7592         /* Handle pending wakeups and then migrate everything off
> */
> > 7593         sched_tick_stop(cpu);
> > 7594
> > 7595         rq_lock_irqsave(rq, &rf);
> > 7596         BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
> > 7597         rq_unlock_irqrestore(rq, &rf);
> > 7598
> > 7599         calc_load_migrate(rq);
> > 7600         update_max_interval();
> > 7601         nohz_balance_exit_idle(rq);
> > 7602         hrtick_clear(rq);
> > 7603         return 0;
> > 7604 }
> >
> > The last commit that touches the BUG_ON line is the commit
> > 3015ef4b98f5 ("sched/core: Make migrate disable and CPU hotplug
> cooperative")
> > but the commit looks good to me.
> >
> > Any idea?
> >
> 
> I'd wager this extra task is a kworker; could you give this series a try?
> 
> 
> https ://lore.kernel.org/lkml/20201218170919.2950-1-jiangshanlai@gmail.com/

Thanks, Valentin! It looks like the patchset can fix the BUG_ON, though I see
a warning, which I reported here: https://lkml.org/lkml/2020/12/22/648

Thanks,
-- Dexuan

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-12-22 21:45 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-22  9:13 v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596! Dexuan Cui
2020-12-22 13:39 ` Valentin Schneider
2020-12-22 21:44   ` Dexuan Cui

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).