linux-kernel.vger.kernel.org archive mirror
* Crash in fair scheduler
@ 2019-12-03  9:11 Schmid, Carsten
  2019-12-03 10:30 ` Valentin Schneider
  2019-12-03 10:30 ` Peter Zijlstra
  0 siblings, 2 replies; 12+ messages in thread
From: Schmid, Carsten @ 2019-12-03  9:11 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1005 bytes --]

Hi maintainers of the fair scheduler,

we had a crash in the fair scheduler and analysis shows that this could happen again.
Happened on 4.14.86 (LTS series) but failing code path still exists in 5.4-rc2 (and 4.14.147 too).

I have collected the relevant data from the core dump and put it together with the source.
So I was able to explain why the crash happened, but as I am not an expert on this
scheduler stuff, I can't tell whether any of the data fed into the code path is OK or not.

In the appended analysis you can find the details; please check whether we ran into a corner
case that nobody expected yet or whether the data is somehow corrupted (I really can't tell; to me it looks fine).
Can you please check if this is a bug in the scheduler?

To give some "surrounding" info:
It looks like a task has slept for a long time, even through some suspend/resume cycles.
When this task had to wake up, the failure occurred.

If you need more data from the core dump, let me know.

Best regards
Carsten

[-- Attachment #2: fair_scheduler_crash.txt --]
[-- Type: text/plain, Size: 11368 bytes --]

On kernel 4.14.86 we had a kernel crash in the fair scheduler.
Kernel 4.14.86 is from the 4.14 LTS series, and current 4.14 has no changes in the affected code path.
Additionally we compared the v5.4-rc2 source; there are changes, but they would not affect the result
and would lead to the same crash, as a similar execution path would be taken.
So we concentrate on the 4.14.86 source here.

The callstack of the crash:
[28439.656848] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[28439.665662] IP: set_next_entity+0x17/0x490
[28439.670244] PGD 0 P4D 0 
[28439.673096] Oops: 0000 [#1] PREEMPT SMP NOPTI
[28439.677972] Modules linked in: bcmdhd(O) sr_mod cdrom tntfs(PO) texfat(PO) usb_storage configfs squashfs zlib_inflate xz_dec ah4 xfrm4_mode_transport ebt_ip ebtable_filter ebtables xfrm_user xfrm_algo cls_u32 sch_htb intel_tfm_governor ecryptfs cfq_iosched sd_mod snd_soc_apl_mgu_hu intel_ipu4_psys intel_ipu4_psys_csslib intel_xhci_usb_role_switch dwc3 roles udc_core adv728x snd_soc_skl sdw_cnl snd_soc_acpi_intel_match coretemp snd_soc_acpi sbi_apl intel_ipu4_isys snd_soc_core videobuf2_dma_contig snd_compress videobuf2_memops ipu4_acpi i2c_i801 intel_ipu4_isys_csslib videobuf2_v4l2 snd_soc_skl_ipc videobuf2_core ahci libahci sdw_bus xhci_pci libata xhci_hcd crc8 snd_soc_sst_ipc cfg80211 snd_soc_sst_dsp scsi_mod usbcore snd_hda_ext_core rfkill usb_common dwc3_pci snd_hda_core intel_ipu4_mmu snd_pcm
[28439.757185]  mei_me snd_timer intel_ipu4 mei snd iova soundcore nfsd auth_rpcgss lockd grace sunrpc loop fuse 8021q bridge stp llc inap560t(O) i915 video backlight intel_gtt i2c_algo_bit drm_kms_helper igb_avb(O) drm firmware_class ptp hwmon spi_pxa2xx_platform pps_core [last unloaded: bcmdhd]
[28439.786334] CPU: 3 PID: 0 Comm: swapper/3 Tainted: P     U     O    4.14.86-apl #1
[28439.794807] task: ffff99a97895a580 task.stack: ffffb21000078000
[28439.801430] RIP: 0010:set_next_entity+0x17/0x490
[28439.806597] RSP: 0018:ffffb2100007bdb0 EFLAGS: 00010082
[28439.812444] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[28439.820428] RDX: 00000000000003e1 RSI: 0000000000000000 RDI: ffff99a96dda9800
[28439.828408] RBP: ffffb2100007bdf0 R08: 0000000000000000 R09: 0000000000000000
[28439.836395] R10: 0000000000000000 R11: 0000042df10058da R12: ffff99a97895a580
[28439.844383] R13: ffff99a96dda9800 R14: 0000000000000000 R15: ffff99a96dda9800
[28439.852371] FS:  0000000000000000(0000) GS:ffff99a97fd80000(0000) knlGS:0000000000000000
[28439.861429] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[28439.867864] CR2: 0000000000000038 CR3: 00000001f7a64000 CR4: 00000000003406a0
[28439.875852] Call Trace:
[28439.878590]  pick_next_task_fair+0x517/0x560
[28439.883369]  __schedule+0xfa/0x7b0
[28439.887172]  schedule_idle+0x1d/0x40
[28439.891170]  do_idle+0x134/0x1a0
[28439.894779]  cpu_startup_entry+0x6e/0x70
[28439.899168]  start_secondary+0x1a7/0x1f0
[28439.903555]  secondary_startup_64+0xa5/0xb0
[28439.908233] Code: 70 44 29 f6 41 89 d5 e8 58 f9 ff ff e9 11 ff ff ff 0f 1f 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd 41 54 53 48 89 f3 48 83 ec 18 <8b> 4e 38 85 c9 75 3d 4d 8b a5 d0 00 00 00 41 83 bc 24 f8 08 00 
[28439.929423] RIP: set_next_entity+0x17/0x490 RSP: ffffb2100007bdb0
[28439.936238] CR2: 0000000000000038

A dump of the cfs_rq used here (a bit compressed to ease reading):
crash> * cfs_rq ffff99a96dda9800
struct cfs_rq {
  load = {  weight = 1048576,  inv_weight = 0  }, 
  nr_running = 1, 
  h_nr_running = 1, 
  exec_clock = 0, 
  min_vruntime = 190894920101, 
  tasks_timeline = {  rb_root = {    rb_node = 0xffff99a9502e0d10  },   rb_leftmost = 0x0  }, 
  curr = 0x0, 
  next = 0x0, 
  last = 0x0, 
  skip = 0x0, 
  nr_spread_over = 0, 
  avg = { last_update_time = 28439656814592,  load_sum = 7948862,  util_sum = 1838088,
          period_contrib = 278,  load_avg = 167,  util_avg = 37  }, 
  runnable_load_sum = 5920382, 
  runnable_load_avg = 124, 
  tg_load_avg_contrib = 167, 
  propagate_avg = 0, 
  removed_load_avg = { counter = 0 }, 
  removed_util_avg = { counter = 0 }, 
  h_load = 26, 
  last_h_load_update = 4323103347, 
  h_load_next = 0x0, 
  rq = 0xffff99a97fd9f500, 
  on_list = 1, 
  leaf_cfs_rq_list = { next = 0xffff99a9782ad4e0, prev = 0xffff99a9779bece0 }, 
  tg = 0xffff99a977b35d40, 
  runtime_enabled = 0, 
  runtime_expires = 0, 
  runtime_remaining = 0, 
  throttled_clock = 0, 
  throttled_clock_task = 0, 
  throttled_clock_task_time = 0, 
  throttled = 0, 
  throttle_count = 0, 
  throttled_list = { next = 0xffff99a96dda9930, prev = 0xffff99a96dda9930 }
}

and the complete stack frame of the crash (with some comments added; the crash-related top of stack removed):
crash> bt -FF
PID: 0      TASK: ffff99a97895a580  CPU: 3   COMMAND: "swapper/3"
 #0 [ffffb2100007ba20] machine_kexec at ffffffffaa035a5f
--- snip ---
 #9 [ffffb2100007bd00] page_fault at ffffffffaa601215
    [exception RIP: set_next_entity+23]
    RIP: ffffffffaa0ad2e7  RSP: ffffb2100007bdb0  RFLAGS: 00010082
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: 00000000000003e1  RSI: 0000000000000000  RDI: ffff99a96dda9800
    RBP: ffffb2100007bdf0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000042df10058da  R12: ffff99a97895a580
    R13: ffff99a96dda9800  R14: 0000000000000000  R15: ffff99a96dda9800
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    ffffb2100007bd08: [ffff99a96dda9800:kmalloc-512] 0000000000000000 
    ffffb2100007bd18: [ffff99a96dda9800:kmalloc-512] [ffff99a97895a580:task_struct] 
    ffffb2100007bd28: ffffb2100007bdf0 0000000000000000 
    ffffb2100007bd38: 0000042df10058da 0000000000000000 
    ffffb2100007bd48: 0000000000000000 0000000000000000 
    ffffb2100007bd58: 0000000000000000 0000000000000000 
    ffffb2100007bd68: 00000000000003e1 0000000000000000 
    ffffb2100007bd78: [ffff99a96dda9800:kmalloc-512] ffffffffffffffff 
    ffffb2100007bd88: set_next_entity+23 0000000000000010 
    ffffb2100007bd98: 0000000000010082 ffffb2100007bdb0 
    ffffb2100007bda8: 0000000000000018 0000000000000002 
    ffffb2100007bdb8: 0000000000000000 0000000000000000 
    ffffb2100007bdc8: ffff99a97fd9f500 [ffff99a97895a580:task_struct] 
    ffffb2100007bdd8: [ffff99a97895aab8:task_struct] 0000000000000000 
    ffffb2100007bde8: [ffff99a96dda9800:kmalloc-512] ffffb2100007be60:rbp 
    ffffb2100007bdf8: pick_next_task_fair+1303 
    
#10 [ffffb2100007bdf8] pick_next_task_fair at ffffffffaa0b5457
    ffffb2100007be00: ffffb2100007be10 
                      0000000000000018 
    ffffb2100007be10: ffff99a97fd9f560 QWORD PTR gs:0x28
                      ffffb2100007be80 
    ffffb2100007be20: [ffff99a97895a580:task_struct] saved param *prev
                      ffffb2100007be48
    ffffb2100007be30: f048ce7809464d00 ffff99a97fd9f500=rbx 
    ffffb2100007be40: [ffff99a97895a580:task_struct]=r12 [ffff99a97895aab8:task_struct]=r13 
    ffffb2100007be50: 0000000000000000=r14  0000000000000000=r15  
    ffffb2100007be60: ffffb2100007bec0=rbp  __schedule+250   
    

The failing piece of code is in kernel/sched/fair.c (line 3873 in kernel 4.14.86; line 4153 in v5.4-rc2); it crashes because se is NULL:
static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* 'current' is not kept within the tree. */
	if (se->on_rq) { <<<<<<< crash here

set_next_entity is called from within pick_next_task_fair, from the following piece of code:
static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	struct cfs_rq *cfs_rq = &rq->cfs;
	struct sched_entity *se;
	struct task_struct *p;
	int new_tasks;

again:
	if (!cfs_rq->nr_running) // this is 1, so we are not going to idle
		goto idle;

#ifdef CONFIG_FAIR_GROUP_SCHED
	if (prev->sched_class != &fair_sched_class) <<<<< this is true:
							crash> p &fair_sched_class
							$1 = (const struct sched_class *) 0xffffffffaaa10cc0 <<<<
							crash> $prev=ffff99a97895a580
							crash> gdb set $prev=(struct task_struct *)0xffff99a97895a580
							crash> p $prev->sched_class
							$2 = (const struct sched_class *) 0xffffffffaaa10b40 <<<<
		goto simple; <<<< so we go to simple
....
(Line 6360, Kernel 4.14.86; Line 6820 Kernel v5.4-rc2)
simple:
#endif

	put_prev_task(rq, prev);

	do {
		se = pick_next_entity(cfs_rq, NULL); <<<< this returns se=NULL
		set_next_entity(cfs_rq, se); <<<<<<<< here we crash
		cfs_rq = group_cfs_rq(se);
	} while (cfs_rq);

So why is se = NULL returned?


Looking at pick_next_entity and cfs_rq fed into it together with curr = NULL:
(Line 3915, Kernel 4.14.86; Line 4196 Kernel v5.4-rc2)
/*
 * Pick the next process, keeping these things in mind, in this order:
 * 1) keep things fair between processes/task groups
 * 2) pick the "next" process, since someone really wants that to run
 * 3) pick the "last" process, for cache locality
 * 4) do not run the "skip" process, if something else is available
 */
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	struct sched_entity *left = __pick_first_entity(cfs_rq); <<<< returns NULL
	struct sched_entity *se;

&cfs_rq->tasks_timeline->rb_leftmost
  tasks_timeline = {
    rb_root = {
      rb_node = 0xffff99a9502e0d10
    }, 
    rb_leftmost = 0x0
  }, 
include/linux/rbtree.h:91:#define rb_first_cached(root) (root)->rb_leftmost
struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
	struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

	if (!left)
		return NULL; <<<<<<<<<< the case

	return rb_entry(left, struct sched_entity, run_node);
}
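
For contrast, the uncached rb_first() does not use the cached pointer at all;
it walks rb_left down from the root (roughly the lib/rbtree.c helper, quoted
from memory here). With rb_node = 0xffff99a9502e0d10 being non-NULL it would
have returned a node:

struct rb_node *rb_first(const struct rb_root *root)
{
	struct rb_node *n;

	n = root->rb_node;
	if (!n)
		return NULL;
	while (n->rb_left)
		n = n->rb_left;	/* descend to the leftmost node */
	return n;
}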

Back in pick_next_entity, we now have
curr = NULL
left = NULL

	/*
	 * If curr is set we have to see if its left of the leftmost entity
	 * still in the tree, provided there was anything in the tree at all.
	 */
	if (!left <<< this is the case here >>>> || (curr && entity_before(curr, left)))
		left = curr; <<<< so we still get left = curr = NULL

	se = left; /* ideally we run the leftmost entity */ <<<< but we now have se = NULL ????

	/*
	 * Avoid running the skip buddy, if running something else can
	 * be done without getting too unfair.
	 */
	if (cfs_rq->skip == se) { <<<< that's equal here, se = NULL = 0x0 = cfs_rq->skip
		struct sched_entity *second;

		if (se == curr) { <<<< and this too, NULL = NULL
			second = __pick_first_entity(cfs_rq); <<<< so we get again NULL like we did for *left
		} else {
			second = __pick_next_entity(se);
			if (!second || (curr && entity_before(curr, second)))
				second = curr;
		}

		if (second && wakeup_preempt_entity(second, left) < 1) <<<< as second is also NULL nothing is done here, se unchanged
			se = second;
	}

	/*
	 * Prefer last buddy, try to return the CPU to a preempted task.
	 */
	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) <<<< cfs_rq->last = 0x0, not true, se unchanged
		se = cfs_rq->last;

	/*
	 * Someone really wants this to run. If it's not unfair, run it.
	 */
	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) <<<< cfs_rq->next = 0x0, not true, se unchanged
		se = cfs_rq->next;

	<<<< we reach this point with se = NULL >>>>
	clear_buddies(cfs_rq, se); <<<< this doesn't change anything in se

	return se; <<<< so we get se=NULL as a result.
}

Is this a corner case nobody thought of or do we have cfs_rq data that is unexpected in its content?





* Re: Crash in fair scheduler
  2019-12-03  9:11 Crash in fair scheduler Schmid, Carsten
@ 2019-12-03 10:30 ` Valentin Schneider
  2019-12-03 10:40   ` Dietmar Eggemann
  2019-12-03 10:30 ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Valentin Schneider @ 2019-12-03 10:30 UTC (permalink / raw)
  To: Schmid, Carsten, mingo, peterz, linux-kernel

On 03/12/2019 09:11, Schmid, Carsten wrote:
[...]

> set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> 	/* 'current' is not kept within the tree. */
> 	if (se->on_rq) { <<<<<<< crash here
> 
> set_next_entity is called from within pick_next_task_fair, from the following piece of code:
> static struct task_struct *
> pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> {
> 	struct cfs_rq *cfs_rq = &rq->cfs;
> 	struct sched_entity *se;
> 	struct task_struct *p;
> 	int new_tasks;
> 
> again:
> 	if (!cfs_rq->nr_running) // this is 1, so we are not going to idle
> 		goto idle;
> 
> #ifdef CONFIG_FAIR_GROUP_SCHED
> 	if (prev->sched_class != &fair_sched_class) <<<<< this is true:
> 							crash> p &fair_sched_class
> 							$1 = (const struct sched_class *) 0xffffffffaaa10cc0 <<<<
> 							crash> $prev=ffff99a97895a580
> 							crash> gdb set $prev=(struct task_struct *)0xffff99a97895a580
> 							crash> p $prev->sched_class
> 							$2 = (const struct sched_class *) 0xffffffffaaa10b40 <<<<
> 		goto simple; <<<< so we go to simple
> ....
> (Line 6360, Kernel 4.14.86; Line 6820 Kernel v5.4-rc2)
> simple:
> #endif
> 
> 	put_prev_task(rq, prev);
> 
> 	do {
> 		se = pick_next_entity(cfs_rq, NULL); <<<< this returns se=NULL
> 		set_next_entity(cfs_rq, se); <<<<<<<< here we crash
> 		cfs_rq = group_cfs_rq(se);
> 	} while (cfs_rq);
> 
> So why is se = NULL returned?


That looks a lot like a recent issue we've had, see

  https://lore.kernel.org/lkml/20191108131909.428842459@infradead.org/

The issue is caused by
  
  67692435c411 ("sched: Rework pick_next_task() slow-path")

which 5.4-rc2 has (without the fix which landed in -rc7) but 4.14 really
shouldn't, unless the kernel you're using has had core scheduling somehow
backported to it?

I've only scraped the surface but I'd like to first ask: can you reproduce
the issue on v5.4 final ?

> Best regards
> Carsten
> 


* Re: Crash in fair scheduler
  2019-12-03  9:11 Crash in fair scheduler Schmid, Carsten
  2019-12-03 10:30 ` Valentin Schneider
@ 2019-12-03 10:30 ` Peter Zijlstra
  2019-12-03 10:51   ` AW: " Schmid, Carsten
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2019-12-03 10:30 UTC (permalink / raw)
  To: Schmid, Carsten; +Cc: mingo, linux-kernel

On Tue, Dec 03, 2019 at 09:11:14AM +0000, Schmid, Carsten wrote:
> Hi maintainers of the fair scheduler,
> 
> we had a crash in the fair scheduler and analysis shows that this could happen again.
> Happened on 4.14.86 (LTS series) but failing code path still exists in 5.4-rc2 (and 4.14.147 too).

Please, do try if you can reproduce with Linus' latest git. I've no idea
what is, or is not, in those stable trees.

> crash> * cfs_rq ffff99a96dda9800
> struct cfs_rq {
>   load = {  weight = 1048576,  inv_weight = 0  }, 
>   nr_running = 1, 
>   h_nr_running = 1, 
>   exec_clock = 0, 
>   min_vruntime = 190894920101, 
>   tasks_timeline = {  rb_root = {    rb_node = 0xffff99a9502e0d10  },   rb_leftmost = 0x0  }, 
>   curr = 0x0, 
>   next = 0x0, 
>   last = 0x0, 
>   skip = 0x0, 


> &cfs_rq->tasks_timeline->rb_leftmost
>   tasks_timeline = {
>     rb_root = {
>       rb_node = 0xffff99a9502e0d10
>     }, 
>     rb_leftmost = 0x0
>   }, 

> include/linux/rbtree.h:91:#define rb_first_cached(root) (root)->rb_leftmost

> struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
> {
> 	struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
> 
> 	if (!left)
> 		return NULL; <<<<<<<<<< the case
> 
> 	return rb_entry(left, struct sched_entity, run_node);
> }

This is the problem: for some reason the rbtree code got that rb_leftmost
thing wrecked.

> Is this a corner case nobody thought of or do we have cfs_rq data that is unexpected in its content?

No, the rbtree is corrupt. Your tree has a single node (which matches
with nr_running), but for some reason it thinks rb_leftmost is NULL.
This is wrong, if the tree is non-empty, it must have a leftmost
element.

Can you reproduce at will? If so, can you please try the latest kernel,
and or share the reproducer?


* Re: Crash in fair scheduler
  2019-12-03 10:30 ` Valentin Schneider
@ 2019-12-03 10:40   ` Dietmar Eggemann
  2019-12-03 11:09     ` Valentin Schneider
  0 siblings, 1 reply; 12+ messages in thread
From: Dietmar Eggemann @ 2019-12-03 10:40 UTC (permalink / raw)
  To: Valentin Schneider, Schmid, Carsten, mingo, peterz, linux-kernel

On 03/12/2019 11:30, Valentin Schneider wrote:
> On 03/12/2019 09:11, Schmid, Carsten wrote:

[...]

> That looks a lot like a recent issue we've had, see
> 
>   https://lore.kernel.org/lkml/20191108131909.428842459@infradead.org/
> 
> The issue is caused by
>   
>   67692435c411 ("sched: Rework pick_next_task() slow-path")
> 
> which 5.4-rc2 has (without the fix which landed in -rc7) but 4.14 really
> shouldn't, unless the kernel you're using has had core scheduling somehow
> backported to it?
> 
> I've only scraped the surface but I'd like to first ask: can you reproduce
> the issue on v5.4 final ?

Can't be. 4.14.86 does not have ("sched: Rework pick_next_task()
slow-path").


* AW: Crash in fair scheduler
  2019-12-03 10:30 ` Peter Zijlstra
@ 2019-12-03 10:51   ` Schmid, Carsten
  2019-12-03 14:01     ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Schmid, Carsten @ 2019-12-03 10:51 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel

> > we had a crash in the fair scheduler and analysis shows that this could
> happen again.
> > Happened on 4.14.86 (LTS series) but failing code path still exists in 5.4-rc2
> (and 4.14.147 too).
> 
> Please, do try if you can reproduce with Linus' latest git. I've no idea
> what is, or is not, in those stable trees.
> 
unfortunately a one-off issue so far ...


--- snip ---

> > include/linux/rbtree.h:91:#define rb_first_cached(root) (root)-
> >rb_leftmost
> 
> > struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
> > {
> > 	struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
> >
> > 	if (!left)
> > 		return NULL; <<<<<<<<<< the case
> >
> > 	return rb_entry(left, struct sched_entity, run_node);
> > }
> 
> > This is the problem: for some reason the rbtree code got that rb_leftmost
> thing wrecked.
> 
Any known issue on rbtree code regarding this?

> > Is this a corner case nobody thought of or do we have cfs_rq data that is
> unexpected in its content?
> 
> No, the rbtree is corrupt. Your tree has a single node (which matches
> with nr_running), but for some reason it thinks rb_leftmost is NULL.
> This is wrong, if the tree is non-empty, it must have a leftmost
> element.
Is there a chance to find the left-most element in the core dump?
Maybe I can dig deeper to find the root cause then.
Does any of the structs/data in this context point to some memory
where I can continue to search?
Where should rb_leftmost point to if only one node is in the tree?
To the node itself?

> 
> Can you reproduce at will? If so, can you please try the latest kernel,
> and or share the reproducer?
Unfortunately this was a one-off issue so far; I don't have a reproducer yet.

Thanks,
Carsten


* Re: Crash in fair scheduler
  2019-12-03 10:40   ` Dietmar Eggemann
@ 2019-12-03 11:09     ` Valentin Schneider
  2019-12-03 15:08       ` Dietmar Eggemann
  0 siblings, 1 reply; 12+ messages in thread
From: Valentin Schneider @ 2019-12-03 11:09 UTC (permalink / raw)
  To: Dietmar Eggemann, Schmid, Carsten, mingo, peterz, linux-kernel

On 03/12/2019 10:40, Dietmar Eggemann wrote:
> On 03/12/2019 11:30, Valentin Schneider wrote:
>> On 03/12/2019 09:11, Schmid, Carsten wrote:
> 
> [...]
> 
>> That looks a lot like a recent issue we've had, see
>>
>>   https://lore.kernel.org/lkml/20191108131909.428842459@infradead.org/
>>
>> The issue is caused by
>>   
>>   67692435c411 ("sched: Rework pick_next_task() slow-path")
>>
>> which 5.4-rc2 has (without the fix which landed in -rc7) but 4.14 really
>> shouldn't, unless the kernel you're using has had core scheduling somehow
>> backported to it?
>>
>> I've only scraped the surface but I'd like to first ask: can you reproduce
>> the issue on v5.4 final ?
> 
> Can't be. 4.14.86 does not have ("sched: Rework pick_next_task()
> slow-path").
> 

Right, which is why I wondered if the kernel under test had had that 
backported to it for some reason (e.g. core scheduling). Peter pointed out
that this is a slightly different issue (nr_running matches the rbt), so
this is probably unrelated.


* Re: Crash in fair scheduler
  2019-12-03 10:51   ` AW: " Schmid, Carsten
@ 2019-12-03 14:01     ` Peter Zijlstra
  2019-12-05 10:56       ` AW: " Schmid, Carsten
  2019-12-05 17:41       ` Davidlohr Bueso
  0 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2019-12-03 14:01 UTC (permalink / raw)
  To: Schmid, Carsten; +Cc: mingo, linux-kernel, walken, dave

On Tue, Dec 03, 2019 at 10:51:46AM +0000, Schmid, Carsten wrote:

> > > struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
> > > {
> > > 	struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
> > >
> > > 	if (!left)
> > > 		return NULL; <<<<<<<<<< the case
> > >
> > > 	return rb_entry(left, struct sched_entity, run_node);
> > > }
> > 
> > This is the problem: for some reason the rbtree code got that rb_leftmost
> > thing wrecked.
> > 
> Any known issue on rbtree code regarding this?

I don't recall ever having seen this before. :/ Adding Davidlohr and
Michel who've poked at the rbtree code 'recently'.

> > > Is this a corner case nobody thought of or do we have cfs_rq data that is
> > unexpected in its content?
> > 
> > No, the rbtree is corrupt. Your tree has a single node (which matches
> > with nr_running), but for some reason it thinks rb_leftmost is NULL.
> > This is wrong, if the tree is non-empty, it must have a leftmost
> > element.
> Is there a chance to find the left-most element in the core dump?

If there is only one entry in the tree, then that must also be the
leftmost entry. See your own later question :-)

> Maybe I can dig deeper to find the root cause then.
> Does any of the structs/data in this context point to some memory
> where I can continue to search?

There are only two places where rb_leftmost is updated,
rb_insert_color_cached() and rb_erase_cached() (the scheduler does not
use rb_replace_node_cached()).

We can 'forget' to set leftmost on insertion if @leftmost is somehow
false, and we can erroneously clear leftmost on erase if rb_next()
malfunctions.
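
For reference, the two helpers look like this in 4.14's include/linux/rbtree.h
(comments added):

static inline void rb_insert_color_cached(struct rb_node *node,
					  struct rb_root_cached *root,
					  bool leftmost)
{
	/* the caller tells us whether the new node became the leftmost */
	if (leftmost)
		root->rb_leftmost = node;
	rb_insert_color(node, &root->rb_root);
}

static inline void rb_erase_cached(struct rb_node *node,
				   struct rb_root_cached *root)
{
	/* when erasing the cached leftmost, its successor becomes leftmost */
	if (root->rb_leftmost == node)
		root->rb_leftmost = rb_next(node);
	rb_erase(node, &root->rb_root);
}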

No clues on which of those two cases happened.

> Where should rb_leftmost point to if only one node is in the tree?
> To the node itself?

Exactly.


I suppose one approach is to add code to both __enqueue_entity() and
__dequeue_entity() that compares ->rb_leftmost to the result of
rb_first(). That'd incur some overhead but it'd double check the logic.
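
Something along these lines (untested sketch only, the helper name is made up):

static inline void check_leftmost_consistency(struct cfs_rq *cfs_rq)
{
	struct rb_root_cached *timeline = &cfs_rq->tasks_timeline;

	/* the cached leftmost must always match a full tree walk */
	WARN_ON_ONCE(timeline->rb_leftmost != rb_first(&timeline->rb_root));
}

called at the end of __enqueue_entity() and __dequeue_entity().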


* Re: Crash in fair scheduler
  2019-12-03 11:09     ` Valentin Schneider
@ 2019-12-03 15:08       ` Dietmar Eggemann
  2019-12-03 15:57         ` AW: " Schmid, Carsten
  0 siblings, 1 reply; 12+ messages in thread
From: Dietmar Eggemann @ 2019-12-03 15:08 UTC (permalink / raw)
  To: Valentin Schneider, Schmid, Carsten, mingo, peterz, linux-kernel

On 03/12/2019 12:09, Valentin Schneider wrote:
> On 03/12/2019 10:40, Dietmar Eggemann wrote:
>> On 03/12/2019 11:30, Valentin Schneider wrote:
>>> On 03/12/2019 09:11, Schmid, Carsten wrote:
>>
>> [...]
>>
>>> That looks a lot like a recent issue we've had, see
>>>
>>>   https://lore.kernel.org/lkml/20191108131909.428842459@infradead.org/
>>>
>>> The issue is caused by
>>>   
>>>   67692435c411 ("sched: Rework pick_next_task() slow-path")
>>>
>>> which 5.4-rc2 has (without the fix which landed in -rc7) but 4.14 really
>>> shouldn't, unless the kernel you're using has had core scheduling somehow
>>> backported to it?
>>>
>>> I've only scraped the surface but I'd like to first ask: can you reproduce
>>> the issue on v5.4 final ?
>>
>> Can't be. 4.14.86 does not have ("sched: Rework pick_next_task()
>> slow-path").
>>
> 
> Right, which is why I wondered if the kernel under test had had that 
> backported to it for some reason (e.g. core scheduling). Peter pointed out
> that this is a slightly different issue (nr_running matches the rbt), so
> this is probably unrelated.

I can't reproduce it on Arm64 Juno running 4.14.86. I suppose that there
is no extra reproducer testcase since the issue happened with
prev->sched_class eq. &idle_sched_class [prev eq. swapper/X 0] in the
simple path of pick_next_task_fair().

I'm running with CONFIG_SCHED_AUTOGROUP=y and CONFIG_FAIR_GROUP_SCHED=y
some taskgroup related tests for hours now. So the sched_entity (se) can
be a task, an autogroup or a taskgroup in the simple path. prev is
either swapper/X or migration/X.


* AW: Crash in fair scheduler
  2019-12-03 15:08       ` Dietmar Eggemann
@ 2019-12-03 15:57         ` Schmid, Carsten
  0 siblings, 0 replies; 12+ messages in thread
From: Schmid, Carsten @ 2019-12-03 15:57 UTC (permalink / raw)
  To: Dietmar Eggemann, Valentin Schneider, mingo, peterz, linux-kernel

> On 03/12/2019 12:09, Valentin Schneider wrote:
> > On 03/12/2019 10:40, Dietmar Eggemann wrote:
> >> On 03/12/2019 11:30, Valentin Schneider wrote:
> >>> On 03/12/2019 09:11, Schmid, Carsten wrote:
> >>
> >> [...]
> 
> I can't reproduce it on Arm64 Juno running 4.14.86. I suppose that there
> is no extra reproducer testcase since the issue happened with
> prev->sched_class eq. &idle_sched_class [prev eq. swapper/X 0] in the
> simple path of pick_next_task_fair().
> 
> I'm running with CONFIG_SCHED_AUTOGROUP=y and
> CONFIG_FAIR_GROUP_SCHED=y
> some taskgroup related tests for hours now. So the sched_entity (se) can
> be a task, an autogroup or a taskgroup in the simple path. prev is
> either swapper/X or migration/X.

We have the same kernel config settings.
However, as I stated in the analysis, we had
prev->sched_class ne. &fair_sched_class
and, unfortunately, no reproducer.

Looks like we need to find out why rb_leftmost is 0x0/NULL.



* AW: Crash in fair scheduler
  2019-12-03 14:01     ` Peter Zijlstra
@ 2019-12-05 10:56       ` Schmid, Carsten
  2019-12-05 17:41       ` Davidlohr Bueso
  1 sibling, 0 replies; 12+ messages in thread
From: Schmid, Carsten @ 2019-12-05 10:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, walken, dave

> Von: Peter Zijlstra [mailto:peterz@infradead.org]

> 
> Exactly.
> 
> 
> I suppose one approach is to add code to both __enqueue_entity() and
> __dequeue_entity() that compares ->rb_leftmost to the result of
> rb_first(). That'd incur some overhead but it'd double check the logic.

As this is a one-off without a reproducer, I would prefer an approach
that checks exactly for this case in the code path where it crashed.
Something like this (pseudo-code):

simple:
....

do {
	se = pick_next_entity(cfs_rq, NULL);
	if (unlikely(!se)) {	/* here we check for the issue */
		/* log a warning and some useful data to dmesg */
		if (!cfs_rq->tasks_timeline.rb_leftmost) {	/* our case */
			/* repair it as discussed: rb_first() walks the tree;
			 * with a single node this is the node itself */
			cfs_rq->tasks_timeline.rb_leftmost =
				rb_first(&cfs_rq->tasks_timeline.rb_root);
			se = pick_next_entity(cfs_rq, NULL);	/* should now return a valid pointer */
		} else {	/* another case happened, unknown */
			/* log a warning: UNKNOWN */
			panic("...");	/* not known what to do here, would crash anyway */
		}
	}
	set_next_entity(cfs_rq, se);
	cfs_rq = group_cfs_rq(se);
} while (cfs_rq);

This will definitely not fix rb_leftmost becoming NULL, but we can't tell
where that happened at all, so it's digging in the dark.
Maybe the data written to dmesg will help to diagnose further if the issue
happens again.
And this will not affect performance much, which I also have to take care of.

Thanks for all your suggestions.
Carsten


* Re: Crash in fair scheduler
  2019-12-03 14:01     ` Peter Zijlstra
  2019-12-05 10:56       ` AW: " Schmid, Carsten
@ 2019-12-05 17:41       ` Davidlohr Bueso
  1 sibling, 0 replies; 12+ messages in thread
From: Davidlohr Bueso @ 2019-12-05 17:41 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Schmid, Carsten, mingo, linux-kernel, walken

On Tue, 03 Dec 2019, Peter Zijlstra wrote:

>On Tue, Dec 03, 2019 at 10:51:46AM +0000, Schmid, Carsten wrote:
>
>> > > struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
>> > > {
>> > >	struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
>> > >
>> > >	if (!left)
>> > >		return NULL; <<<<<<<<<< the case
>> > >
>> > >	return rb_entry(left, struct sched_entity, run_node);
>> > > }
>> >
>> > This is the problem: for some reason the rbtree code got that rb_leftmost
>> > thing wrecked.
>> >
>> Any known issue on rbtree code regarding this?
>
>I don't recall ever having seen this before. :/ Adding Davidlohr and
>Michel who've poked at the rbtree code 'recently'.

Yeah I had never seen this either, and would expect the world to fall
apart if leftmost is buggy (much less a one-time occurrence), but the
following certainly raises a red flag:

    &cfs_rq->tasks_timeline->rb_leftmost
  tasks_timeline = {
    rb_root = {
      rb_node = 0xffff99a9502e0d10
    },
    rb_leftmost = 0x0
  },

>
>> > > Is this a corner case nobody thought of or do we have cfs_rq data that is
>> > unexpected in its content?
>> >
>> > No, the rbtree is corrupt. Your tree has a single node (which matches
>> > with nr_running), but for some reason it thinks rb_leftmost is NULL.
>> > This is wrong, if the tree is non-empty, it must have a leftmost
>> > element.
>> Is there a chance to find the left-most element in the core dump?
>
>If there is only one entry in the tree, then that must also be the
>leftmost entry. See your own later question :-)
>
>> Maybe I can dig deeper to find the root cause then.
>> Does any of the structs/data in this context point to some memory
>> where I can continue to search?
>
>There are only two places where rb_leftmost is updated,
>rb_insert_color_cached() and rb_erase_cached() (the scheduler does not
>use rb_replace_node_cached()).
>
>We can 'forget' to set leftmost on insertion if @leftmost is somehow
>false, and we can erroneously clear leftmost on erase if rb_next()
>malfunctions.
>
>No clues on which of those two cases happened.

For a bug in insertion I'm certainly not seeing it: we only call
insert into tasks_timeline in __enqueue_entity()... this is the standard
way of using the API, and I cannot see how leftmost could become false
unless we take at least one path to the right while going down the tree.
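
For reference, the insertion path looks roughly like this in 4.14 (paraphrased
from memory, kernel/sched/fair.c after bfb068892d3); leftmost can only become
false when we descend to the right at least once:

static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
	struct rb_node *parent = NULL;
	struct sched_entity *entry;
	bool leftmost = true;

	/* find the insertion point; any step to the right clears leftmost */
	while (*link) {
		parent = *link;
		entry = rb_entry(parent, struct sched_entity, run_node);
		if (entity_before(se, entry)) {
			link = &parent->rb_left;
		} else {
			link = &parent->rb_right;
			leftmost = false;
		}
	}

	rb_link_node(&se->run_node, parent, link);
	rb_insert_color_cached(&se->run_node, &cfs_rq->tasks_timeline, leftmost);
}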

For the erase case, this is more involved than insertion (rb_next()),
but this has not changed in years.

Fwiw, there have been three flavors of the leftmost pointer caching:

The first is the one the scheduler used by itself.

The second is when we moved the logic into the rbtree cached API.
bfb068892d3 (sched/fair: replace cfs_rq->rb_leftmost)

The third was the recent simplifications and cleanups from Michel,
which moved the caching checks into rbtree.h instead of passing them
down to the internal functions that actually do the insert/delete.
9f973cb3808 (lib/rbtree: avoid generating code twice for the cached versions)

Specifically looking at 4.14.86, it is using the bfb068892d3 changes.

Note how all three use the same logic to replace the rb_leftmost pointer.

>
>> Where should rb_leftmost point to if only one node is in the tree?
>> To the node itself?
>
>Exactly.
>
>
>I suppose one approach is to add code to both __enqueue_entity() and
>__dequeue_entity() that compares ->rb_leftmost to the result of
>rb_first(). That'd incur some overhead but it'd double check the logic.

We could benefit from improved debugging in rbtrees, not only the cached
flavor. Perhaps we can start with the following -- this would at least
let us know if the case where the tree is non-empty and leftmost is nil
was hit, whether in the scheduler or another user...

Thanks,
Davidlohr

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 1fd61a9af45c..b4b4df3ad0fc 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -130,7 +130,28 @@ struct rb_root_cached {
 #define RB_ROOT_CACHED (struct rb_root_cached) { {NULL, }, NULL }

 /* Same as rb_first(), but O(1) */
-#define rb_first_cached(root) (root)->rb_leftmost
+#define __rb_first_cached(root) (root)->rb_leftmost
+
+#ifndef CONFIG_RBTREE_DEBUG
+# define rb_first_cached(root) __rb_first_cached(root)
+# define rbtree_cached_debug(root) do { } while(0)
+
+#else
+static inline struct rb_node *rb_first_cached(struct rb_root_cached *root)
+{
+	struct rb_node *leftmost = __rb_first_cached(root);
+
+	WARN_ON(leftmost != rb_first(&root->rb_root));
+	return leftmost;
+}
+
+#define rbtree_cached_debug(root)					\
+do {									\
+	WARN_ON(rb_first(&(root)->rb_root) != __rb_first_cached((root)));	\
+	WARN_ON(!RB_EMPTY_ROOT(&(root)->rb_root) && !__rb_first_cached((root))); \
+	WARN_ON(RB_EMPTY_ROOT(&(root)->rb_root) && __rb_first_cached((root))); \
+} while (0)
+#endif /* CONFIG_RBTREE_DEBUG */

 static inline void rb_insert_color_cached(struct rb_node *node,
					  struct rb_root_cached *root,
@@ -139,6 +160,8 @@ static inline void rb_insert_color_cached(struct rb_node *node,
	if (leftmost)
		root->rb_leftmost = node;
	rb_insert_color(node, &root->rb_root);
+
+	rbtree_cached_debug(root);
 }

 static inline void rb_erase_cached(struct rb_node *node,
@@ -147,6 +170,8 @@ static inline void rb_erase_cached(struct rb_node *node,
	if (root->rb_leftmost == node)
		root->rb_leftmost = rb_next(node);
	rb_erase(node, &root->rb_root);
+
+	rbtree_cached_debug(root);
 }

 static inline void rb_replace_node_cached(struct rb_node *victim,
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 2f6fb96405af..62ab9f978bc6 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1727,6 +1727,16 @@ config BACKTRACE_SELF_TEST

	  Say N if you are unsure.

+config RBTREE_DEBUG
+	bool "Red-Black tree sanity tests"
+	depends on DEBUG_KERNEL
+	help
+	  This option enables runtime sanity checks on all variants
+	  of the rbtree library. Doing so can cause significant overhead,
+	  so only enable it in non-production environments.
+
+	  Say N if you are unsure.
+
 config RBTREE_TEST
	tristate "Red-Black tree test"
	depends on DEBUG_KERNEL


* AW: Crash in fair scheduler
@ 2019-12-06 10:11 Schmid, Carsten
  0 siblings, 0 replies; 12+ messages in thread
From: Schmid, Carsten @ 2019-12-06 10:11 UTC (permalink / raw)
  To: Davidlohr Bueso, Peter Zijlstra; +Cc: mingo, linux-kernel, walken

> Von: Davidlohr Bueso [mailto:dave@stgolabs.net]
> Gesendet: Donnerstag, 5. Dezember 2019 18:41
> 
> Yeah I had never seen this either, and would expect the world to fall
> appart if leftmost is buggy (much less a one time occurance), but the
> following certainly raises a red flag:
> 
>     &cfs_rq->tasks_timeline->rb_leftmost
>   tasks_timeline = {
>     rb_root = {
>       rb_node = 0xffff99a9502e0d10
>     },
>     rb_leftmost = 0x0
>   },
> 
Meanwhile I am diving a bit deeper into the kernel dump.
I can see that for this rb_root we have a node structure with 2 nodes:
crash> p -x *(struct rb_node *)0xffff99a9502e0d10
$7 = {
  __rb_parent_color = 0xffff99a9502e0d10, <- points to SELF
  rb_right = 0xffff99a9502e0d10, <- points to self
  rb_left = 0xffff99a9502e1990 <- and we have a node left
}

The rb_left node:
crash> p -x *(struct rb_node *)0xffff99a9502e1990
$6 = {
  __rb_parent_color = 0xffff99a9502e0d11, <- points to the rb_root node (bit 0 is color)
  rb_right = 0x0, <- no leaf
  rb_left = 0x0 <- no leaf
}

I'm currently trying to extract which se (scheduling entity) these
nodes belong to.
Anyway, cfs_rq->tasks_timeline.rb_leftmost should point to 0xffff99a9502e1990,
as far as I understand the rbtree, right?
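
That is also what a plain rb_first() walk would compute, since rb_first()
just follows rb_left from the root (sketch, using the node addresses from
the dump above):

	struct rb_node *n = cfs_rq->tasks_timeline.rb_root.rb_node;	/* 0xffff99a9502e0d10 */

	while (n->rb_left)
		n = n->rb_left;		/* one step left: 0xffff99a9502e1990 */

	/* n == 0xffff99a9502e1990, i.e. the expected rb_leftmost */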

> >
> >I suppose one approach is to add code to both __enqueue_entity() and
> >__dequeue_entity() that compares ->rb_leftmost to the result of
> >rb_first(). That'd incur some overhead but it'd double check the logic.
> 
> We could benefit from improved debugging in rbtrees, not only the cached
> flavor. Perhaps we can start with the following -- this would at least
> let us know if the case where the tree is non-empty and leftmost is nil
> was hit, whether in the scheduler or another user...
> 
> Thanks,
> Davidlohr
> 
That's what I will do too: add some debugging stuff.
I'll add that to the project I'm on here, not upstream, and try
to log as much debug data as possible if a similar case occurs again.
But as the rbtree is used extensively, I need to be careful where
to add debug code, due to the performance impact.

The approach you take with a configurable rb_tree debug option
might help me here, yes; I would have taken a similar approach.

Thanks,
Carsten


