From: "Schmid, Carsten" <Carsten_Schmid@mentor.com>
To: "mingo@redhat.com" <mingo@redhat.com>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Crash in fair scheduler
Date: Tue, 3 Dec 2019 09:11:14 +0000	[thread overview]
Message-ID: <1575364273836.74450@mentor.com> (raw)

[-- Attachment #1: Type: text/plain, Size: 1005 bytes --]

Hi maintainers of the fair scheduler,

we had a crash in the fair scheduler, and our analysis shows that it could happen again.
It happened on 4.14.86 (LTS series), but the failing code path still exists in 5.4-rc2 (and in 4.14.147, too).

I have collected the relevant data from the core dump and put it together with the source.
That way I was able to explain why the crash happened, but as I am not an expert on this
scheduler code I can't tell whether the data fed into the code path is OK or not.

The appended analysis has the details; please check whether we ran into a corner case
that nobody expected yet, or whether the data is somehow corrupted (I really can't tell; to me it looks fine).
Can you please check whether this is a bug in the scheduler?

To give some "surrounding" info:
it looks like a task had slept for a long time, even through some suspend/resume cycles.
When this task had to wake up, the failure occurred.

If you need more data from the core dump, let me know.

Best regards
Carsten

[-- Attachment #2: fair_scheduler_crash.txt --]
[-- Type: text/plain, Size: 11368 bytes --]

On kernel 4.14.86 we had a kernel crash in the fair scheduler.
Kernel 4.14.86 is from the 4.14 LTS series, and current 4.14 has no changes in the affected code path.
We additionally compared the v5.4-rc2 source; there are changes, but they wouldn't affect the result,
and the same crash would occur because a similar execution path would be taken.
So we concentrate on the 4.14.86 source here.

The callstack of the crash:
[28439.656848] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[28439.665662] IP: set_next_entity+0x17/0x490
[28439.670244] PGD 0 P4D 0 
[28439.673096] Oops: 0000 [#1] PREEMPT SMP NOPTI
[28439.677972] Modules linked in: bcmdhd(O) sr_mod cdrom tntfs(PO) texfat(PO) usb_storage configfs squashfs zlib_inflate xz_dec ah4 xfrm4_mode_transport ebt_ip ebtable_filter ebtables xfrm_user xfrm_algo cls_u32 sch_htb intel_tfm_governor ecryptfs cfq_iosched sd_mod snd_soc_apl_mgu_hu intel_ipu4_psys intel_ipu4_psys_csslib intel_xhci_usb_role_switch dwc3 roles udc_core adv728x snd_soc_skl sdw_cnl snd_soc_acpi_intel_match coretemp snd_soc_acpi sbi_apl intel_ipu4_isys snd_soc_core videobuf2_dma_contig snd_compress videobuf2_memops ipu4_acpi i2c_i801 intel_ipu4_isys_csslib videobuf2_v4l2 snd_soc_skl_ipc videobuf2_core ahci libahci sdw_bus xhci_pci libata xhci_hcd crc8 snd_soc_sst_ipc cfg80211 snd_soc_sst_dsp scsi_mod usbcore snd_hda_ext_core rfkill usb_common dwc3_pci snd_hda_core intel_ipu4_mmu snd_pcm
[28439.757185]  mei_me snd_timer intel_ipu4 mei snd iova soundcore nfsd auth_rpcgss lockd grace sunrpc loop fuse 8021q bridge stp llc inap560t(O) i915 video backlight intel_gtt i2c_algo_bit drm_kms_helper igb_avb(O) drm firmware_class ptp hwmon spi_pxa2xx_platform pps_core [last unloaded: bcmdhd]
[28439.786334] CPU: 3 PID: 0 Comm: swapper/3 Tainted: P     U     O    4.14.86-apl #1
[28439.794807] task: ffff99a97895a580 task.stack: ffffb21000078000
[28439.801430] RIP: 0010:set_next_entity+0x17/0x490
[28439.806597] RSP: 0018:ffffb2100007bdb0 EFLAGS: 00010082
[28439.812444] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[28439.820428] RDX: 00000000000003e1 RSI: 0000000000000000 RDI: ffff99a96dda9800
[28439.828408] RBP: ffffb2100007bdf0 R08: 0000000000000000 R09: 0000000000000000
[28439.836395] R10: 0000000000000000 R11: 0000042df10058da R12: ffff99a97895a580
[28439.844383] R13: ffff99a96dda9800 R14: 0000000000000000 R15: ffff99a96dda9800
[28439.852371] FS:  0000000000000000(0000) GS:ffff99a97fd80000(0000) knlGS:0000000000000000
[28439.861429] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[28439.867864] CR2: 0000000000000038 CR3: 00000001f7a64000 CR4: 00000000003406a0
[28439.875852] Call Trace:
[28439.878590]  pick_next_task_fair+0x517/0x560
[28439.883369]  __schedule+0xfa/0x7b0
[28439.887172]  schedule_idle+0x1d/0x40
[28439.891170]  do_idle+0x134/0x1a0
[28439.894779]  cpu_startup_entry+0x6e/0x70
[28439.899168]  start_secondary+0x1a7/0x1f0
[28439.903555]  secondary_startup_64+0xa5/0xb0
[28439.908233] Code: 70 44 29 f6 41 89 d5 e8 58 f9 ff ff e9 11 ff ff ff 0f 1f 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd 41 54 53 48 89 f3 48 83 ec 18 <8b> 4e 38 85 c9 75 3d 4d 8b a5 d0 00 00 00 41 83 bc 24 f8 08 00 
[28439.929423] RIP: set_next_entity+0x17/0x490 RSP: ffffb2100007bdb0
[28439.936238] CR2: 0000000000000038
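
A side note on the faulting instruction: the marked bytes in the Code: line, <8b> 4e 38,
decode to "mov 0x38(%rsi),%ecx" (worth double-checking with objdump). RSI is 0 in the
register dump, and per the x86-64 calling convention RSI holds the second argument of
set_next_entity, i.e. se. So the oops is a 4-byte load from se->on_rq with se == NULL,
matching both CR2 and the "at 0000000000000038" in the BUG line.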

A dump of the cfs_rq used here (slightly compressed for readability):
crash> * cfs_rq ffff99a96dda9800
struct cfs_rq {
  load = {  weight = 1048576,  inv_weight = 0  }, 
  nr_running = 1, 
  h_nr_running = 1, 
  exec_clock = 0, 
  min_vruntime = 190894920101, 
  tasks_timeline = {  rb_root = {    rb_node = 0xffff99a9502e0d10  },   rb_leftmost = 0x0  }, 
  curr = 0x0, 
  next = 0x0, 
  last = 0x0, 
  skip = 0x0, 
  nr_spread_over = 0, 
  avg = { last_update_time = 28439656814592,  load_sum = 7948862,  util_sum = 1838088,
          period_contrib = 278,  load_avg = 167,  util_avg = 37  }, 
  runnable_load_sum = 5920382, 
  runnable_load_avg = 124, 
  tg_load_avg_contrib = 167, 
  propagate_avg = 0, 
  removed_load_avg = { counter = 0 }, 
  removed_util_avg = { counter = 0 }, 
  h_load = 26, 
  last_h_load_update = 4323103347, 
  h_load_next = 0x0, 
  rq = 0xffff99a97fd9f500, 
  on_list = 1, 
  leaf_cfs_rq_list = { next = 0xffff99a9782ad4e0, prev = 0xffff99a9779bece0 }, 
  tg = 0xffff99a977b35d40, 
  runtime_enabled = 0, 
  runtime_expires = 0, 
  runtime_remaining = 0, 
  throttled_clock = 0, 
  throttled_clock_task = 0, 
  throttled_clock_task_time = 0, 
  throttled = 0, 
  throttle_count = 0, 
  throttled_list = { next = 0xffff99a96dda9930, prev = 0xffff99a96dda9930 }
}
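
Note the inconsistent combination in this dump: nr_running == 1 and
tasks_timeline.rb_root.rb_node != NULL, but tasks_timeline.rb_leftmost == NULL and
curr == NULL. To spell out the invariant we believe should hold (an illustrative
sketch only, not kernel code), a consistency check over this state would be:

/* Illustrative only: what we expect of a healthy cfs_rq. */
static bool cfs_rq_looks_consistent(const struct cfs_rq *cfs_rq)
{
	/* If anything is runnable, there must be a pickable entity:
	 * either a current one or a cached leftmost tree node. */
	if (cfs_rq->nr_running > 0 &&
	    !cfs_rq->curr && !cfs_rq->tasks_timeline.rb_leftmost)
		return false;

	/* A non-empty tree must have a cached leftmost node. */
	if (cfs_rq->tasks_timeline.rb_root.rb_node &&
	    !cfs_rq->tasks_timeline.rb_leftmost)
		return false;

	return true;
}

The dumped cfs_rq fails both checks, which is why pick_next_entity() below ends up
with nothing to return.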

The complete stack frame of the crash (with some comments added; the crash-related top of stack removed):
crash> bt -FF
PID: 0      TASK: ffff99a97895a580  CPU: 3   COMMAND: "swapper/3"
 #0 [ffffb2100007ba20] machine_kexec at ffffffffaa035a5f
--- snip ---
 #9 [ffffb2100007bd00] page_fault at ffffffffaa601215
    [exception RIP: set_next_entity+23]
    RIP: ffffffffaa0ad2e7  RSP: ffffb2100007bdb0  RFLAGS: 00010082
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: 00000000000003e1  RSI: 0000000000000000  RDI: ffff99a96dda9800
    RBP: ffffb2100007bdf0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000042df10058da  R12: ffff99a97895a580
    R13: ffff99a96dda9800  R14: 0000000000000000  R15: ffff99a96dda9800
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    ffffb2100007bd08: [ffff99a96dda9800:kmalloc-512] 0000000000000000 
    ffffb2100007bd18: [ffff99a96dda9800:kmalloc-512] [ffff99a97895a580:task_struct] 
    ffffb2100007bd28: ffffb2100007bdf0 0000000000000000 
    ffffb2100007bd38: 0000042df10058da 0000000000000000 
    ffffb2100007bd48: 0000000000000000 0000000000000000 
    ffffb2100007bd58: 0000000000000000 0000000000000000 
    ffffb2100007bd68: 00000000000003e1 0000000000000000 
    ffffb2100007bd78: [ffff99a96dda9800:kmalloc-512] ffffffffffffffff 
    ffffb2100007bd88: set_next_entity+23 0000000000000010 
    ffffb2100007bd98: 0000000000010082 ffffb2100007bdb0 
    ffffb2100007bda8: 0000000000000018 0000000000000002 
    ffffb2100007bdb8: 0000000000000000 0000000000000000 
    ffffb2100007bdc8: ffff99a97fd9f500 [ffff99a97895a580:task_struct] 
    ffffb2100007bdd8: [ffff99a97895aab8:task_struct] 0000000000000000 
    ffffb2100007bde8: [ffff99a96dda9800:kmalloc-512] ffffb2100007be60:rbp 
    ffffb2100007bdf8: pick_next_task_fair+1303 
    
#10 [ffffb2100007bdf8] pick_next_task_fair at ffffffffaa0b5457
    ffffb2100007be00: ffffb2100007be10 
                      0000000000000018 
    ffffb2100007be10: ffff99a97fd9f560 QWORD PTR gs:0x28
                      ffffb2100007be80 
    ffffb2100007be20: [ffff99a97895a580:task_struct] saved param *prev
                      ffffb2100007be48
    ffffb2100007be30: f048ce7809464d00 ffff99a97fd9f500=rbx 
    ffffb2100007be40: [ffff99a97895a580:task_struct]=r12 [ffff99a97895aab8:task_struct]=r13 
    ffffb2100007be50: 0000000000000000=r14  0000000000000000=r15  
    ffffb2100007be60: ffffb2100007bec0=rbp  __schedule+250   
    

The failing piece of code is in kernel/sched/fair.c (line 3873 in kernel 4.14.86; line 4153 in v5.4-rc2); it crashes because se is NULL:
static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* 'current' is not kept within the tree. */
	if (se->on_rq) { <<<<<<< crash here
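
A note on why the fault address is 0x38: in the 4.14 layout of struct sched_entity
(load, run_node, group_node, on_rq, ...), on_rq ends up at offset 0x38 on x86-64, so
reading se->on_rq with se == NULL faults at address 0x38, exactly the CR2 value in
the oops. A minimal userspace sketch to check the offset (field types copied from the
4.14 headers; the padding is an assumption, double-check against vmlinux with pahole):

#include <stddef.h>
#include <stdio.h>

/* Stand-ins for the kernel types, x86-64 sizes assumed. */
struct load_weight { unsigned long weight; unsigned int inv_weight; };
struct rb_node { unsigned long __rb_parent_color; struct rb_node *rb_right, *rb_left; };
struct list_head { struct list_head *next, *prev; };

/* Head of struct sched_entity as of 4.14; fields beyond on_rq omitted. */
struct sched_entity_head {
	struct load_weight load;	/* 0x00, 16 bytes incl. padding */
	struct rb_node run_node;	/* 0x10, 24 bytes */
	struct list_head group_node;	/* 0x28, 16 bytes */
	unsigned int on_rq;		/* 0x38 */
};

int main(void)
{
	printf("offsetof(on_rq) = %#zx\n", offsetof(struct sched_entity_head, on_rq));
	return 0;
}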

set_next_entity is called from within pick_next_task_fair, from the following piece of code:
static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	struct cfs_rq *cfs_rq = &rq->cfs;
	struct sched_entity *se;
	struct task_struct *p;
	int new_tasks;

again:
	if (!cfs_rq->nr_running) // this is 1, so we are not going to idle
		goto idle;

#ifdef CONFIG_FAIR_GROUP_SCHED
	if (prev->sched_class != &fair_sched_class) <<<<< this is true:
							crash> p &fair_sched_class
							$1 = (const struct sched_class *) 0xffffffffaaa10cc0 <<<<
							crash> $prev=ffff99a97895a580
							crash> gdb set $prev=(struct task_struct *)0xffff99a97895a580
							crash> p $prev->sched_class
							$2 = (const struct sched_class *) 0xffffffffaaa10b40 <<<<
		goto simple; <<<< so we go to simple
....
(Line 6360, Kernel 4.14.86; Line 6820 Kernel v5.4-rc2)
simple:
#endif

	put_prev_task(rq, prev);

	do {
		se = pick_next_entity(cfs_rq, NULL); <<<< this returns se=NULL
		set_next_entity(cfs_rq, se); <<<<<<<< here we crash
		cfs_rq = group_cfs_rq(se);
	} while (cfs_rq);
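
Note that the loop has no NULL check at all; with nr_running == 1 it (reasonably)
assumes pick_next_entity() must find something. Purely to illustrate where that
assumption bites, not as a proposed fix, a defensive variant would look like:

	do {
		se = pick_next_entity(cfs_rq, NULL);
		/* Hypothetical guard, not in mainline: nr_running != 0
		 * is assumed to imply a pickable entity exists. */
		if (WARN_ON_ONCE(!se))
			goto idle;
		set_next_entity(cfs_rq, se);
		cfs_rq = group_cfs_rq(se);
	} while (cfs_rq);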

So why is se = NULL returned?


Looking at pick_next_entity with the cfs_rq above fed into it and curr = NULL:
(Line 3915, Kernel 4.14.86; Line 4196 Kernel v5.4-rc2)
/*
 * Pick the next process, keeping these things in mind, in this order:
 * 1) keep things fair between processes/task groups
 * 2) pick the "next" process, since someone really wants that to run
 * 3) pick the "last" process, for cache locality
 * 4) do not run the "skip" process, if something else is available
 */
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	struct sched_entity *left = __pick_first_entity(cfs_rq); <<<< returns NULL
	struct sched_entity *se;

The relevant field, cfs_rq->tasks_timeline (note rb_leftmost):
  tasks_timeline = {
    rb_root = {
      rb_node = 0xffff99a9502e0d10
    }, 
    rb_leftmost = 0x0
  }, 
include/linux/rbtree.h:91:#define rb_first_cached(root) (root)->rb_leftmost
struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
	struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

	if (!left)
		return NULL; <<<<<<<<<< the case

	return rb_entry(left, struct sched_entity, run_node);
}
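
For context: rb_leftmost is the cached leftmost node of the rbtree (rb_root_cached
was introduced around 4.14), and the cache stays valid only if every update goes
through the _cached helpers, as fair.c does in __enqueue_entity()/__dequeue_entity().
Roughly (a sketch of the contract, not a verbatim kernel excerpt):

	/* Enqueue: rb_link_node() links the node; the 'leftmost' flag tells
	 * rb_insert_color_cached() whether to update rb_leftmost. */
	rb_link_node(&se->run_node, parent, link);
	rb_insert_color_cached(&se->run_node, &cfs_rq->tasks_timeline, leftmost);

	/* Dequeue: rb_erase_cached() advances rb_leftmost via rb_next()
	 * when the cached node itself is removed. */
	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);

	/* Lookup: O(1) read of the cache, as in __pick_first_entity() above. */
	struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

Given that contract, a tree with rb_root.rb_node != NULL but rb_leftmost == NULL
should be unreachable; it points at either corruption of the cfs_rq or an update
that bypassed or raced these helpers.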

Back in pick_next_entity, we now have
curr = NULL
left = NULL

	/*
	 * If curr is set we have to see if its left of the leftmost entity
	 * still in the tree, provided there was anything in the tree at all.
	 */
	if (!left <<< this is the case here >>>> || (curr && entity_before(curr, left)))
		left = curr; <<<< so we still get left = curr = NULL

	se = left; /* ideally we run the leftmost entity */ <<<< but we now have se = NULL ????

	/*
	 * Avoid running the skip buddy, if running something else can
	 * be done without getting too unfair.
	 */
	if (cfs_rq->skip == se) { <<<< that's equal here, se = NULL = 0x0 = cfs_rq->skip
		struct sched_entity *second;

		if (se == curr) { <<<< and this too, NULL = NULL
			second = __pick_first_entity(cfs_rq); <<<< so we get again NULL like we did for *left
		} else {
			second = __pick_next_entity(se);
			if (!second || (curr && entity_before(curr, second)))
				second = curr;
		}

		if (second && wakeup_preempt_entity(second, left) < 1) <<<< as second is also NULL nothing is done here, se unchanged
			se = second;
	}

	/*
	 * Prefer last buddy, try to return the CPU to a preempted task.
	 */
	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) <<<< cfs_rq->last = 0x0, not true, se unchanged
		se = cfs_rq->last;

	/*
	 * Someone really wants this to run. If it's not unfair, run it.
	 */
	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) <<<< cfs_rq->next = 0x0, not true, se unchanged
		se = cfs_rq->next;

	<<<< we reach this point with se = NULL >>>>
	clear_buddies(cfs_rq, se); <<<< this doesn't change anything in se

	return se; <<<< so we get se=NULL as a result.
}

To sum up: with curr == NULL and rb_leftmost == NULL, every branch in pick_next_entity
leaves se == NULL, even though nr_running == 1 told pick_next_task_fair that something
is runnable.
Is this a corner case nobody thought of yet, or do we have cfs_rq data with unexpected content?



