From: "Schmid, Carsten" <Carsten_Schmid@mentor.com>
To: "mingo@redhat.com" <mingo@redhat.com>,
"peterz@infradead.org" <peterz@infradead.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Crash in fair scheduler
Date: Tue, 3 Dec 2019 09:11:14 +0000 [thread overview]
Message-ID: <1575364273836.74450@mentor.com> (raw)
[-- Attachment #1: Type: text/plain, Size: 1005 bytes --]
Hi maintainers of the fair scheduler,
we had a crash in the fair scheduler and analysis shows that this could happen again.
Happened on 4.14.86 (LTS series) but failing code path still exists in 5.4-rc2 (and 4.14.147 too).
I have collected the relevant data from the core dump and put it together with the source.
So I was able to explain why the crash happened, but as I am not an expert on this
scheduler stuff I can't tell if any of the data fed into the code path is OK or not.
In the appended analysis you can find the details; please check whether we ran into a corner
case that nobody expected yet or whether the data is somehow corrupted (I really can't tell; to me it looks fine).
Can you please check if this is a bug in the scheduler?
To give some "surrounding" info:
It looks like a task slept for a long time, even through some suspend/resume cycles.
When this task had to wake up, the failure occurred.
If you need more data from the core dump, let me know.
Best regards
Carsten
[-- Attachment #2: fair_scheduler_crash.txt --]
[-- Type: text/plain, Size: 11368 bytes --]
On kernel 4.14.86 we had a kernel crash in the fair scheduler.
Kernel 4.14.86 is from 4.14 LTS series, and current 4.14 has no change in the affected code path.
Additionally we compared the v5.4-rc2 source; there are changes, but they wouldn't affect the result
and would lead to the same crash, as a similar execution path would be taken.
So we concentrate on the 4.14.86 source here.
The callstack of the crash:
[28439.656848] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[28439.665662] IP: set_next_entity+0x17/0x490
[28439.670244] PGD 0 P4D 0
[28439.673096] Oops: 0000 [#1] PREEMPT SMP NOPTI
[28439.677972] Modules linked in: bcmdhd(O) sr_mod cdrom tntfs(PO) texfat(PO) usb_storage configfs squashfs zlib_inflate xz_dec ah4 xfrm4_mode_transport ebt_ip ebtable_filter ebtables xfrm_user xfrm_algo cls_u32 sch_htb intel_tfm_governor ecryptfs cfq_iosched sd_mod snd_soc_apl_mgu_hu intel_ipu4_psys intel_ipu4_psys_csslib intel_xhci_usb_role_switch dwc3 roles udc_core adv728x snd_soc_skl sdw_cnl snd_soc_acpi_intel_match coretemp snd_soc_acpi sbi_apl intel_ipu4_isys snd_soc_core videobuf2_dma_contig snd_compress videobuf2_memops ipu4_acpi i2c_i801 intel_ipu4_isys_csslib videobuf2_v4l2 snd_soc_skl_ipc videobuf2_core ahci libahci sdw_bus xhci_pci libata xhci_hcd crc8 snd_soc_sst_ipc cfg80211 snd_soc_sst_dsp scsi_mod usbcore snd_hda_ext_core rfkill usb_common dwc3_pci snd_hda_core intel_ipu4_mmu snd_pcm
[28439.757185] mei_me snd_timer intel_ipu4 mei snd iova soundcore nfsd auth_rpcgss lockd grace sunrpc loop fuse 8021q bridge stp llc inap560t(O) i915 video backlight intel_gtt i2c_algo_bit drm_kms_helper igb_avb(O) drm firmware_class ptp hwmon spi_pxa2xx_platform pps_core [last unloaded: bcmdhd]
[28439.786334] CPU: 3 PID: 0 Comm: swapper/3 Tainted: P U O 4.14.86-apl #1
[28439.794807] task: ffff99a97895a580 task.stack: ffffb21000078000
[28439.801430] RIP: 0010:set_next_entity+0x17/0x490
[28439.806597] RSP: 0018:ffffb2100007bdb0 EFLAGS: 00010082
[28439.812444] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[28439.820428] RDX: 00000000000003e1 RSI: 0000000000000000 RDI: ffff99a96dda9800
[28439.828408] RBP: ffffb2100007bdf0 R08: 0000000000000000 R09: 0000000000000000
[28439.836395] R10: 0000000000000000 R11: 0000042df10058da R12: ffff99a97895a580
[28439.844383] R13: ffff99a96dda9800 R14: 0000000000000000 R15: ffff99a96dda9800
[28439.852371] FS: 0000000000000000(0000) GS:ffff99a97fd80000(0000) knlGS:0000000000000000
[28439.861429] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[28439.867864] CR2: 0000000000000038 CR3: 00000001f7a64000 CR4: 00000000003406a0
[28439.875852] Call Trace:
[28439.878590] pick_next_task_fair+0x517/0x560
[28439.883369] __schedule+0xfa/0x7b0
[28439.887172] schedule_idle+0x1d/0x40
[28439.891170] do_idle+0x134/0x1a0
[28439.894779] cpu_startup_entry+0x6e/0x70
[28439.899168] start_secondary+0x1a7/0x1f0
[28439.903555] secondary_startup_64+0xa5/0xb0
[28439.908233] Code: 70 44 29 f6 41 89 d5 e8 58 f9 ff ff e9 11 ff ff ff 0f 1f 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd 41 54 53 48 89 f3 48 83 ec 18 <8b> 4e 38 85 c9 75 3d 4d 8b a5 d0 00 00 00 41 83 bc 24 f8 08 00
[28439.929423] RIP: set_next_entity+0x17/0x490 RSP: ffffb2100007bdb0
[28439.936238] CR2: 0000000000000038
A dump of the cfs_rq used here (slightly compressed to ease reading):
crash> * cfs_rq ffff99a96dda9800
struct cfs_rq {
load = { weight = 1048576, inv_weight = 0 },
nr_running = 1,
h_nr_running = 1,
exec_clock = 0,
min_vruntime = 190894920101,
tasks_timeline = { rb_root = { rb_node = 0xffff99a9502e0d10 }, rb_leftmost = 0x0 },
curr = 0x0,
next = 0x0,
last = 0x0,
skip = 0x0,
nr_spread_over = 0,
avg = { last_update_time = 28439656814592, load_sum = 7948862, util_sum = 1838088,
period_contrib = 278, load_avg = 167, util_avg = 37 },
runnable_load_sum = 5920382,
runnable_load_avg = 124,
tg_load_avg_contrib = 167,
propagate_avg = 0,
removed_load_avg = { counter = 0 },
removed_util_avg = { counter = 0 },
h_load = 26,
last_h_load_update = 4323103347,
h_load_next = 0x0,
rq = 0xffff99a97fd9f500,
on_list = 1,
leaf_cfs_rq_list = { next = 0xffff99a9782ad4e0, prev = 0xffff99a9779bece0 },
tg = 0xffff99a977b35d40,
runtime_enabled = 0,
runtime_expires = 0,
runtime_remaining = 0,
throttled_clock = 0,
throttled_clock_task = 0,
throttled_clock_task_time = 0,
throttled = 0,
throttle_count = 0,
throttled_list = { next = 0xffff99a96dda9930, prev = 0xffff99a96dda9930 }
}
and the complete stack frame of the crash (with some comments added; the crash-related top of stack removed):
crash> bt -FF
PID: 0 TASK: ffff99a97895a580 CPU: 3 COMMAND: "swapper/3"
#0 [ffffb2100007ba20] machine_kexec at ffffffffaa035a5f
--- snip ---
#9 [ffffb2100007bd00] page_fault at ffffffffaa601215
[exception RIP: set_next_entity+23]
RIP: ffffffffaa0ad2e7 RSP: ffffb2100007bdb0 RFLAGS: 00010082
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 00000000000003e1 RSI: 0000000000000000 RDI: ffff99a96dda9800
RBP: ffffb2100007bdf0 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000042df10058da R12: ffff99a97895a580
R13: ffff99a96dda9800 R14: 0000000000000000 R15: ffff99a96dda9800
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
ffffb2100007bd08: [ffff99a96dda9800:kmalloc-512] 0000000000000000
ffffb2100007bd18: [ffff99a96dda9800:kmalloc-512] [ffff99a97895a580:task_struct]
ffffb2100007bd28: ffffb2100007bdf0 0000000000000000
ffffb2100007bd38: 0000042df10058da 0000000000000000
ffffb2100007bd48: 0000000000000000 0000000000000000
ffffb2100007bd58: 0000000000000000 0000000000000000
ffffb2100007bd68: 00000000000003e1 0000000000000000
ffffb2100007bd78: [ffff99a96dda9800:kmalloc-512] ffffffffffffffff
ffffb2100007bd88: set_next_entity+23 0000000000000010
ffffb2100007bd98: 0000000000010082 ffffb2100007bdb0
ffffb2100007bda8: 0000000000000018 0000000000000002
ffffb2100007bdb8: 0000000000000000 0000000000000000
ffffb2100007bdc8: ffff99a97fd9f500 [ffff99a97895a580:task_struct]
ffffb2100007bdd8: [ffff99a97895aab8:task_struct] 0000000000000000
ffffb2100007bde8: [ffff99a96dda9800:kmalloc-512] ffffb2100007be60:rbp
ffffb2100007bdf8: pick_next_task_fair+1303
#10 [ffffb2100007bdf8] pick_next_task_fair at ffffffffaa0b5457
ffffb2100007be00: ffffb2100007be10
0000000000000018
ffffb2100007be10: ffff99a97fd9f560 QWORD PTR gs:0x28
ffffb2100007be80
ffffb2100007be20: [ffff99a97895a580:task_struct] saved param *prev
ffffb2100007be48
ffffb2100007be30: f048ce7809464d00 ffff99a97fd9f500=rbx
ffffb2100007be40: [ffff99a97895a580:task_struct]=r12 [ffff99a97895aab8:task_struct]=r13
ffffb2100007be50: 0000000000000000=r14 0000000000000000=r15
ffffb2100007be60: ffffb2100007bec0=rbp __schedule+250
The failing piece of code is in kernel/sched/fair.c (Line 3873, Kernel 4.14.86; Line 4153 Kernel v5.4-rc2), because se is NULL:
static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
/* 'current' is not kept within the tree. */
if (se->on_rq) { <<<<<<< crash here
set_next_entity is called from within pick_next_task_fair, from the following piece of code:
static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se;
struct task_struct *p;
int new_tasks;
again:
if (!cfs_rq->nr_running) // this is 1, so we are not going to idle
goto idle;
#ifdef CONFIG_FAIR_GROUP_SCHED
if (prev->sched_class != &fair_sched_class) <<<<< this is true:
crash> p &fair_sched_class
$1 = (const struct sched_class *) 0xffffffffaaa10cc0 <<<<
crash> $prev=ffff99a97895a580
crash> gdb set $prev=(struct task_struct *)0xffff99a97895a580
crash> p $prev->sched_class
$2 = (const struct sched_class *) 0xffffffffaaa10b40 <<<<
goto simple; <<<< so we go to simple
....
(Line 6360, Kernel 4.14.86; Line 6820 Kernel v5.4-rc2)
simple:
#endif
put_prev_task(rq, prev);
do {
se = pick_next_entity(cfs_rq, NULL); <<<< this returns se=NULL
set_next_entity(cfs_rq, se); <<<<<<<< here we crash
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
So why is se = NULL returned?
Looking at pick_next_entity and cfs_rq fed into it together with curr = NULL:
(Line 3915, Kernel 4.14.86; Line 4196 Kernel v5.4-rc2)
/*
* Pick the next process, keeping these things in mind, in this order:
* 1) keep things fair between processes/task groups
* 2) pick the "next" process, since someone really wants that to run
* 3) pick the "last" process, for cache locality
* 4) do not run the "skip" process, if something else is available
*/
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
struct sched_entity *left = __pick_first_entity(cfs_rq); <<<< returns NULL
struct sched_entity *se;
Looking at cfs_rq->tasks_timeline.rb_leftmost in the dump:
tasks_timeline = {
rb_root = {
rb_node = 0xffff99a9502e0d10
},
rb_leftmost = 0x0
},
include/linux/rbtree.h:91:#define rb_first_cached(root) (root)->rb_leftmost
struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
if (!left)
return NULL; <<<<<<<<<< the case
return rb_entry(left, struct sched_entity, run_node);
}
Back in pick_next_entity, we now have
curr = NULL
left = NULL
/*
* If curr is set we have to see if its left of the leftmost entity
* still in the tree, provided there was anything in the tree at all.
*/
if (!left <<< this is the case here >>>> || (curr && entity_before(curr, left)))
left = curr; <<<< so we still get left = curr = NULL
se = left; /* ideally we run the leftmost entity */ <<<< but we now have se = NULL ????
/*
* Avoid running the skip buddy, if running something else can
* be done without getting too unfair.
*/
if (cfs_rq->skip == se) { <<<< that's equal here, se = NULL = 0x0 = cfs_rq->skip
struct sched_entity *second;
if (se == curr) { <<<< and this too, NULL = NULL
second = __pick_first_entity(cfs_rq); <<<< so we get again NULL like we did for *left
} else {
second = __pick_next_entity(se);
if (!second || (curr && entity_before(curr, second)))
second = curr;
}
if (second && wakeup_preempt_entity(second, left) < 1) <<<< as second is also NULL nothing is done here, se unchanged
se = second;
}
/*
* Prefer last buddy, try to return the CPU to a preempted task.
*/
if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) <<<< cfs_rq->last = 0x0, not true, se unchanged
se = cfs_rq->last;
/*
* Someone really wants this to run. If it's not unfair, run it.
*/
if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) <<<< cfs_rq->next = 0x0, not true, se unchanged
se = cfs_rq->next;
<<<< we reach this point with se = NULL >>>>
clear_buddies(cfs_rq, se); <<<< this doesn't change anything in se
return se; <<<< so we get se=NULL as a result.
}
Is this a corner case nobody thought of, or do we have cfs_rq data that is unexpected in its content?
Thread overview: 11+ messages
2019-12-03 9:11 Schmid, Carsten [this message]
2019-12-03 10:30 ` Crash in fair scheduler Valentin Schneider
2019-12-03 10:40 ` Dietmar Eggemann
2019-12-03 11:09 ` Valentin Schneider
2019-12-03 15:08 ` Dietmar Eggemann
2019-12-03 15:57 ` AW: " Schmid, Carsten
2019-12-03 10:30 ` Peter Zijlstra
2019-12-03 10:51 ` AW: " Schmid, Carsten
2019-12-03 14:01 ` Peter Zijlstra
2019-12-05 10:56 ` AW: " Schmid, Carsten
2019-12-05 17:41 ` Davidlohr Bueso