linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS
@ 2020-06-26  0:47 Dave Chinner
  2020-06-26  5:57 ` Dave Chinner
  2020-06-26  7:33 ` Peter Zijlstra
  0 siblings, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2020-06-26  0:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-xfs, Ingo Molnar, Peter Zijlstra

Hi folks,

I turned on CONFIG_PREEMPT=y yesterday to try to reproduce problems
Darrick was having with one of my patchsets. We've both been seeing
a dead stop panic on these configs, and I managed to find a
relatively reliable reproducer in fstests generic/127. It's
basically just a single fsx process exercising a single file, and it
results in this happening within 15 minutes of starting the test
running in a loop:

[ 1102.169209] BUG: kernel NULL pointer dereference, address: 0000000000000150
[ 1102.171270] #PF: supervisor read access in kernel mode
[ 1102.172894] #PF: error_code(0x0000) - not-present page
[ 1102.174408] PGD 0 P4D 0
[ 1102.175136] Oops: 0000 [#1] PREEMPT SMP
[ 1102.176293] CPU: 2 PID: 909 Comm: kworker/2:1H Not tainted 5.8.0-rc2-dgc+ #2469
[ 1102.178395] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 1102.180762] Workqueue: xfs-log/pmem0 xlog_ioend_work
[ 1102.182286] RIP: 0010:check_preempt_wakeup+0xc8/0x1e0
[ 1102.183804] Code: 39 c2 75 f2 89 d0 39 d0 7d 20 83 ea 01 4d 8b a4 24 48 01 00 00 39 d0 75 f1 eb 0f 48 8b 9b 48 01 00 00 4d 8b a4 24 48 01 00 00 <48> 8b bb 50 01 00 00 49 39 bc 24 b
[ 1102.189125] RSP: 0018:ffffc9000071cea0 EFLAGS: 00010006
[ 1102.190625] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff822305a0
[ 1102.192737] RDX: 0000000000000000 RSI: ffff88853337cd80 RDI: ffff88853ea2a940
[ 1102.194827] RBP: ffffc9000071ced8 R08: ffffffff822305a0 R09: ffff88853ec2b2d0
[ 1102.196886] R10: ffff88800f74b010 R11: ffff88853ec2a970 R12: 0000000000000000
[ 1102.199040] R13: ffff88853ea2a8c0 R14: 0000000000000001 R15: ffff88853e3b0000
[ 1102.200883] FS:  0000000000000000(0000) GS:ffff88853ea00000(0000) knlGS:0000000000000000
[ 1102.203306] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1102.205024] CR2: 0000000000000150 CR3: 00000000ae7b5004 CR4: 0000000000060ee0
[ 1102.207117] Call Trace:
[ 1102.207895]  <IRQ>
[ 1102.208500]  ? enqueue_task_fair+0x1d7/0x9f0
[ 1102.209709]  check_preempt_curr+0x74/0x80
[ 1102.210931]  ttwu_do_wakeup+0x1e/0x170
[ 1102.212064]  ttwu_do_activate+0x5b/0x70
[ 1102.213225]  sched_ttwu_pending+0x94/0xe0
[ 1102.214410]  flush_smp_call_function_queue+0xf1/0x190
[ 1102.215885]  generic_smp_call_function_single_interrupt+0x13/0x20
[ 1102.217790]  __sysvec_call_function_single+0x2b/0xe0
[ 1102.219375]  asm_call_on_stack+0xf/0x20
[ 1102.220599]  </IRQ>
[ 1102.221280]  sysvec_call_function_single+0x7e/0x90
[ 1102.222854]  asm_sysvec_call_function_single+0x12/0x20
[ 1102.224515] RIP: 0010:_raw_spin_unlock_irqrestore+0x14/0x30
[ 1102.226350] Code: e8 e8 20 25 ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 66 66 66 66 90 55 48 89 e5 53 48 89 f3 e8 5e 11 39 ff 53 9d <65> ff 0d c5 72 26 7e 74 03 5b 5d f
[ 1102.232213] RSP: 0018:ffffc900020b7cc8 EFLAGS: 00000246
[ 1102.233902] RAX: 0000000000000001 RBX: 0000000000000246 RCX: 0000000000000000
[ 1102.236134] RDX: 0000000000000002 RSI: 0000000000000246 RDI: ffff88852679a400
[ 1102.238402] RBP: ffffc900020b7cd0 R08: ffff88852679a400 R09: ffffc900020b7ce8
[ 1102.240598] R10: ffff88852a04e480 R11: 0000000000000001 R12: 00000000ffffffff
[ 1102.242906] R13: 0000000000000246 R14: 0000000000000000 R15: 0000000000000003
[ 1102.245191]  __wake_up_common_lock+0x8a/0xc0
[ 1102.246572]  __wake_up+0x13/0x20
[ 1102.247636]  xlog_state_clean_iclog+0xf7/0x1a0
[ 1102.249075]  xlog_state_do_callback+0x257/0x300
[ 1102.250548]  xlog_state_done_syncing+0x69/0xb0
[ 1102.251958]  xlog_ioend_work+0x6c/0xc0
[ 1102.253151]  process_one_work+0x1a6/0x390
[ 1102.254403]  worker_thread+0x50/0x3b0
[ 1102.255595]  ? process_one_work+0x390/0x390
[ 1102.256913]  kthread+0x131/0x170
[ 1102.257993]  ? __kthread_create_on_node+0x1b0/0x1b0
[ 1102.259546]  ret_from_fork+0x1f/0x30
[ 1102.260707] CR2: 0000000000000150
[ 1102.261779] ---[ end trace d5f0aeef2eb333bd ]---
[ 1102.263238] RIP: 0010:check_preempt_wakeup+0xc8/0x1e0
[ 1102.264848] Code: 39 c2 75 f2 89 d0 39 d0 7d 20 83 ea 01 4d 8b a4 24 48 01 00 00 39 d0 75 f1 eb 0f 48 8b 9b 48 01 00 00 4d 8b a4 24 48 01 00 00 <48> 8b bb 50 01 00 00 49 39 bc 24 b
[ 1102.270645] RSP: 0018:ffffc9000071cea0 EFLAGS: 00010006
[ 1102.272237] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff822305a0
[ 1102.274439] RDX: 0000000000000000 RSI: ffff88853337cd80 RDI: ffff88853ea2a940
[ 1102.276613] RBP: ffffc9000071ced8 R08: ffffffff822305a0 R09: ffff88853ec2b2d0
[ 1102.278797] R10: ffff88800f74b010 R11: ffff88853ec2a970 R12: 0000000000000000
[ 1102.280930] R13: ffff88853ea2a8c0 R14: 0000000000000001 R15: ffff88853e3b0000
[ 1102.283094] FS:  0000000000000000(0000) GS:ffff88853ea00000(0000) knlGS:0000000000000000
[ 1102.285648] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1102.287415] CR2: 0000000000000150 CR3: 00000000ae7b5004 CR4: 0000000000060ee0
[ 1102.289606] Kernel panic - not syncing: Fatal exception in interrupt
[ 1102.291850] Kernel Offset: disabled
[ 1102.293002] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

This was from a vanilla 5.8-rc2 kernel; a current Linus tree also
fails like this. It looks like it is taking a scheduler preempt IPI
and trying to do a task wakeup while already processing task wakeups
on that CPU....

A 5.7 kernel survived for about 20 minutes - not conclusive that the
bug didn't exist on that kernel, but in general it reproduces within
5 minutes of starting the test looping.  Preempt-related config
options:

$ grep PREEMPT .config
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_PREEMPTIRQ_EVENTS is not set
# CONFIG_PREEMPT_TRACER is not set
$

I just reproduced it on commit d479c5a1919b ("Merge tag
'sched-core-2020-06-02' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip"), and I'm
going to try to bisect it. I'm not sure this is going to be
reliable, because running for 20+ minutes isn't a guarantee the
problem isn't there...

More info will follow as I find it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS
  2020-06-26  0:47 [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS Dave Chinner
@ 2020-06-26  5:57 ` Dave Chinner
  2020-06-26  7:33 ` Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2020-06-26  5:57 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-xfs, Ingo Molnar, Peter Zijlstra

On Fri, Jun 26, 2020 at 10:47:22AM +1000, Dave Chinner wrote:
> Hi folks,
> 
> I turned on CONFIG_PREEMPT=y yesterday to try to reproduce problems
> Darrick was having with one of my patchsets. We've both been seeing
> a dead stop panic on these configs, and I managed to find a
> relatively reliable reproducer in fstests generic/127. It's
> basically just a single fsx process exercising a single file, and it
> results in this happening within 15 minutes of starting the test
> running in a loop:
> 
> [ 1102.169209] BUG: kernel NULL pointer dereference, address: 0000000000000150
> [ 1102.171270] #PF: supervisor read access in kernel mode
> [ 1102.172894] #PF: error_code(0x0000) - not-present page
> [ 1102.174408] PGD 0 P4D 0
> [ 1102.175136] Oops: 0000 [#1] PREEMPT SMP
> [ 1102.176293] CPU: 2 PID: 909 Comm: kworker/2:1H Not tainted 5.8.0-rc2-dgc+ #2469
> [ 1102.178395] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [ 1102.180762] Workqueue: xfs-log/pmem0 xlog_ioend_work
> [ 1102.182286] RIP: 0010:check_preempt_wakeup+0xc8/0x1e0
> [ 1102.183804] Code: 39 c2 75 f2 89 d0 39 d0 7d 20 83 ea 01 4d 8b a4 24 48 01 00 00 39 d0 75 f1 eb 0f 48 8b 9b 48 01 00 00 4d 8b a4 24 48 01 00 00 <48> 8b bb 50 01 00 00 49 39 bc 24 b
> [ 1102.189125] RSP: 0018:ffffc9000071cea0 EFLAGS: 00010006
> [ 1102.190625] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff822305a0
> [ 1102.192737] RDX: 0000000000000000 RSI: ffff88853337cd80 RDI: ffff88853ea2a940
> [ 1102.194827] RBP: ffffc9000071ced8 R08: ffffffff822305a0 R09: ffff88853ec2b2d0
> [ 1102.196886] R10: ffff88800f74b010 R11: ffff88853ec2a970 R12: 0000000000000000
> [ 1102.199040] R13: ffff88853ea2a8c0 R14: 0000000000000001 R15: ffff88853e3b0000
> [ 1102.200883] FS:  0000000000000000(0000) GS:ffff88853ea00000(0000) knlGS:0000000000000000
> [ 1102.203306] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1102.205024] CR2: 0000000000000150 CR3: 00000000ae7b5004 CR4: 0000000000060ee0
> [ 1102.207117] Call Trace:
> [ 1102.207895]  <IRQ>
> [ 1102.208500]  ? enqueue_task_fair+0x1d7/0x9f0
> [ 1102.209709]  check_preempt_curr+0x74/0x80
> [ 1102.210931]  ttwu_do_wakeup+0x1e/0x170
> [ 1102.212064]  ttwu_do_activate+0x5b/0x70
> [ 1102.213225]  sched_ttwu_pending+0x94/0xe0
> [ 1102.214410]  flush_smp_call_function_queue+0xf1/0x190
> [ 1102.215885]  generic_smp_call_function_single_interrupt+0x13/0x20
> [ 1102.217790]  __sysvec_call_function_single+0x2b/0xe0
> [ 1102.219375]  asm_call_on_stack+0xf/0x20
> [ 1102.220599]  </IRQ>
> [ 1102.221280]  sysvec_call_function_single+0x7e/0x90
> [ 1102.222854]  asm_sysvec_call_function_single+0x12/0x20
> [ 1102.224515] RIP: 0010:_raw_spin_unlock_irqrestore+0x14/0x30
> [ 1102.226350] Code: e8 e8 20 25 ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 66 66 66 66 90 55 48 89 e5 53 48 89 f3 e8 5e 11 39 ff 53 9d <65> ff 0d c5 72 26 7e 74 03 5b 5d f
> [ 1102.232213] RSP: 0018:ffffc900020b7cc8 EFLAGS: 00000246
> [ 1102.233902] RAX: 0000000000000001 RBX: 0000000000000246 RCX: 0000000000000000
> [ 1102.236134] RDX: 0000000000000002 RSI: 0000000000000246 RDI: ffff88852679a400
> [ 1102.238402] RBP: ffffc900020b7cd0 R08: ffff88852679a400 R09: ffffc900020b7ce8
> [ 1102.240598] R10: ffff88852a04e480 R11: 0000000000000001 R12: 00000000ffffffff
> [ 1102.242906] R13: 0000000000000246 R14: 0000000000000000 R15: 0000000000000003
> [ 1102.245191]  __wake_up_common_lock+0x8a/0xc0
> [ 1102.246572]  __wake_up+0x13/0x20
> [ 1102.247636]  xlog_state_clean_iclog+0xf7/0x1a0
> [ 1102.249075]  xlog_state_do_callback+0x257/0x300
> [ 1102.250548]  xlog_state_done_syncing+0x69/0xb0
> [ 1102.251958]  xlog_ioend_work+0x6c/0xc0
> [ 1102.253151]  process_one_work+0x1a6/0x390
> [ 1102.254403]  worker_thread+0x50/0x3b0
> [ 1102.255595]  ? process_one_work+0x390/0x390
> [ 1102.256913]  kthread+0x131/0x170
> [ 1102.257993]  ? __kthread_create_on_node+0x1b0/0x1b0
> [ 1102.259546]  ret_from_fork+0x1f/0x30
> [ 1102.260707] CR2: 0000000000000150
> [ 1102.261779] ---[ end trace d5f0aeef2eb333bd ]---
> [ 1102.263238] RIP: 0010:check_preempt_wakeup+0xc8/0x1e0
> [ 1102.264848] Code: 39 c2 75 f2 89 d0 39 d0 7d 20 83 ea 01 4d 8b a4 24 48 01 00 00 39 d0 75 f1 eb 0f 48 8b 9b 48 01 00 00 4d 8b a4 24 48 01 00 00 <48> 8b bb 50 01 00 00 49 39 bc 24 b
> [ 1102.270645] RSP: 0018:ffffc9000071cea0 EFLAGS: 00010006
> [ 1102.272237] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff822305a0
> [ 1102.274439] RDX: 0000000000000000 RSI: ffff88853337cd80 RDI: ffff88853ea2a940
> [ 1102.276613] RBP: ffffc9000071ced8 R08: ffffffff822305a0 R09: ffff88853ec2b2d0
> [ 1102.278797] R10: ffff88800f74b010 R11: ffff88853ec2a970 R12: 0000000000000000
> [ 1102.280930] R13: ffff88853ea2a8c0 R14: 0000000000000001 R15: ffff88853e3b0000
> [ 1102.283094] FS:  0000000000000000(0000) GS:ffff88853ea00000(0000) knlGS:0000000000000000
> [ 1102.285648] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1102.287415] CR2: 0000000000000150 CR3: 00000000ae7b5004 CR4: 0000000000060ee0
> [ 1102.289606] Kernel panic - not syncing: Fatal exception in interrupt
> [ 1102.291850] Kernel Offset: disabled
> [ 1102.293002] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
> 
> This was from a vanilla 5.8-rc2 kernel; a current Linus tree also
> fails like this. It looks like it is taking a scheduler preempt IPI
> and trying to do a task wakeup while already processing task wakeups
> on that CPU....
> 
> A 5.7 kernel survived for about 20 minutes - not conclusive that the
> bug didn't exist on that kernel, but in general it reproduces within
> 5 minutes of starting the test looping.  Preempt-related config
> options:
> 
> $ grep PREEMPT .config
> # CONFIG_PREEMPT_NONE is not set
> # CONFIG_PREEMPT_VOLUNTARY is not set
> CONFIG_PREEMPT=y
> CONFIG_PREEMPT_COUNT=y
> CONFIG_PREEMPTION=y
> CONFIG_PREEMPT_RCU=y
> CONFIG_PREEMPT_NOTIFIERS=y
> # CONFIG_DEBUG_PREEMPT is not set
> # CONFIG_PREEMPTIRQ_EVENTS is not set
> # CONFIG_PREEMPT_TRACER is not set
> $
> 
> I just reproduced it on commit d479c5a1919b ("Merge tag
> 'sched-core-2020-06-02' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip"), and I'm
> going to try to bisect it. I'm not sure this is going to be
> reliable, because running for 20+ minutes isn't a guarantee the
> problem isn't there...
> 
> More info will follow as I find it.

Ok, so this is definitely caused by a commit inside the above merge;
the prior merge commit does not show this issue.

After wasting the entire day on two failed bisects (i.e. three hours
of testing to get to a "first bad commit", then testing the previous
"good" commit and watching it fail instantly), manual testing
indicates that this commit:

2ebb17717550 sched/core: Offload wakee task activation if it the wakee is descheduling

is the first commit on which I have been able to reproduce the failure.
The prior commit:

c6e7bd7afaeb sched/core: Optimize ttwu() spinning on p->on_cpu

has run for almost an hour now without failing, and it was tested in
one of the bisects for half an hour without failing, too.

Hence I'd say it's most likely a regression introduced by
2ebb17717550.

Cheers,

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS
  2020-06-26  0:47 [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS Dave Chinner
  2020-06-26  5:57 ` Dave Chinner
@ 2020-06-26  7:33 ` Peter Zijlstra
  2020-06-26 22:32   ` Dave Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2020-06-26  7:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-xfs, Ingo Molnar

On Fri, Jun 26, 2020 at 10:47:22AM +1000, Dave Chinner wrote:
> [ 1102.169209] BUG: kernel NULL pointer dereference, address: 0000000000000150
> [ 1102.171270] #PF: supervisor read access in kernel mode
> [ 1102.172894] #PF: error_code(0x0000) - not-present page
> [ 1102.174408] PGD 0 P4D 0
> [ 1102.175136] Oops: 0000 [#1] PREEMPT SMP
> [ 1102.176293] CPU: 2 PID: 909 Comm: kworker/2:1H Not tainted 5.8.0-rc2-dgc+ #2469
> [ 1102.178395] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [ 1102.180762] Workqueue: xfs-log/pmem0 xlog_ioend_work
> [ 1102.182286] RIP: 0010:check_preempt_wakeup+0xc8/0x1e0
> [ 1102.183804] Code: 39 c2 75 f2 89 d0 39 d0 7d 20 83 ea 01 4d 8b a4 24 48 01 00 00 39 d0 75 f1 eb 0f 48 8b 9b 48 01 00 00 4d 8b a4 24 48 01 00 00 <48> 8b bb 50 01 00 00 49 39 bc 24 b
> [ 1102.189125] RSP: 0018:ffffc9000071cea0 EFLAGS: 00010006
> [ 1102.190625] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff822305a0
> [ 1102.192737] RDX: 0000000000000000 RSI: ffff88853337cd80 RDI: ffff88853ea2a940
> [ 1102.194827] RBP: ffffc9000071ced8 R08: ffffffff822305a0 R09: ffff88853ec2b2d0
> [ 1102.196886] R10: ffff88800f74b010 R11: ffff88853ec2a970 R12: 0000000000000000
> [ 1102.199040] R13: ffff88853ea2a8c0 R14: 0000000000000001 R15: ffff88853e3b0000
> [ 1102.200883] FS:  0000000000000000(0000) GS:ffff88853ea00000(0000) knlGS:0000000000000000
> [ 1102.203306] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1102.205024] CR2: 0000000000000150 CR3: 00000000ae7b5004 CR4: 0000000000060ee0
> [ 1102.207117] Call Trace:
> [ 1102.207895]  <IRQ>
> [ 1102.208500]  ? enqueue_task_fair+0x1d7/0x9f0
> [ 1102.209709]  check_preempt_curr+0x74/0x80
> [ 1102.210931]  ttwu_do_wakeup+0x1e/0x170
> [ 1102.212064]  ttwu_do_activate+0x5b/0x70
> [ 1102.213225]  sched_ttwu_pending+0x94/0xe0
> [ 1102.214410]  flush_smp_call_function_queue+0xf1/0x190
> [ 1102.215885]  generic_smp_call_function_single_interrupt+0x13/0x20
> [ 1102.217790]  __sysvec_call_function_single+0x2b/0xe0
> [ 1102.219375]  asm_call_on_stack+0xf/0x20
> [ 1102.220599]  </IRQ>
> [ 1102.221280]  sysvec_call_function_single+0x7e/0x90
> [ 1102.222854]  asm_sysvec_call_function_single+0x12/0x20

https://git.kernel.org/tip/964ed98b075263faabe416eeebac99a9bef3f06c

Should be headed to Linus soon.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS
  2020-06-26  7:33 ` Peter Zijlstra
@ 2020-06-26 22:32   ` Dave Chinner
  2020-06-27 18:30     ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2020-06-26 22:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, linux-xfs, Ingo Molnar

On Fri, Jun 26, 2020 at 09:33:45AM +0200, Peter Zijlstra wrote:
> On Fri, Jun 26, 2020 at 10:47:22AM +1000, Dave Chinner wrote:
> > [ 1102.169209] BUG: kernel NULL pointer dereference, address: 0000000000000150
> > [ 1102.171270] #PF: supervisor read access in kernel mode
> > [ 1102.172894] #PF: error_code(0x0000) - not-present page
> > [ 1102.174408] PGD 0 P4D 0
> > [ 1102.175136] Oops: 0000 [#1] PREEMPT SMP
> > [ 1102.176293] CPU: 2 PID: 909 Comm: kworker/2:1H Not tainted 5.8.0-rc2-dgc+ #2469
> > [ 1102.178395] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > [ 1102.180762] Workqueue: xfs-log/pmem0 xlog_ioend_work
> > [ 1102.182286] RIP: 0010:check_preempt_wakeup+0xc8/0x1e0
> > [ 1102.183804] Code: 39 c2 75 f2 89 d0 39 d0 7d 20 83 ea 01 4d 8b a4 24 48 01 00 00 39 d0 75 f1 eb 0f 48 8b 9b 48 01 00 00 4d 8b a4 24 48 01 00 00 <48> 8b bb 50 01 00 00 49 39 bc 24 b
> > [ 1102.189125] RSP: 0018:ffffc9000071cea0 EFLAGS: 00010006
> > [ 1102.190625] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff822305a0
> > [ 1102.192737] RDX: 0000000000000000 RSI: ffff88853337cd80 RDI: ffff88853ea2a940
> > [ 1102.194827] RBP: ffffc9000071ced8 R08: ffffffff822305a0 R09: ffff88853ec2b2d0
> > [ 1102.196886] R10: ffff88800f74b010 R11: ffff88853ec2a970 R12: 0000000000000000
> > [ 1102.199040] R13: ffff88853ea2a8c0 R14: 0000000000000001 R15: ffff88853e3b0000
> > [ 1102.200883] FS:  0000000000000000(0000) GS:ffff88853ea00000(0000) knlGS:0000000000000000
> > [ 1102.203306] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1102.205024] CR2: 0000000000000150 CR3: 00000000ae7b5004 CR4: 0000000000060ee0
> > [ 1102.207117] Call Trace:
> > [ 1102.207895]  <IRQ>
> > [ 1102.208500]  ? enqueue_task_fair+0x1d7/0x9f0
> > [ 1102.209709]  check_preempt_curr+0x74/0x80
> > [ 1102.210931]  ttwu_do_wakeup+0x1e/0x170
> > [ 1102.212064]  ttwu_do_activate+0x5b/0x70
> > [ 1102.213225]  sched_ttwu_pending+0x94/0xe0
> > [ 1102.214410]  flush_smp_call_function_queue+0xf1/0x190
> > [ 1102.215885]  generic_smp_call_function_single_interrupt+0x13/0x20
> > [ 1102.217790]  __sysvec_call_function_single+0x2b/0xe0
> > [ 1102.219375]  asm_call_on_stack+0xf/0x20
> > [ 1102.220599]  </IRQ>
> > [ 1102.221280]  sysvec_call_function_single+0x7e/0x90
> > [ 1102.222854]  asm_sysvec_call_function_single+0x12/0x20
> 
> https://git.kernel.org/tip/964ed98b075263faabe416eeebac99a9bef3f06c
> 
> Should be headed to Linus soon.

Testing it now.

Observation from the outside:

"However I'm having trouble convincing myself that's actually
possible on x86_64.... "

This scheduler code has fallen off a really high ledge on the memory
barrier cliff, hasn't it?

Having looked at this code over the past 24 hours and the recent
history, I know that understanding it - let alone debugging and
fixing problems in it - is way beyond my capabilities.  And I say
that as an experienced kernel developer with a pretty good grasp of
concurrent programming and a record of implementing a fair number of
non-trivial lockless algorithms over the years....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS
  2020-06-26 22:32   ` Dave Chinner
@ 2020-06-27 18:30     ` Peter Zijlstra
  2020-06-29 23:55       ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2020-06-27 18:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-xfs, Ingo Molnar

On Sat, Jun 27, 2020 at 08:32:54AM +1000, Dave Chinner wrote:
> Observation from the outside:
> 
> "However I'm having trouble convincing myself that's actually
> possible on x86_64.... "

Using the weaker rules of LKMM (as relevant to Power) I could in fact
make it happen; the 'problem' is that it's being observed on the much
stronger x86_64.

So possibly I did overlook a more 'sensible' scenario, but I'm pretty
confident the problem holds as it fully explains the failure mode.

> This scheduler code has fallen off a really high ledge on the memory
> barrier cliff, hasn't it?

Just a wee bit.. I did need pen and paper and a fair amount of
scribbling for this one.

> Having looked at this code over the past 24 hours and the recent
> history, I know that understanding it - let alone debugging and
> fixing problems in it - is way beyond my capabilities.  And I say
> that as an experienced kernel developer with a pretty good grasp of
> concurrent programming and a record of implementing a fair number of
> non-trivial lockless algorithms over the years....

All in the name of making it go fast, I suppose. It used to be much
simpler... like much of the kernel.

The biggest problem I had with this thing was that the reproduction case
we had (Paul's rcutorture) wouldn't readily trigger on my machines
(although it did, at a much lower rate: just twice in a week's worth
of runtime).

Also; I'm sure you can spot a problem in the I/O layer much faster than
I possibly could :-)

Anyway, let me know if you still observe any problems.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS
  2020-06-27 18:30     ` Peter Zijlstra
@ 2020-06-29 23:55       ` Dave Chinner
  2020-06-30  8:57         ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2020-06-29 23:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, linux-xfs, Ingo Molnar

On Sat, Jun 27, 2020 at 08:30:42PM +0200, Peter Zijlstra wrote:
> On Sat, Jun 27, 2020 at 08:32:54AM +1000, Dave Chinner wrote:
> > Observation from the outside:
> > 
> > "However I'm having trouble convincing myself that's actually
> > possible on x86_64.... "
> 
> Using the weaker rules of LKMM (as relevant to Power) I could in fact
> make it happen, the 'problem' is that it's being observed on the much
> stronger x86_64.

Yes, I understand just enough about the LKMM(*) for this statement
to scare the crap out of me. :(

(*) "understand just enough" == write litmus tests to attempt to
validate memory barrier constructs I dream up....
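
For readers unfamiliar with the format: a litmus test here is a small
snippet fed to the LKMM's herd7 tool, which enumerates every outcome
the memory model permits. A sketch of the classic message-passing
shape at issue in this thread, modelled on the MP tests shipped under
tools/memory-model/litmus-tests/ in the kernel tree (names are
illustrative):

```
C MP-sketch

{}

P0(int *buf, int *flag)
{
	WRITE_ONCE(*buf, 1);	/* publish the data...          */
	WRITE_ONCE(*flag, 1);	/* ...then raise the flag, with
				 * no barrier in between        */
}

P1(int *buf, int *flag)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*flag);	/* no barrier between the loads */
	r1 = READ_ONCE(*buf);
}

exists (1:r0=1 /\ 1:r1=0)	/* flag seen, but stale data?   */
```

Run through herd7 with the kernel memory model, the "exists" outcome
is reported as reachable (weakly ordered machines like Power may
reorder either side), whereas x86's TSO rules it out; adding
smp_wmb()/smp_rmb(), or a release store paired with an acquire load,
makes the model forbid it everywhere - exactly the x86-vs-Power
distinction this thread turns on.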

> > Having looked at this code over the past 24 hours and the recent
> > history, I know that understanding it - let alone debugging and
> > fixing problems in it - is way beyond my capabilities.  And I say
> > that as an experienced kernel developer with a pretty good grasp
> > of concurrent programming and a record of implementing a fair
> > number of non-trivial lockless algorithms over the years....
> 
> All in the name of making it go fast, I suppose. It used to be
> much simpler... like much of the kernel.

Yup, and we're creating our own code maintenance nightmare as we go.

> The biggest problem I had with this thing was that the reproduction case
> we had (Paul's rcutorture) wouldn't readily trigger on my machines
> (although it did, at a much lower rate: just twice in a week's worth
> of runtime).
> 
> Also; I'm sure you can spot a problem in the I/O layer much faster than
> I possibly could :-)

Sure, but that misses the point I was making.

I regularly have to look deep into other subsystems to work out what
problem the filesystem is tripping over. I'm regularly
looking into parts of the IO stack, memory management, page
allocators, locking and atomics, workqueues, the scheduler, etc
because XFS makes extensive (and complex) use of the infrastructure
they provide. That means to debug filesystem issues, I have to be
able to understand what that infrastructure is trying to do and make
judgements as to whether that code is behaving correctly or not.

And so when I find a reproducer for a bug that takes 20s to
reproduce and it points me at code that I honestly have no hope of
understanding well enough to determine if it is working correctly or
not, then we have a problem.  A lot of my time is spent doing root
cause analysis proving that such issues are -not- filesystem
problems (they just had "xfs" in the stack trace), hence being able
to read and understand the code in related core subsystems is
extremely important to performing my day job.

If more kernel code falls off the memory barrier cliff like this,
then the ability of people like me to find the root cause of complex
issues is going to be massively reduced. Writing code so smart that
almost no-one else can understand it has always been a bad thing, and
memory barriers only make this problem worse... :(

> Anyway, let me know if you still observe any problems.

Seems to be solid so far. Thanks Peter!

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS
  2020-06-29 23:55       ` Dave Chinner
@ 2020-06-30  8:57         ` Peter Zijlstra
  2020-07-01  2:26           ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2020-06-30  8:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-xfs, Ingo Molnar

On Tue, Jun 30, 2020 at 09:55:33AM +1000, Dave Chinner wrote:
> Sure, but that misses the point I was making.
> 
> I regularly have to look deep into other subsystems to work out what
> problem the filesystem is tripping over. I'm regularly
> looking into parts of the IO stack, memory management, page
> allocators, locking and atomics, workqueues, the scheduler, etc
> because XFS makes extensive (and complex) use of the infrastructure
> they provide. That means to debug filesystem issues, I have to be
> able to understand what that infrastructure is trying to do and make
> judgements as to whether that code is behaving correctly or not.
> 
> And so when I find a reproducer for a bug that takes 20s to
> reproduce and it points me at code that I honestly have no hope of

20s would've been nice to have a week and a half ago; the reproducer I
debugged this with took days to trigger... ah well, such is life.

> understanding well enough to determine if it is working correctly or
> not, then we have a problem.  A lot of my time is spent doing root
> cause analysis proving that such issues are -not- filesystem
> problems (they just had "xfs" in the stack trace), hence being able
> to read and understand the code in related core subsystems is
> extremely important to performing my day job.
> 
> If more kernel code falls off the memory barrier cliff like this,
> then the ability of people like me to find the root cause of complex
> issues is going to be massively reduced. Writing code so smart that
> almost no-one else can understand it has always been a bad thing, and
> memory barriers only make this problem worse... :(

How about you try and give me a hint about where you gave up and I'll
try and write better comments?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS
  2020-06-30  8:57         ` Peter Zijlstra
@ 2020-07-01  2:26           ` Dave Chinner
  2020-07-01  8:02             ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2020-07-01  2:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, linux-xfs, Ingo Molnar

On Tue, Jun 30, 2020 at 10:57:32AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 30, 2020 at 09:55:33AM +1000, Dave Chinner wrote:
> > Sure, but that misses the point I was making.
> > 
> > I regularly have to look deep into other subsystems to work out what
> > problem the filesystem is tripping over. I'm regularly
> > looking into parts of the IO stack, memory management, page
> > allocators, locking and atomics, workqueues, the scheduler, etc
> > because XFS makes extensive (and complex) use of the infrastructure
> > they provide. That means to debug filesystem issues, I have to be
> > able to understand what that infrastructure is trying to do and make
> > judgements as to whether that code is behaving correctly or not.
> > 
> > And so when I find a reproducer for a bug that takes 20s to
> > reproduce and it points me at code that I honestly have no hope of
> 
> 20s would've been nice to have a week and a half ago; the reproducer I
> debugged this with took days to trigger... ah well, such is life.
> 
> > understanding well enough to determine if it is working correctly or
> > not, then we have a problem.  A lot of my time is spent doing root
> > cause analysis proving that such issues are -not- filesystem
> > problems (they just had "xfs" in the stack trace), hence being able
> > to read and understand the code in related core subsystems is
> > extremely important to performing my day job.
> > 
> > If more kernel code falls off the memory barrier cliff like this,
> > then the ability of people like me to find the root cause of complex
> > issues is going to be massively reduced. Writing code so smart that
> > almost no-one else can understand it has always been a bad thing, and
> > memory barriers only make this problem worse... :(
> 
> How about you try and give me a hint about where you gave up and I'll
> try and write better comments?

Hard to explain. Background: we (XFS developers) got badly burnt a
few years back by memory barrier bugs in rwsems that we could not
prove were memory barrier bugs in rwsems as instrumentation made
them go away. But they were most definitely bugs in rwsems that the
maintainers said could not exist. While I could read the rwsem
code and understand how it was supposed to work, identifying the
missing memory barrier in the code was beyond my capability. It was
also beyond the capability of the people who wrote the code, too.

As such, I'm extremely sceptical that maintainers actually
understand their complex ordering constructs as well as they think
they do, and your comments about "can't explain how it can happen on
x86-64" do nothing but reinforce my scepticism.

To your specific question: I gave up looking at the code when I
realised I had no idea what relationships between objects and
logic the memory barriers were ordering against, nor which
fields within objects were protected by locks, acquire/release
dependencies, explicit memory barriers, some combination of all
three, or even something subtle enough that I hadn't noticed yet.

Yes, the code explains the ordering constraints between
object A and object B, but there's nothing to actually explain
what, say, p->on_cpu means, what its valid states are, what the
different values mean, and how it relates to, say, p->on_rq...

e.g. take this comment in ttwu:

       /*
         * Ensure we load p->on_rq _after_ p->state, otherwise it would
         * be possible to, falsely, observe p->on_rq == 0 and get stuck
         * in smp_cond_load_acquire() below.
         *
         * sched_ttwu_pending()                 try_to_wake_up()
         *   STORE p->on_rq = 1                   LOAD p->state
         *   UNLOCK rq->lock
         *
         * __schedule() (switch to task 'p')
         *   LOCK rq->lock                        smp_rmb();
         *   smp_mb__after_spinlock();
         *   UNLOCK rq->lock
         *
         * [task p]
         *   STORE p->state = UNINTERRUPTIBLE     LOAD p->on_rq
         *
         * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
         * __schedule().  See the comment for smp_mb__after_spinlock().
         *
         * A similar smp_rmb() lives in try_invoke_on_locked_down_task().
         */
        smp_rmb();

The comment explains -what the code is ordering against-. I
understand that this is a modified message passing pattern that
requires explicit memory barriers because there isn't a direct
release/acquire relationship between store and load of the different
objects.
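For illustration, that message-passing pattern can be reproduced in
userspace with C11 atomics; a minimal sketch (the variable names are
illustrative stand-ins, not the kernel's):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;             /* plain data, the "p->on_rq" side  */
static atomic_int flag;         /* the signal, the "p->state" side  */

static void *producer(void *arg)
{
        payload = 42;                               /* STORE data   */
        atomic_thread_fence(memory_order_release);  /* order data before flag */
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
        return NULL;
}

/* Consumer: without the acquire fence, the load of 'payload' could
 * be reordered before the load of 'flag' and observe stale data. */
int message_passing_demo(void)
{
        pthread_t t;
        int v;

        pthread_create(&t, NULL, producer, NULL);
        while (!atomic_load_explicit(&flag, memory_order_relaxed))
                ;                                   /* spin: LOAD flag */
        atomic_thread_fence(memory_order_acquire);  /* the smp_rmb() analogue */
        v = payload;                                /* guaranteed 42   */
        pthread_join(t, NULL);
        return v;
}
```

The release fence before the flag store pairs with the acquire fence
after the flag load, which is exactly the store/load pairing the
kernel comment's two-column diagram describes.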

I then spent an hour learning about smp_cond_load_acquire() because this
is the first time I'd seen that construct and I had no idea what it did.
But then I understood what that specific set of ordering constraints
actually does.
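For anyone else meeting it cold: smp_cond_load_acquire(ptr, cond)
spins loading *ptr until the condition (with VAL bound to the loaded
value) holds, and the final load has ACQUIRE semantics. A rough
userspace analogue with C11 atomics (illustrative only, not the
kernel implementation):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Rough analogue of smp_cond_load_acquire(&x, !VAL): spin until x
 * becomes zero, with acquire ordering on the final read, so every
 * store the other side made before clearing x is visible here. */
static int cond_load_acquire_until_zero(atomic_int *ptr)
{
        int val;

        do {
                val = atomic_load_explicit(ptr, memory_order_relaxed);
        } while (val != 0);
        atomic_thread_fence(memory_order_acquire);
        return val;
}

static atomic_int on_cpu = 1;   /* stand-in for p->on_cpu          */
static int ctx;                 /* state published before clearing */

static void *prev_task(void *arg)
{
        ctx = 7;                /* "schedule out": write state      */
        atomic_store_explicit(&on_cpu, 0, memory_order_release);
        return NULL;
}

int cond_load_demo(void)
{
        pthread_t t;
        int seen;

        pthread_create(&t, NULL, prev_task, NULL);
        assert(cond_load_acquire_until_zero(&on_cpu) == 0);
        seen = ctx;             /* must observe 7                   */
        pthread_join(t, NULL);
        return seen;
}
```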

And then I completely didn't understand this code, because the code
the comment references is this:

	smp_cond_load_acquire(&p->on_cpu, !VAL);

that's *on_cpu* that the load is being done from, but the comment
that references the load_acquire is talking about ordering *on_rq*
against p->state.

So after thinking I understood what the code is doing, I realise
either the comment is wrong or *I don't have a clue what this code
is doing*. And that I have no way of finding out which it might be
from looking at the code.

Contrast that to code that just uses basic locks.

Putting the little pieces of critical sections together is
relatively straightforward. The code inside the locks is what is
being serialised, the memory barriers are unlock-to-lock, and so
assumptions about how entire objects interact can be made; in
general they will be valid because critical sections are
straightforward and obvious.
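e.g. the whole-object reasoning a single lock buys you, as a toy
sketch:

```c
#include <assert.h>
#include <pthread.h>

/* With one lock, "fields a and b always move together" is easy to
 * verify: every access site is a critical section, and the implied
 * unlock->lock ordering means each section sees the previous one's
 * stores in full. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct { int a, b; } obj;        /* invariant: a == b */

static void *bump(void *arg)
{
        for (int i = 0; i < 100000; i++) {
                pthread_mutex_lock(&lock);
                obj.a++;
                obj.b++;        /* invariant restored before unlock */
                pthread_mutex_unlock(&lock);
        }
        return NULL;
}

int locked_invariant_demo(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return obj.a == obj.b && obj.a == 200000;
}
```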

For more complex locking constructs - nesting of locks, different
locks protecting different parts of the same objects, etc. - we end
up documenting what fields in major objects are protected by which
lock and what order the locks nest in. e.g. in fs/inode.c:

/*
 * Inode locking rules:
 *
 * inode->i_lock protects:
 *   inode->i_state, inode->i_hash, __iget()
 * Inode LRU list locks protect:
 *   inode->i_sb->s_inode_lru, inode->i_lru
 * inode->i_sb->s_inode_list_lock protects:
 *   inode->i_sb->s_inodes, inode->i_sb_list
 * bdi->wb.list_lock protects:
 *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
 * inode_hash_lock protects:
 *   inode_hashtable, inode->i_hash
 *
 * Lock ordering:
 *
 * inode->i_sb->s_inode_list_lock
 *   inode->i_lock
 *     Inode LRU list locks
 *
 * bdi->wb.list_lock
 *   inode->i_lock
 *
 * inode_hash_lock
 *   inode->i_sb->s_inode_list_lock
 *   inode->i_lock
 *
 * iunique_lock
 *   inode_hash_lock
 */
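A documented hierarchy like this maps directly onto code: any path
that needs several of the locks takes them in the listed order, which
rules out ABBA deadlocks by construction. A schematic sketch (pthread
stand-ins, not the actual VFS code):

```c
#include <assert.h>
#include <pthread.h>

/* Stand-ins for the documented hierarchy:
 *   s_inode_list_lock -> i_lock -> lru_lock
 * Every path that needs more than one of them nests in this order. */
static pthread_mutex_t s_inode_list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t i_lock           = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lru_lock         = PTHREAD_MUTEX_INITIALIZER;
static int sections_run;

static void walk_sb_inodes(void)
{
        pthread_mutex_lock(&s_inode_list_lock);     /* outermost first */
        pthread_mutex_lock(&i_lock);
        pthread_mutex_lock(&lru_lock);              /* innermost last  */
        sections_run++;
        pthread_mutex_unlock(&lru_lock);
        pthread_mutex_unlock(&i_lock);
        pthread_mutex_unlock(&s_inode_list_lock);
}

static void touch_lru(void)
{
        /* Only needs the inner locks: still taken in hierarchy order,
         * never the reverse, so it can't deadlock against the walker. */
        pthread_mutex_lock(&i_lock);
        pthread_mutex_lock(&lru_lock);
        sections_run++;
        pthread_mutex_unlock(&lru_lock);
        pthread_mutex_unlock(&i_lock);
}

int lock_order_demo(void)
{
        walk_sb_inodes();
        touch_lru();
        return sections_run;
}
```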

There's nothing like this in the scheduler code that I can find that
explains the expected overall ordering/serialisation mechanisms and
relationships used in the subsystem. Hence when I come across
something that doesn't appear to make sense, there's nothing I can
refer to that would explain whether it is a bug, whether the
comment is wrong, or whether I've just completely misunderstood what
the comment is referring to.

Put simply: the little details are great, but they aren't sufficient
by themselves to understand the relationships being maintained
between the various objects.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS
  2020-07-01  2:26           ` Dave Chinner
@ 2020-07-01  8:02             ` Peter Zijlstra
  2020-07-01 11:06               ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2020-07-01  8:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-xfs, Ingo Molnar

On Wed, Jul 01, 2020 at 12:26:46PM +1000, Dave Chinner wrote:

> There's nothing like this in the scheduler code that I can find that
> explains the expected overall ordering/serialisation mechanisms and
> relationships used in the subsystem. Hence when I come across
> something that doesn't appear to make sense, there's nothing I can
> refer to that would explain whether it is a bug, whether the
> comment is wrong, or whether I've just completely misunderstood what
> the comment is referring to.
> 
> Put simply: the little details are great, but they aren't sufficient
> by themselves to understand the relationships being maintained
> between the various objects.

You're absolutely right, we lack that. As a very first draft / brain
dump, I wrote the below, I'll try and polish it up and add to it over
the next few days.


---
 kernel/sched/core.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 75 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d3d2d67f398..568f7ade9a09 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -77,6 +77,74 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+
+/*
+ * Serialization rules:
+ *
+ * Lock order:
+ *
+ *   p->pi_lock
+ *     rq->lock
+ *       hrtimer_clock_base::cpu_base->lock
+ *
+ *  rq1->lock
+ *    rq2->lock  where: &rq1 < &rq2
+ *
+ * Regular state:
+ *
+ * Normal scheduling state is serialized by rq->lock. __schedule() takes the
+ * local CPU's rq->lock, it optionally removes the task from the runqueue and
+ * always looks at the local rq data structures to find the most eligible task
+ * to run next.
+ *
+ * Task enqueue is also under rq->lock, possibly taken from another CPU.
+ * Wakeups from another LLC domain might use an IPI to transfer the enqueue
+ * to the local CPU to avoid bouncing the runqueue state around.
+ *
+ * Task wakeup, specifically wakeups that involve migration, are horribly
+ * complicated to avoid having to take two rq->locks.
+ *
+ * Special state:
+ *
+ * p->state    <- TASK_*
+ * p->on_cpu   <- { 0, 1 }
+ * p->on_rq    <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }
+ * task_cpu(p) <- set_task_cpu()
+ *
+ * System-calls and anything external will use task_rq_lock() which acquires
+ * both p->lock and rq->lock. As a consequence things like p->cpus_ptr are
+ * stable while holding either lock.
+ *
+ * p->state is changed locklessly using set_current_state(),
+ * __set_current_state() or set_special_state(), see their respective comments.
+ *
+ * p->on_cpu is set by prepare_task() and cleared by finish_task() such that it
+ * will be set before p is scheduled-in and cleared after p is scheduled-out,
+ * both under rq->lock. Non-zero indicates the task is 'current' on its CPU.
+ *
+ * p->on_rq is set by activate_task() and cleared by deactivate_task(), under
+ * rq->lock. Non-zero indicates the task is runnable, the special
+ * ON_RQ_MIGRATING state is used for migration without holding both rq->locks.
+ * It indicates task_cpu() is not stable, see *task_rq_lock().
+ *
+ * task_cpu(p) is changed by set_task_cpu(), the rules are intricate but basically:
+ *
+ *  - don't call set_task_cpu() on a blocked task
+ *
+ *  - for try_to_wake_up(), called under p->pi_lock
+ *
+ *  - for migration called under rq->lock:
+ *    o move_queued_task()
+ *    o __migrate_swap_task()
+ *    o detach_task()
+ *
+ *  - for migration called under double_rq_lock():
+ *    o push_rt_task() / pull_rt_task()
+ *    o push_dl_task() / pull_dl_task()
+ *    o dl_task_offline_migration()
+ *
+ */
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
@@ -1466,8 +1534,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 {
 	lockdep_assert_held(&rq->lock);
 
-	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
-	dequeue_task(rq, p, DEQUEUE_NOCLOCK);
+	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, new_cpu);
 	rq_unlock(rq, rf);
 
@@ -1475,8 +1542,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 
 	rq_lock(rq, rf);
 	BUG_ON(task_cpu(p) != new_cpu);
-	enqueue_task(rq, p, 0);
-	p->on_rq = TASK_ON_RQ_QUEUED;
+	activate_task(rq, p, 0);
 	check_preempt_curr(rq, p, 0);
 
 	return rq;
@@ -3134,8 +3200,12 @@ static inline void prepare_task(struct task_struct *next)
 	/*
 	 * Claim the task as running, we do this before switching to it
 	 * such that any running task will have this set.
+	 *
+	 * __schedule()'s rq->lock and smp_mb__after_spinlock() orders this
+	 * store against prior state change of p, also see try_to_wake_up(),
+	 * specifically smp_load_acquire(&p->on_cpu).
 	 */
-	next->on_cpu = 1;
+	WRITE_ONCE(next->on_cpu, 1);
 #endif
 }
 


* Re: [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS
  2020-07-01  8:02             ` Peter Zijlstra
@ 2020-07-01 11:06               ` Dave Chinner
  2020-07-01 13:42                 ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2020-07-01 11:06 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, linux-xfs, Ingo Molnar

On Wed, Jul 01, 2020 at 10:02:13AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2020 at 12:26:46PM +1000, Dave Chinner wrote:
> 
> > There's nothing like this in the scheduler code that I can find that
> > explains the expected overall ordering/serialisation mechanisms and
> > relationships used in the subsystem. Hence when I come across
> > something that doesn't appear to make sense, there's nothing I can
> > refer to that would explain whether it is a bug, whether the
> > comment is wrong, or whether I've just completely misunderstood what
> > the comment is referring to.
> > 
> > Put simply: the little details are great, but they aren't sufficient
> > by themselves to understand the relationships being maintained
> > between the various objects.
> 
> You're absolutely right, we lack that. As a very first draft / brain
> dump, I wrote the below, I'll try and polish it up and add to it over
> the next few days.

Excellent first pass, Peter. It explained a lot of little things I
was missing. :)

Comments in line below.

> 
> 
> ---
>  kernel/sched/core.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 75 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1d3d2d67f398..568f7ade9a09 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -77,6 +77,74 @@ __read_mostly int scheduler_running;
>   */
>  int sysctl_sched_rt_runtime = 950000;
>  
> +
> +/*
> + * Serialization rules:
> + *
> + * Lock order:
> + *
> + *   p->pi_lock
> + *     rq->lock
> + *       hrtimer_clock_base::cpu_base->lock

So these are accessed by the task_rq_lock() and rq_lock()
interfaces, right? And we take clock locks inside these because we
need to update timestamps/task clocks when tasks are first
scheduled?

What are all the rq_*pin_[un]lock() calls and how do they
factor into this?

> + *
> + *  rq1->lock
> + *    rq2->lock  where: &rq1 < &rq2
> + *

Ok, I'm guessing this is when task migration is being performed?

> + * Regular state:
> + *
> + * Normal scheduling state is serialized by rq->lock. __schedule() takes the
> + * local CPU's rq->lock, it optionally removes the task from the runqueue and
> + * always looks at the local rq data structures to find the most eligible task
> + * to run next.
> + *
> + * Task enqueue is also under rq->lock, possibly taken from another CPU.
> + * Wakeups from another LLC domain might use an IPI to transfer the enqueue
> + * to the local CPU to avoid bouncing the runqueue state around.

That's ttwu_queue_wakelist()?

I'm a little confused as to where ttwu_remote() fits into remote
wakeups, though. That piece of the puzzle hasn't dropped into place
yet...

> + * Task wakeup, specifically wakeups that involve migration, are horribly
> + * complicated to avoid having to take two rq->locks.
> + *
> + * Special state:
> + *
> + * p->state    <- TASK_*
> + * p->on_cpu   <- { 0, 1 }
> + * p->on_rq    <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }
> + * task_cpu(p) <- set_task_cpu()
> + *
> + * System-calls and anything external will use task_rq_lock() which acquires
> + * both p->lock and rq->lock. As a consequence things like p->cpus_ptr are
> + * stable while holding either lock.

OK. Might be worthwhile iterating the objects that have this "stable
with one lock, modified under both locks" structures...

> + *
> + * p->state is changed locklessly using set_current_state(),
> + * __set_current_state() or set_special_state(), see their respective comments.

/me goes code spelunking

Ok, so those comments talk about "using a barrier" to order p->state
changes against external logic to avoid wakeup races, not about how
they are ordered or used internally. 

But it also mentions that the barrier matches with a full barrier in
wake_up_state() and that calls straight into ttwu(). Ah, so that's
what this "CONDITION" is - it's the external wakeup checks!

        /*                                                                       
         * If we are going to wake up a thread waiting for CONDITION we          
         * need to ensure that CONDITION=1 done by the caller can not be         
         * reordered with p->state check below. This pairs with mb() in          
         * set_current_state() the waiting thread does.                          
         */                                                                      
        raw_spin_lock_irqsave(&p->pi_lock, flags);                               
        smp_mb__after_spinlock();                                                
        if (!(p->state & state))                                                 
                goto unlock;

My brain now connects this straight to well known task sleep/wakeup
code patterns, and this bit now makes a lot more sense....

> + * p->on_cpu is set by prepare_task() and cleared by finish_task() such that it
> + * will be set before p is scheduled-in and cleared after p is scheduled-out,
> + * both under rq->lock. Non-zero indicates the task is 'current' on its CPU.
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This. 1000x this. That's exactly the sort of basic, high level
relationship I've been unable to pull out of the code. Forest,
trees, all that jive. Lots of disconnected fragments fell into place
when I read that.

Thank you! :)

> + * p->on_rq is set by activate_task() and cleared by deactivate_task(), under
> + * rq->lock. Non-zero indicates the task is runnable, the special
> + * ON_RQ_MIGRATING state is used for migration without holding both rq->locks.
> + * It indicates task_cpu() is not stable, see *task_rq_lock().
> + *
> + * task_cpu(p) is changed by set_task_cpu(), the rules are intricate but basically:
> + *
> + *  - don't call set_task_cpu() on a blocked task

why? It's good to document the rule, but I haven't been able to find
an obvious explanation of why this rule exists....

> + *
> + *  - for try_to_wake_up(), called under p->pi_lock

What's the reason for it being under different locks if it's already
in runnable state? i.e. the lock that needs to be held is spelled
out on set_task_cpu():

        * The caller should hold either p->pi_lock or rq->lock, when changing
	* a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.

But there's no explanation there or here of why the different locks
need to be used or how the two different situations don't
overlap/contend/race because they aren't holding the same locks. So
while I know the rule exists, I don't understand the relationship
or logic that the rule emerges from.

> + *
> + *  - for migration called under rq->lock:
> + *    o move_queued_task()
> + *    o __migrate_swap_task()
> + *    o detach_task()
> + *
> + *  - for migration called under double_rq_lock():
> + *    o push_rt_task() / pull_rt_task()
> + *    o push_dl_task() / pull_dl_task()
> + *    o dl_task_offline_migration()

I've not dug deeply enough into the migration code in recent times
to be able to comment sanely on these...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS
  2020-07-01 11:06               ` Dave Chinner
@ 2020-07-01 13:42                 ` Peter Zijlstra
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2020-07-01 13:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-xfs, Ingo Molnar

On Wed, Jul 01, 2020 at 09:06:44PM +1000, Dave Chinner wrote:

> > +/*
> > + * Serialization rules:
> > + *
> > + * Lock order:
> > + *
> > + *   p->pi_lock
> > + *     rq->lock
> > + *       hrtimer_clock_base::cpu_base->lock
> 
> So these are accessed by the task_rq_lock() and rq_lock()
> interfaces, right? And we take clock locks inside these because we
> need to update timestamps/task clocks when tasks are first
> scheduled?

That's the hrtimer_start(), which we use for bandwidth control.
sched_clock() is lockless.

> What are all the rq_*pin_[un]lock() calls and how do they
> factor into this?

Ah, that's a lockdep annotation to ensure the rq->lock doesn't get
unlocked by accident. We've had a bunch of races over the years where
code accidentally unlocked rq->lock.

Basically lockdep_pin_lock() allows you to 'pin' the lock and it will
return a cookie. Any unlock of a pinned lock will generate a warning.
lockdep_unpin_lock() requires the previously provided cookie, or again
it will complain.

Not sure where to put words on that, but I'll find something.
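In miniature, the pin/cookie idea looks something like this (a
hypothetical userspace sketch of the shape, not the lockdep
implementation):

```c
#include <assert.h>
#include <pthread.h>

/* Miniature of the lockdep pin idea: pinning a held lock hands back
 * a cookie; releasing the lock while pinned, or unpinning with the
 * wrong cookie, trips an assertion instead of silently letting the
 * lock go. */
struct pinnable_lock {
        pthread_mutex_t m;
        unsigned int cookie;            /* 0 == not pinned */
};

static unsigned int next_cookie = 1;

static unsigned int lock_pin(struct pinnable_lock *l)
{
        assert(l->cookie == 0);         /* can't double-pin        */
        return l->cookie = next_cookie++;
}

static void lock_unpin(struct pinnable_lock *l, unsigned int cookie)
{
        assert(l->cookie == cookie);    /* wrong cookie: complain  */
        l->cookie = 0;
}

static void lock_release(struct pinnable_lock *l)
{
        assert(l->cookie == 0);         /* unlock while pinned: complain */
        pthread_mutex_unlock(&l->m);
}

int pin_demo(void)
{
        struct pinnable_lock l = { PTHREAD_MUTEX_INITIALIZER, 0 };
        unsigned int c;

        pthread_mutex_lock(&l.m);
        c = lock_pin(&l);               /* the lock may not drop now */
        lock_unpin(&l, c);              /* cookie checks out         */
        lock_release(&l);               /* only legal when unpinned  */
        return 0;
}
```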

> > + *
> > + *  rq1->lock
> > + *    rq2->lock  where: &rq1 < &rq2
> > + *
> 
> Ok, I'm guessing this is when task migration is being performed?

Yeah.

> > + * Regular state:
> > + *
> > + * Normal scheduling state is serialized by rq->lock. __schedule() takes the
> > + * local CPU's rq->lock, it optionally removes the task from the runqueue and
> > + * always looks at the local rq data structures to find the most eligible task
> > + * to run next.
> > + *
> > + * Task enqueue is also under rq->lock, possibly taken from another CPU.
> > + * Wakeups from another LLC domain might use an IPI to transfer the enqueue
> > + * to the local CPU to avoid bouncing the runqueue state around.
> 
> That's ttwu_queue_wakelist()?

Yes, added a reference to that.

> I'm a little confused as to where ttwu_remote() fits into remote
> wakeups, though. That piece of the puzzle hasn't dropped into place
> yet...

Ah, so try_to_wake_up() is about another task/context changing our
p->state.

Assume:

  for (;;) {
	set_current_state(TASK_UNINTERRUPTIBLE);

	if (COND)
		break;

	schedule();
  }
  __set_current_state(TASK_RUNNING);

Now, suppose another CPU/IRQ/etc.. does a wakeup between
set_current_task() and schedule(). At that point our @p is still a
running/runnable task, ttwu_remote() deals with this case.

I'll improve the ttwu_remote() comment.
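That wait loop, in runnable userspace form with a condvar standing in
for schedule()/wakeup (names made up for illustration):

```c
#include <assert.h>
#include <pthread.h>

/* Userspace shape of the sleep/wakeup pattern: the waiter re-checks
 * COND under the lock, so a wakeup landing between the check and the
 * sleep (the ttwu_remote() window in the kernel) cannot be lost. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int COND;

static void *waker(void *arg)
{
        pthread_mutex_lock(&lock);
        COND = 1;                       /* make the condition true... */
        pthread_cond_signal(&cond);     /* ...then wake the waiter    */
        pthread_mutex_unlock(&lock);
        return NULL;
}

int wait_wake_demo(void)
{
        pthread_t t;

        pthread_create(&t, NULL, waker, NULL);
        pthread_mutex_lock(&lock);
        while (!COND)                   /* the for(;;) re-check loop  */
                pthread_cond_wait(&cond, &lock);   /* "schedule()"    */
        pthread_mutex_unlock(&lock);
        pthread_join(t, NULL);
        return COND;
}
```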

> > + * Task wakeup, specifically wakeups that involve migration, are horribly
> > + * complicated to avoid having to take two rq->locks.
> > + *
> > + * Special state:
> > + *
> > + * p->state    <- TASK_*
> > + * p->on_cpu   <- { 0, 1 }
> > + * p->on_rq    <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }
> > + * task_cpu(p) <- set_task_cpu()
> > + *
> > + * System-calls and anything external will use task_rq_lock() which acquires
> > + * both p->lock and rq->lock. As a consequence things like p->cpus_ptr are
> > + * stable while holding either lock.
> 
> OK. Might be worthwhile iterating the objects that have this "stable
> with one lock, modified under both locks" structures...

Agreed, that was on the todo list. I called out p->cpus_ptr because
that's directly used in the ttwu() code, but there's more. I'll dig them
all out.

> > + *
> > + * p->state is changed locklessly using set_current_state(),
> > + * __set_current_state() or set_special_state(), see their respective comments.
> 
> /me goes code spelunking
> 
> Ok, so those comments talk about "using a barrier" to order p->state
> changes against external logic to avoid wakeup races, not about how
> they are ordered or used internally.

Correct. Although we also make internal use of these memory barriers, us
being frugal folk :-) You'll find it referenced in the ttwu() ordering
comments, also some architectural code (switch_to()) relies on it.

> But it also mentions that the barrier matches with a full barrier in
> wake_up_state() and that calls straight into ttwu(). Ah, so that's
> what this "CONDITION" is - it's the external wakeup checks!
>
>         /*
>          * If we are going to wake up a thread waiting for CONDITION we
>          * need to ensure that CONDITION=1 done by the caller can not be
>          * reordered with p->state check below. This pairs with mb() in
>          * set_current_state() the waiting thread does.
>          */
>         raw_spin_lock_irqsave(&p->pi_lock, flags);
>         smp_mb__after_spinlock();
>         if (!(p->state & state))
>                 goto unlock;
> 
> My brain now connects this straight to well known task sleep/wakeup
> code patterns, and this bit now makes a lot more sense....

Just so, I'll see if I can make the comments with
set_current_state()/try_to_wake_up()/__schedule() more consistent with
one another. They're all trying to describe much the same as they
co-operate to make that sleep/wakeup pattern work.

> > + * p->on_rq is set by activate_task() and cleared by deactivate_task(), under
> > + * rq->lock. Non-zero indicates the task is runnable, the special
> > + * ON_RQ_MIGRATING state is used for migration without holding both rq->locks.
> > + * It indicates task_cpu() is not stable, see *task_rq_lock().
> > + *
> > + * task_cpu(p) is changed by set_task_cpu(), the rules are intricate but basically:
> > + *
> > + *  - don't call set_task_cpu() on a blocked task
> 
> why? It's good to document the rule, but I haven't been able to find
> an obvious explanation of why this rule exists....

Ah, I forgot if there was a technical reason for it, but the conceptual
reason is that if a task isn't running, its CPU assignment is
irrelevant: it doesn't matter what CPU you're not running on.

So I created rules where we only care about the CPU assignment for
runnable tasks. I'll add this.

(it makes hotplug easier, we don't even have to care whether the CPU
assignment is possible)

> > + *
> > + *  - for try_to_wake_up(), called under p->pi_lock
> 
> What's the reason for it being under different locks if it's already
> in runnable state? i.e. the lock that needs to be held is spelled
> out on set_task_cpu():
> 
>         * The caller should hold either p->pi_lock or rq->lock, when changing
> 	* a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
> 
> But there's no explanation there or here of why the different locks
> need to be used or how the two different situations don't
> overlap/contend/race because they aren't holding the same locks. So
> while I know the rule exists, I don't understand the relationship
> or logic that the rule emerges from.

There's a hint here:

+ * Task wakeup, specifically wakeups that involve migration, are horribly
+ * complicated to avoid having to take two rq->locks.

So ttwu() used to first acquire the 'old' rq->lock, then do
CPU-selection and if new-cpu != old-cpu (a fairly common case) do
set_task_cpu(), drop the old rq->lock, acquire new rq->lock and do the
enqueue. Or something along those lines.

It was found that the initial rq->lock in case of migration added
significantly to the rq->lock contention for a number of workloads, so
we made things complicated...

I'll go explain it better.

Thanks Dave!


end of thread, other threads:[~2020-07-01 13:42 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-26  0:47 [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS Dave Chinner
2020-06-26  5:57 ` Dave Chinner
2020-06-26  7:33 ` Peter Zijlstra
2020-06-26 22:32   ` Dave Chinner
2020-06-27 18:30     ` Peter Zijlstra
2020-06-29 23:55       ` Dave Chinner
2020-06-30  8:57         ` Peter Zijlstra
2020-07-01  2:26           ` Dave Chinner
2020-07-01  8:02             ` Peter Zijlstra
2020-07-01 11:06               ` Dave Chinner
2020-07-01 13:42                 ` Peter Zijlstra
