linux-kernel.vger.kernel.org archive mirror
* [PATCH] block: fix a crash in do_task_dead()
@ 2019-05-29 20:25 Qian Cai
  2019-05-29 20:31 ` Jens Axboe
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Qian Cai @ 2019-05-29 20:25 UTC (permalink / raw)
  To: axboe
  Cc: akpm, hch, peterz, oleg, gkohli, mingo, linux-mm, linux-kernel, Qian Cai

The commit 0619317ff8ba ("block: add polled wakeup task helper")
replaced wake_up_process() with blk_wake_io_task() in
end_swap_bio_read() which triggers a crash when running heavy swapping
workloads.

[T114538] kernel BUG at kernel/sched/core.c:3462!
[T114538] Process oom01 (pid: 114538, stack limit = 0x000000004f40e0c1)
[T114538] Call trace:
[T114538]  do_task_dead+0xf0/0xf8
[T114538]  do_exit+0xd5c/0x10fc
[T114538]  do_group_exit+0xf4/0x110
[T114538]  get_signal+0x280/0xdd8
[T114538]  do_notify_resume+0x720/0x968
[T114538]  work_pending+0x8/0x10

This is because, shortly after set_special_state(TASK_DEAD),
end_swap_bio_read() is called from an interrupt handler. It revives the
task state to TASK_RUNNING, which causes __schedule() to return and trip
the BUG() later.

[  C206] Call trace:
[  C206]  dump_backtrace+0x0/0x268
[  C206]  show_stack+0x20/0x2c
[  C206]  dump_stack+0xb4/0x108
[  C206]  blk_wake_io_task+0x7c/0x80
[  C206]  end_swap_bio_read+0x22c/0x31c
[  C206]  bio_endio+0x3d8/0x414
[  C206]  dec_pending+0x280/0x378 [dm_mod]
[  C206]  clone_endio+0x128/0x2ac [dm_mod]
[  C206]  bio_endio+0x3d8/0x414
[  C206]  blk_update_request+0x3ac/0x924
[  C206]  scsi_end_request+0x54/0x350
[  C206]  scsi_io_completion+0xf0/0x6f4
[  C206]  scsi_finish_command+0x214/0x228
[  C206]  scsi_softirq_done+0x170/0x1a4
[  C206]  blk_done_softirq+0x100/0x194
[  C206]  __do_softirq+0x350/0x790
[  C206]  irq_exit+0x200/0x26c
[  C206]  handle_IPI+0x2e8/0x514
[  C206]  gic_handle_irq+0x224/0x228
[  C206]  el1_irq+0xb8/0x140
[  C206]  _raw_spin_unlock_irqrestore+0x3c/0x74
[  C206]  do_task_dead+0x88/0xf8
[  C206]  do_exit+0xd5c/0x10fc
[  C206]  do_group_exit+0xf4/0x110
[  C206]  get_signal+0x280/0xdd8
[  C206]  do_notify_resume+0x720/0x968
[  C206]  work_pending+0x8/0x10

Before the offending commit, wake_up_process() would prevent this from
happening by taking pi_lock and bailing out immediately if TASK_DEAD is
set.

if (!(p->state & TASK_NORMAL))
	goto out;

Fix it by calling wake_up_process() when in a non-task context.

Fixes: 0619317ff8ba ("block: add polled wakeup task helper")
Signed-off-by: Qian Cai <cai@lca.pw>
---
 include/linux/blkdev.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 592669bcc536..290eb7528f54 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1803,7 +1803,7 @@ static inline void blk_wake_io_task(struct task_struct *waiter)
 	 * that case, we don't need to signal a wakeup, it's enough to just
 	 * mark us as RUNNING.
 	 */
-	if (waiter == current)
+	if (waiter == current && in_task())
 		__set_current_state(TASK_RUNNING);
 	else
 		wake_up_process(waiter);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-05-29 20:25 [PATCH] block: fix a crash in do_task_dead() Qian Cai
@ 2019-05-29 20:31 ` Jens Axboe
  2019-05-30  8:03 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2019-05-29 20:31 UTC (permalink / raw)
  To: Qian Cai; +Cc: akpm, hch, peterz, oleg, gkohli, mingo, linux-mm, linux-kernel

On 5/29/19 2:25 PM, Qian Cai wrote:
> The commit 0619317ff8ba ("block: add polled wakeup task helper")
> replaced wake_up_process() with blk_wake_io_task() in
> end_swap_bio_read() which triggers a crash when running heavy swapping
> workloads.
> 
> [T114538] kernel BUG at kernel/sched/core.c:3462!
> [T114538] Process oom01 (pid: 114538, stack limit = 0x000000004f40e0c1)
> [T114538] Call trace:
> [T114538]  do_task_dead+0xf0/0xf8
> [T114538]  do_exit+0xd5c/0x10fc
> [T114538]  do_group_exit+0xf4/0x110
> [T114538]  get_signal+0x280/0xdd8
> [T114538]  do_notify_resume+0x720/0x968
> [T114538]  work_pending+0x8/0x10
> 
> This is because, shortly after set_special_state(TASK_DEAD),
> end_swap_bio_read() is called from an interrupt handler. It revives the
> task state to TASK_RUNNING, which causes __schedule() to return and trip
> the BUG() later.
> 
> [  C206] Call trace:
> [  C206]  dump_backtrace+0x0/0x268
> [  C206]  show_stack+0x20/0x2c
> [  C206]  dump_stack+0xb4/0x108
> [  C206]  blk_wake_io_task+0x7c/0x80
> [  C206]  end_swap_bio_read+0x22c/0x31c
> [  C206]  bio_endio+0x3d8/0x414
> [  C206]  dec_pending+0x280/0x378 [dm_mod]
> [  C206]  clone_endio+0x128/0x2ac [dm_mod]
> [  C206]  bio_endio+0x3d8/0x414
> [  C206]  blk_update_request+0x3ac/0x924
> [  C206]  scsi_end_request+0x54/0x350
> [  C206]  scsi_io_completion+0xf0/0x6f4
> [  C206]  scsi_finish_command+0x214/0x228
> [  C206]  scsi_softirq_done+0x170/0x1a4
> [  C206]  blk_done_softirq+0x100/0x194
> [  C206]  __do_softirq+0x350/0x790
> [  C206]  irq_exit+0x200/0x26c
> [  C206]  handle_IPI+0x2e8/0x514
> [  C206]  gic_handle_irq+0x224/0x228
> [  C206]  el1_irq+0xb8/0x140
> [  C206]  _raw_spin_unlock_irqrestore+0x3c/0x74
> [  C206]  do_task_dead+0x88/0xf8
> [  C206]  do_exit+0xd5c/0x10fc
> [  C206]  do_group_exit+0xf4/0x110
> [  C206]  get_signal+0x280/0xdd8
> [  C206]  do_notify_resume+0x720/0x968
> [  C206]  work_pending+0x8/0x10
> 
> Before the offending commit, wake_up_process() would prevent this from
> happening by taking pi_lock and bailing out immediately if TASK_DEAD is
> set.
> 
> if (!(p->state & TASK_NORMAL))
> 	goto out;
> 
> Fix it by calling wake_up_process() when in a non-task context.

I like this one a lot better than the previous fix. Unless folks
object, I'll queue this up for 5.2, thanks.

-- 
Jens Axboe



* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-05-29 20:25 [PATCH] block: fix a crash in do_task_dead() Qian Cai
  2019-05-29 20:31 ` Jens Axboe
@ 2019-05-30  8:03 ` Peter Zijlstra
  2019-05-31 21:12   ` Jens Axboe
  2019-05-30 11:15 ` Oleg Nesterov
  2019-07-04 16:03 ` [PATCH] swap_readpage: avoid blk_wake_io_task() if !synchronous Oleg Nesterov
  3 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2019-05-30  8:03 UTC (permalink / raw)
  To: Qian Cai; +Cc: axboe, akpm, hch, oleg, gkohli, mingo, linux-mm, linux-kernel

On Wed, May 29, 2019 at 04:25:26PM -0400, Qian Cai wrote:

> Fixes: 0619317ff8ba ("block: add polled wakeup task helper")

What is the purpose of that patch ?! The Changelog doesn't mention any
benefit or performance gain. So why not revert that?

> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  include/linux/blkdev.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 592669bcc536..290eb7528f54 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1803,7 +1803,7 @@ static inline void blk_wake_io_task(struct task_struct *waiter)
>  	 * that case, we don't need to signal a wakeup, it's enough to just
>  	 * mark us as RUNNING.
>  	 */
> -	if (waiter == current)
> +	if (waiter == current && in_task())
>  		__set_current_state(TASK_RUNNING);

NAK, No that's broken too.

The right fix is something like:

	if (waiter == current) {
		barrier();
		if (current->state & TASK_NORAL)
			__set_current_state(TASK_RUNNING);
	}

But even that is yuck to do outside of the scheduler code, as it loses
tracepoints and stats.

So can we please just revert that original patch and start over -- if
needed?

>  	else
>  		wake_up_process(waiter);
> -- 
> 1.8.3.1
> 


* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-05-29 20:25 [PATCH] block: fix a crash in do_task_dead() Qian Cai
  2019-05-29 20:31 ` Jens Axboe
  2019-05-30  8:03 ` Peter Zijlstra
@ 2019-05-30 11:15 ` Oleg Nesterov
  2019-05-31 21:10   ` Jens Axboe
  2019-07-04 16:03 ` [PATCH] swap_readpage: avoid blk_wake_io_task() if !synchronous Oleg Nesterov
  3 siblings, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2019-05-30 11:15 UTC (permalink / raw)
  To: Qian Cai; +Cc: axboe, akpm, hch, peterz, gkohli, mingo, linux-mm, linux-kernel

On 05/29, Qian Cai wrote:
>
> The commit 0619317ff8ba ("block: add polled wakeup task helper")
> replaced wake_up_process() with blk_wake_io_task() in
> end_swap_bio_read() which triggers a crash when running heavy swapping
> workloads.
> 
> [T114538] kernel BUG at kernel/sched/core.c:3462!
> [T114538] Process oom01 (pid: 114538, stack limit = 0x000000004f40e0c1)
> [T114538] Call trace:
> [T114538]  do_task_dead+0xf0/0xf8
> [T114538]  do_exit+0xd5c/0x10fc
> [T114538]  do_group_exit+0xf4/0x110
> [T114538]  get_signal+0x280/0xdd8
> [T114538]  do_notify_resume+0x720/0x968
> [T114538]  work_pending+0x8/0x10
> 
> This is because, shortly after set_special_state(TASK_DEAD),
> end_swap_bio_read() is called from an interrupt handler. It revives the
> task state to TASK_RUNNING, which causes __schedule() to return and trip
> the BUG() later.
> 
> [  C206] Call trace:
> [  C206]  dump_backtrace+0x0/0x268
> [  C206]  show_stack+0x20/0x2c
> [  C206]  dump_stack+0xb4/0x108
> [  C206]  blk_wake_io_task+0x7c/0x80
> [  C206]  end_swap_bio_read+0x22c/0x31c
> [  C206]  bio_endio+0x3d8/0x414
> [  C206]  dec_pending+0x280/0x378 [dm_mod]
> [  C206]  clone_endio+0x128/0x2ac [dm_mod]
> [  C206]  bio_endio+0x3d8/0x414
> [  C206]  blk_update_request+0x3ac/0x924
> [  C206]  scsi_end_request+0x54/0x350
> [  C206]  scsi_io_completion+0xf0/0x6f4
> [  C206]  scsi_finish_command+0x214/0x228
> [  C206]  scsi_softirq_done+0x170/0x1a4
> [  C206]  blk_done_softirq+0x100/0x194
> [  C206]  __do_softirq+0x350/0x790
> [  C206]  irq_exit+0x200/0x26c
> [  C206]  handle_IPI+0x2e8/0x514
> [  C206]  gic_handle_irq+0x224/0x228
> [  C206]  el1_irq+0xb8/0x140
> [  C206]  _raw_spin_unlock_irqrestore+0x3c/0x74
> [  C206]  do_task_dead+0x88/0xf8
> [  C206]  do_exit+0xd5c/0x10fc
> [  C206]  do_group_exit+0xf4/0x110
> [  C206]  get_signal+0x280/0xdd8
> [  C206]  do_notify_resume+0x720/0x968
> [  C206]  work_pending+0x8/0x10
> 
> Before the offending commit, wake_up_process() would prevent this from
> happening by taking pi_lock and bailing out immediately if TASK_DEAD is
> set.
> 
> if (!(p->state & TASK_NORMAL))
> 	goto out;

I don't understand this code at all, but I am just curious: can we do
something like the incomplete patch below?

Oleg.

--- x/mm/page_io.c
+++ x/mm/page_io.c
@@ -140,8 +140,10 @@ int swap_readpage(struct page *page, bool synchronous)
 	unlock_page(page);
 	WRITE_ONCE(bio->bi_private, NULL);
 	bio_put(bio);
-	blk_wake_io_task(waiter);
-	put_task_struct(waiter);
+	if (waiter) {
+		blk_wake_io_task(waiter);
+		put_task_struct(waiter);
+	}
 }
 
 int generic_swapfile_activate(struct swap_info_struct *sis,
@@ -398,11 +400,12 @@ int swap_readpage(struct page *page, boo
 	 * Keep this task valid during swap readpage because the oom killer may
 	 * attempt to access it in the page fault retry time check.
 	 */
-	get_task_struct(current);
-	bio->bi_private = current;
 	bio_set_op_attrs(bio, REQ_OP_READ, 0);
-	if (synchronous)
+	if (synchronous) {
 		bio->bi_opf |= REQ_HIPRI;
+		get_task_struct(current);
+		bio->bi_private = current;
+	}
 	count_vm_event(PSWPIN);
 	bio_get(bio);
 	qc = submit_bio(bio);



* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-05-30 11:15 ` Oleg Nesterov
@ 2019-05-31 21:10   ` Jens Axboe
  0 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2019-05-31 21:10 UTC (permalink / raw)
  To: Oleg Nesterov, Qian Cai
  Cc: akpm, hch, peterz, gkohli, mingo, linux-mm, linux-kernel

On 5/30/19 5:15 AM, Oleg Nesterov wrote:
> On 05/29, Qian Cai wrote:
>>
>> The commit 0619317ff8ba ("block: add polled wakeup task helper")
>> replaced wake_up_process() with blk_wake_io_task() in
>> end_swap_bio_read() which triggers a crash when running heavy swapping
>> workloads.
>>
>> [T114538] kernel BUG at kernel/sched/core.c:3462!
>> [T114538] Process oom01 (pid: 114538, stack limit = 0x000000004f40e0c1)
>> [T114538] Call trace:
>> [T114538]  do_task_dead+0xf0/0xf8
>> [T114538]  do_exit+0xd5c/0x10fc
>> [T114538]  do_group_exit+0xf4/0x110
>> [T114538]  get_signal+0x280/0xdd8
>> [T114538]  do_notify_resume+0x720/0x968
>> [T114538]  work_pending+0x8/0x10
>>
>> This is because, shortly after set_special_state(TASK_DEAD),
>> end_swap_bio_read() is called from an interrupt handler. It revives the
>> task state to TASK_RUNNING, which causes __schedule() to return and trip
>> the BUG() later.
>>
>> [  C206] Call trace:
>> [  C206]  dump_backtrace+0x0/0x268
>> [  C206]  show_stack+0x20/0x2c
>> [  C206]  dump_stack+0xb4/0x108
>> [  C206]  blk_wake_io_task+0x7c/0x80
>> [  C206]  end_swap_bio_read+0x22c/0x31c
>> [  C206]  bio_endio+0x3d8/0x414
>> [  C206]  dec_pending+0x280/0x378 [dm_mod]
>> [  C206]  clone_endio+0x128/0x2ac [dm_mod]
>> [  C206]  bio_endio+0x3d8/0x414
>> [  C206]  blk_update_request+0x3ac/0x924
>> [  C206]  scsi_end_request+0x54/0x350
>> [  C206]  scsi_io_completion+0xf0/0x6f4
>> [  C206]  scsi_finish_command+0x214/0x228
>> [  C206]  scsi_softirq_done+0x170/0x1a4
>> [  C206]  blk_done_softirq+0x100/0x194
>> [  C206]  __do_softirq+0x350/0x790
>> [  C206]  irq_exit+0x200/0x26c
>> [  C206]  handle_IPI+0x2e8/0x514
>> [  C206]  gic_handle_irq+0x224/0x228
>> [  C206]  el1_irq+0xb8/0x140
>> [  C206]  _raw_spin_unlock_irqrestore+0x3c/0x74
>> [  C206]  do_task_dead+0x88/0xf8
>> [  C206]  do_exit+0xd5c/0x10fc
>> [  C206]  do_group_exit+0xf4/0x110
>> [  C206]  get_signal+0x280/0xdd8
>> [  C206]  do_notify_resume+0x720/0x968
>> [  C206]  work_pending+0x8/0x10
>>
>> Before the offending commit, wake_up_process() would prevent this from
>> happening by taking pi_lock and bailing out immediately if TASK_DEAD is
>> set.
>>
>> if (!(p->state & TASK_NORMAL))
>> 	goto out;
> 
> I don't understand this code at all, but I am just curious: can we do
> something like the incomplete patch below?
> 
> Oleg.
> 
> --- x/mm/page_io.c
> +++ x/mm/page_io.c
> @@ -140,8 +140,10 @@ int swap_readpage(struct page *page, bool synchronous)
>   	unlock_page(page);
>   	WRITE_ONCE(bio->bi_private, NULL);
>   	bio_put(bio);
> -	blk_wake_io_task(waiter);
> -	put_task_struct(waiter);
> +	if (waiter) {
> +		blk_wake_io_task(waiter);
> +		put_task_struct(waiter);
> +	}
>   }
>   
>   int generic_swapfile_activate(struct swap_info_struct *sis,
> @@ -398,11 +400,12 @@ int swap_readpage(struct page *page, boo
>   	 * Keep this task valid during swap readpage because the oom killer may
>   	 * attempt to access it in the page fault retry time check.
>   	 */
> -	get_task_struct(current);
> -	bio->bi_private = current;
>   	bio_set_op_attrs(bio, REQ_OP_READ, 0);
> -	if (synchronous)
> +	if (synchronous) {
>   		bio->bi_opf |= REQ_HIPRI;
> +		get_task_struct(current);
> +		bio->bi_private = current;
> +	}
>   	count_vm_event(PSWPIN);
>   	bio_get(bio);
>   	qc = submit_bio(bio);

I think this would solve it for swap.

-- 
Jens Axboe



* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-05-30  8:03 ` Peter Zijlstra
@ 2019-05-31 21:12   ` Jens Axboe
  2019-06-03 12:37     ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2019-05-31 21:12 UTC (permalink / raw)
  To: Peter Zijlstra, Qian Cai
  Cc: akpm, hch, oleg, gkohli, mingo, linux-mm, linux-kernel

On 5/30/19 2:03 AM, Peter Zijlstra wrote:
> On Wed, May 29, 2019 at 04:25:26PM -0400, Qian Cai wrote:
> 
>> Fixes: 0619317ff8ba ("block: add polled wakeup task helper")
> 
> What is the purpose of that patch ?! The Changelog doesn't mention any
> benefit or performance gain. So why not revert that?

Yeah that is actually pretty weak. There are substantial performance
gains for small IOs using this trick, the changelog should have
included those. I guess that was left on the list...

>> Signed-off-by: Qian Cai <cai@lca.pw>
>> ---
>>   include/linux/blkdev.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 592669bcc536..290eb7528f54 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1803,7 +1803,7 @@ static inline void blk_wake_io_task(struct task_struct *waiter)
>>   	 * that case, we don't need to signal a wakeup, it's enough to just
>>   	 * mark us as RUNNING.
>>   	 */
>> -	if (waiter == current)
>> +	if (waiter == current && in_task())
>>   		__set_current_state(TASK_RUNNING);
> 
> NAK, No that's broken too.
> 
> The right fix is something like:
> 
> 	if (waiter == current) {
> 		barrier();
> 		if (current->state & TASK_NORAL)
> 			__set_current_state(TASK_RUNNING);
> 	}
> 
> But even that is yuck to do outside of the scheduler code, as it loses
> tracepoints and stats.
> 
> So can we please just revert that original patch and start over -- if
> needed?

How about we just use your above approach? It looks fine to me (except
the obvious typo). I'd hate to give up this gain; in the realm of
fighting against stupid kernel offload solutions we need every cycle we
can get.

I know it's not super kosher, your patch, but I don't think it's that
bad hidden in a generic helper.

-- 
Jens Axboe



* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-05-31 21:12   ` Jens Axboe
@ 2019-06-03 12:37     ` Peter Zijlstra
  2019-06-03 12:44       ` Peter Zijlstra
                         ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Peter Zijlstra @ 2019-06-03 12:37 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Qian Cai, akpm, hch, oleg, gkohli, mingo, linux-mm, linux-kernel

On Fri, May 31, 2019 at 03:12:13PM -0600, Jens Axboe wrote:
> On 5/30/19 2:03 AM, Peter Zijlstra wrote:

> > What is the purpose of that patch ?! The Changelog doesn't mention any
> > benefit or performance gain. So why not revert that?
> 
> Yeah that is actually pretty weak. There are substantial performance
> gains for small IOs using this trick, the changelog should have
> included those. I guess that was left on the list...

OK. I've looked at the try_to_wake_up() path for these exact
conditions and we're certainly sub-optimal there, and I think we can put
much of this special case in there. Please see below.

> I know it's not super kosher, your patch, but I don't think it's that
> bad hidden in a generic helper.

How about the thing that Oleg proposed? That is, not setting a waiter
when we know the loop is polling? That would avoid the need for this
altogether, and it would also avoid any set_current_state() on the wait
side of things.

Anyway, Oleg, do you see anything blatantly buggered with this patch?

(the stats were already dodgy for rq-stats, this patch makes them dodgy
for task-stats too)

---
 kernel/sched/core.c | 38 ++++++++++++++++++++++++++++++++------
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 102dfcf0a29a..474aa4c8e9d2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1990,6 +1990,28 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	unsigned long flags;
 	int cpu, success = 0;
 
+	if (p == current) {
+		/*
+		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
+		 * == smp_processor_id()'. Together this means we can special
+		 * case the whole 'p->on_rq && ttwu_remote()' case below
+		 * without taking any locks.
+		 *
+		 * In particular:
+		 *  - we rely on Program-Order guarantees for all the ordering,
+		 *  - we're serialized against set_special_state() by virtue of
+		 *    it disabling IRQs (this allows not taking ->pi_lock).
+		 */
+		if (!(p->state & state))
+			goto out;
+
+		success = 1;
+		trace_sched_waking(p);
+		p->state = TASK_RUNNING;
+		trace_sched_woken(p);
+		goto out;
+	}
+
 	/*
 	 * If we are going to wake up a thread waiting for CONDITION we
 	 * need to ensure that CONDITION=1 done by the caller can not be
@@ -1999,7 +2021,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	smp_mb__after_spinlock();
 	if (!(p->state & state))
-		goto out;
+		goto unlock;
 
 	trace_sched_waking(p);
 
@@ -2029,7 +2051,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_rmb();
 	if (p->on_rq && ttwu_remote(p, wake_flags))
-		goto stat;
+		goto unlock;
 
 #ifdef CONFIG_SMP
 	/*
@@ -2089,12 +2111,16 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);
-stat:
-	ttwu_stat(p, cpu, wake_flags);
-out:
+unlock:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
-	return success;
+out:
+	if (success) {
+		ttwu_stat(p, cpu, wake_flags);
+		return true;
+	}
+
+	return false;
 }
 
 /**


* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-06-03 12:37     ` Peter Zijlstra
@ 2019-06-03 12:44       ` Peter Zijlstra
  2019-06-03 16:09         ` Oleg Nesterov
  2019-06-03 16:23       ` Jens Axboe
  2019-06-05 15:04       ` Jens Axboe
  2 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2019-06-03 12:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Qian Cai, akpm, hch, oleg, gkohli, mingo, linux-mm, linux-kernel

On Mon, Jun 03, 2019 at 02:37:05PM +0200, Peter Zijlstra wrote:

> Anyway, Oleg, do you see anything blatantly buggered with this patch?
> 
> (the stats were already dodgy for rq-stats, this patch makes them dodgy
> for task-stats too)

It now also has concurrency on wakeup; but afaict that's harmless, we'll
get racing stores of p->state = TASK_RUNNING, much the same as if there
was a remote wakeup vs a wait-loop terminating early.

I suppose the tracepoint consumers might have to deal with some
artifacts there, but that's their problem.

> ---
>  kernel/sched/core.c | 38 ++++++++++++++++++++++++++++++++------
>  1 file changed, 32 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 102dfcf0a29a..474aa4c8e9d2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1990,6 +1990,28 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	unsigned long flags;
>  	int cpu, success = 0;
>  
> +	if (p == current) {
> +		/*
> +		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
> +		 * == smp_processor_id()'. Together this means we can special
> +		 * case the whole 'p->on_rq && ttwu_remote()' case below
> +		 * without taking any locks.
> +		 *
> +		 * In particular:
> +		 *  - we rely on Program-Order guarantees for all the ordering,
> +		 *  - we're serialized against set_special_state() by virtue of
> +		 *    it disabling IRQs (this allows not taking ->pi_lock).
> +		 */
> +		if (!(p->state & state))
> +			goto out;
> +
> +		success = 1;
> +		trace_sched_waking(p);
> +		p->state = TASK_RUNNING;
> +		trace_sched_woken(p);
> +		goto out;
> +	}
> +
>  	/*
>  	 * If we are going to wake up a thread waiting for CONDITION we
>  	 * need to ensure that CONDITION=1 done by the caller can not be
> @@ -1999,7 +2021,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
>  	smp_mb__after_spinlock();
>  	if (!(p->state & state))
> -		goto out;
> +		goto unlock;
>  
>  	trace_sched_waking(p);
>  
> @@ -2029,7 +2051,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	 */
>  	smp_rmb();
>  	if (p->on_rq && ttwu_remote(p, wake_flags))
> -		goto stat;
> +		goto unlock;
>  
>  #ifdef CONFIG_SMP
>  	/*
> @@ -2089,12 +2111,16 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  #endif /* CONFIG_SMP */
>  
>  	ttwu_queue(p, cpu, wake_flags);
> -stat:
> -	ttwu_stat(p, cpu, wake_flags);
> -out:
> +unlock:
>  	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>  
> -	return success;
> +out:
> +	if (success) {
> +		ttwu_stat(p, cpu, wake_flags);
> +		return true;
> +	}
> +
> +	return false;
>  }
>  
>  /**


* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-06-03 12:44       ` Peter Zijlstra
@ 2019-06-03 16:09         ` Oleg Nesterov
  2019-06-03 16:19           ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2019-06-03 16:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, Qian Cai, akpm, hch, gkohli, mingo, linux-mm, linux-kernel

On 06/03, Peter Zijlstra wrote:
>
> It now also has concurrency on wakeup; but afaict that's harmless, we'll
> get racing stores of p->state = TASK_RUNNING, much the same as if there
> was a remote wakeup vs a wait-loop terminating early.
>
> I suppose the tracepoint consumers might have to deal with some
> artifacts there, but that's their problem.

I guess you mean that trace_sched_waking/wakeup can be reported twice if
try_to_wake_up(current) races with ttwu_remote(). And ttwu_stat().

> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1990,6 +1990,28 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >  	unsigned long flags;
> >  	int cpu, success = 0;
> >  
> > +	if (p == current) {
> > +		/*
> > +		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
> > +		 * == smp_processor_id()'. Together this means we can special
> > +		 * case the whole 'p->on_rq && ttwu_remote()' case below
> > +		 * without taking any locks.
> > +		 *
> > +		 * In particular:
> > +		 *  - we rely on Program-Order guarantees for all the ordering,
> > +		 *  - we're serialized against set_special_state() by virtue of
> > +		 *    it disabling IRQs (this allows not taking ->pi_lock).
> > +		 */
> > +		if (!(p->state & state))
> > +			goto out;
> > +
> > +		success = 1;
> > +		trace_sched_waking(p);
> > +		p->state = TASK_RUNNING;
> > +		trace_sched_woken(p);
                ^^^^^^^^^^^^^^^^^
trace_sched_wakeup(p) ?

I see nothing wrong... but probably this is because I don't fully understand
this change. In particular, I don't really understand who else can benefit from
this optimization...

Oleg.



* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-06-03 16:09         ` Oleg Nesterov
@ 2019-06-03 16:19           ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2019-06-03 16:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jens Axboe, Qian Cai, akpm, hch, gkohli, mingo, linux-mm, linux-kernel

On Mon, Jun 03, 2019 at 06:09:53PM +0200, Oleg Nesterov wrote:
> On 06/03, Peter Zijlstra wrote:
> >
> > It now also has concurrency on wakeup; but afaict that's harmless, we'll
> > get racing stores of p->state = TASK_RUNNING, much the same as if there
> > was a remote wakeup vs a wait-loop terminating early.
> >
> > I suppose the tracepoint consumers might have to deal with some
> > artifacts there, but that's their problem.
> 
> I guess you mean that trace_sched_waking/wakeup can be reported twice if
> try_to_wake_up(current) races with ttwu_remote(). And ttwu_stat().

Right, one local one remote, and you get them things twice.

> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -1990,6 +1990,28 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > >  	unsigned long flags;
> > >  	int cpu, success = 0;
> > >  
> > > +	if (p == current) {
> > > +		/*
> > > +		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
> > > +		 * == smp_processor_id()'. Together this means we can special
> > > +		 * case the whole 'p->on_rq && ttwu_remote()' case below
> > > +		 * without taking any locks.
> > > +		 *
> > > +		 * In particular:
> > > +		 *  - we rely on Program-Order guarantees for all the ordering,
> > > +		 *  - we're serialized against set_special_state() by virtue of
> > > +		 *    it disabling IRQs (this allows not taking ->pi_lock).
> > > +		 */
> > > +		if (!(p->state & state))
> > > +			goto out;
> > > +
> > > +		success = 1;
> > > +		trace_sched_waking(p);
> > > +		p->state = TASK_RUNNING;
> > > +		trace_sched_woken(p);
>                 ^^^^^^^^^^^^^^^^^
> trace_sched_wakeup(p) ?

Uhm, yah.

> I see nothing wrong... but probably this is because I don't fully understand
> this change. In particular, I don't really understand who else can benefit from
> this optimization...

Pretty much every wait-loop, where the wakeup happens from IRQ context
on the same CPU, before we've hit schedule().

Now, I've no idea if that's many, but I much prefer to keep this magic
inside try_to_wake_up() than spreading it around.


* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-06-03 12:37     ` Peter Zijlstra
  2019-06-03 12:44       ` Peter Zijlstra
@ 2019-06-03 16:23       ` Jens Axboe
  2019-06-05 15:04       ` Jens Axboe
  2 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2019-06-03 16:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Qian Cai, akpm, hch, oleg, gkohli, mingo, linux-mm, linux-kernel

On 6/3/19 6:37 AM, Peter Zijlstra wrote:
> On Fri, May 31, 2019 at 03:12:13PM -0600, Jens Axboe wrote:
>> On 5/30/19 2:03 AM, Peter Zijlstra wrote:
> 
>>> What is the purpose of that patch ?! The Changelog doesn't mention any
>>> benefit or performance gain. So why not revert that?
>>
>> Yeah that is actually pretty weak. There are substantial performance
>> gains for small IOs using this trick, the changelog should have
>> included those. I guess that was left on the list...
> 
> OK. I've looked at the try_to_wake_up() path for these exact
> conditions and we're certainly sub-optimal there, and I think we can put
> much of this special case in there. Please see below.
> 
>> I know it's not super kosher, your patch, but I don't think it's that
>> bad hidden in a generic helper.
> 
> How about the thing that Oleg proposed? That is, not set a waiter when
> we know the loop is polling? That would avoid the need for this
> alltogether, it would also avoid any set_current_state() on the wait
> side of things.
> 
> Anyway, Oleg, do you see anything blatantly buggered with this patch?
> 
> (the stats were already dodgy for rq-stats, this patch makes them dodgy
> for task-stats too)
> 
> ---
>   kernel/sched/core.c | 38 ++++++++++++++++++++++++++++++++------
>   1 file changed, 32 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 102dfcf0a29a..474aa4c8e9d2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1990,6 +1990,28 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>   	unsigned long flags;
>   	int cpu, success = 0;
>   
> +	if (p == current) {
> +		/*
> +		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
> +		 * == smp_processor_id()'. Together this means we can special
> +		 * case the whole 'p->on_rq && ttwu_remote()' case below
> +		 * without taking any locks.
> +		 *
> +		 * In particular:
> +		 *  - we rely on Program-Order guarantees for all the ordering,
> +		 *  - we're serialized against set_special_state() by virtue of
> +		 *    it disabling IRQs (this allows not taking ->pi_lock).
> +		 */
> +		if (!(p->state & state))
> +			goto out;
> +
> +		success = 1;
> +		trace_sched_waking(p);
> +		p->state = TASK_RUNNING;
> +		trace_sched_woken(p);
> +		goto out;
> +	}
> +
>   	/*
>   	 * If we are going to wake up a thread waiting for CONDITION we
>   	 * need to ensure that CONDITION=1 done by the caller can not be
> @@ -1999,7 +2021,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>   	raw_spin_lock_irqsave(&p->pi_lock, flags);
>   	smp_mb__after_spinlock();
>   	if (!(p->state & state))
> -		goto out;
> +		goto unlock;
>   
>   	trace_sched_waking(p);
>   
> @@ -2029,7 +2051,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>   	 */
>   	smp_rmb();
>   	if (p->on_rq && ttwu_remote(p, wake_flags))
> -		goto stat;
> +		goto unlock;
>   
>   #ifdef CONFIG_SMP
>   	/*
> @@ -2089,12 +2111,16 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>   #endif /* CONFIG_SMP */
>   
>   	ttwu_queue(p, cpu, wake_flags);
> -stat:
> -	ttwu_stat(p, cpu, wake_flags);
> -out:
> +unlock:
>   	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>   
> -	return success;
> +out:
> +	if (success) {
> +		ttwu_stat(p, cpu, wake_flags);
> +		return true;
> +	}
> +
> +	return false;
>   }
>   
>   /**

Let me run some tests with this vs mainline vs blk wakeup hack removed.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-06-03 12:37     ` Peter Zijlstra
  2019-06-03 12:44       ` Peter Zijlstra
  2019-06-03 16:23       ` Jens Axboe
@ 2019-06-05 15:04       ` Jens Axboe
  2019-06-07 13:35         ` Peter Zijlstra
  2019-06-30 23:06         ` Hugh Dickins
  2 siblings, 2 replies; 28+ messages in thread
From: Jens Axboe @ 2019-06-05 15:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Qian Cai, akpm, hch, oleg, gkohli, mingo, linux-mm, linux-kernel

On 6/3/19 6:37 AM, Peter Zijlstra wrote:
> On Fri, May 31, 2019 at 03:12:13PM -0600, Jens Axboe wrote:
>> On 5/30/19 2:03 AM, Peter Zijlstra wrote:
> 
>>> What is the purpose of that patch ?! The Changelog doesn't mention any
>>> benefit or performance gain. So why not revert that?
>>
>> Yeah that is actually pretty weak. There are substantial performance
>> gains for small IOs using this trick, the changelog should have
>> included those. I guess that was left on the list...
> 
> OK. I've looked at the try_to_wake_up() path for these exact
> conditions and we're certainly sub-optimal there, and I think we can put
> much of this special case in there. Please see below.
> 
>> I know it's not super kosher, your patch, but I don't think it's that
>> bad hidden in a generic helper.
> 
> How about the thing that Oleg proposed? That is, not set a waiter when
> we know the loop is polling? That would avoid the need for this
> altogether; it would also avoid any set_current_state() on the wait
> side of things.
> 
> Anyway, Oleg, do you see anything blatantly buggered with this patch?
> 
> (the stats were already dodgy for rq-stats, this patch makes them dodgy
> for task-stats too)

Tested this patch, looks good to me. Made the trace change to make it
compile, and also moved the cpu = task_cpu() assignment earlier to
avoid uninitialized use of that variable.

How about the following plan - if folks are happy with this sched patch,
we can queue it up for 5.3. Once that is in, I'll kill the block change
that special cases the polled task wakeup. For 5.2, we go with Oleg's
patch for the swap case.

-- 
Jens Axboe



* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-06-05 15:04       ` Jens Axboe
@ 2019-06-07 13:35         ` Peter Zijlstra
  2019-06-07 14:23           ` Peter Zijlstra
  2019-06-30 23:06         ` Hugh Dickins
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2019-06-07 13:35 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Qian Cai, akpm, hch, oleg, gkohli, mingo, linux-mm, linux-kernel

On Wed, Jun 05, 2019 at 09:04:02AM -0600, Jens Axboe wrote:
> How about the following plan - if folks are happy with this sched patch,
> we can queue it up for 5.3. Once that is in, I'll kill the block change
> that special cases the polled task wakeup. For 5.2, we go with Oleg's
> patch for the swap case.

OK, works for me. I'll go write a proper patch.

Thanks!


* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-06-07 13:35         ` Peter Zijlstra
@ 2019-06-07 14:23           ` Peter Zijlstra
  2019-06-08  8:39             ` Jens Axboe
  2019-06-10 13:13             ` Gaurav Kohli
  0 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2019-06-07 14:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Qian Cai, akpm, hch, oleg, gkohli, mingo, linux-mm, linux-kernel

On Fri, Jun 07, 2019 at 03:35:41PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 05, 2019 at 09:04:02AM -0600, Jens Axboe wrote:
> > How about the following plan - if folks are happy with this sched patch,
> > we can queue it up for 5.3. Once that is in, I'll kill the block change
> > that special cases the polled task wakeup. For 5.2, we go with Oleg's
> > patch for the swap case.
> 
> OK, works for me. I'll go write a proper patch.

I now have the below; I'll queue that after the long weekend and let
0-day chew on it for a while and then push it out to tip or something.


---
Subject: sched: Optimize try_to_wake_up() for local wakeups
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 7 15:39:49 CEST 2019

Jens reported that significant performance can be had on some block
workloads (XXX numbers?) by special casing local wakeups. That is,
wakeups on the current task before it schedules out. Given something
like the normal wait pattern:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);

		if (cond)
			break;

		schedule();
	}
	__set_current_state(TASK_RUNNING);

Any wakeup (on this CPU) after set_current_state() and before
schedule() would benefit from this.

Normal wakeups take p->pi_lock, which serializes wakeups to the same
task. By eliding that we gain concurrency on:

 - ttwu_stat(); we already had concurrency on rq stats, this now also
   brings it to task stats. -ENOCARE

 - tracepoints; it is now possible to get multiple instances of
   trace_sched_waking() (and possibly trace_sched_wakeup()) for the
   same task. Tracers will have to learn to cope.

Furthermore, p->pi_lock is used by set_special_state(), to order
against TASK_RUNNING stores from other CPUs. But since this is
strictly CPU local, we don't need the lock, and set_special_state()'s
disabling of IRQs is sufficient.

After the normal wakeup takes p->pi_lock it issues
smp_mb__after_spinlock(), in order to ensure the woken task must
observe prior stores before we observe the p->state. If this is CPU
local, this will be satisfied with a compiler barrier, and we rely on
try_to_wake_up() being a function call, which implies such.

Since, when 'p == current', 'p->on_rq' must be true, the normal wakeup
would continue into the ttwu_remote() branch, which normally is
concerned with exactly this wakeup scenario, except from a remote CPU.
IOW we're waking a task that is still running. In this case, we can
trivially avoid taking rq->lock, all that's left from this is to set
p->state.

This then yields an extremely simple and fast path for 'p == current'.
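
The fast path described here can be sketched in userspace. This is only an
illustrative model; the TASK_* values, the struct and the helper below are
stand-ins for the kernel's definitions, not the real ones:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's task state bits. */
#define TASK_RUNNING		0x0000
#define TASK_INTERRUPTIBLE	0x0001
#define TASK_UNINTERRUPTIBLE	0x0002
#define TASK_NORMAL		(TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
#define TASK_DEAD		0x0080

struct toy_task { unsigned int state; };

/*
 * Model of the 'p == current' branch: no locks, just the state-mask
 * check and the TASK_RUNNING store. The return value mirrors
 * try_to_wake_up()'s 'success'.
 */
static bool toy_local_wakeup(struct toy_task *p, unsigned int state)
{
	if (!(p->state & state))
		return false;

	p->state = TASK_RUNNING;
	return true;
}
```

A wakeup aimed at TASK_NORMAL leaves a TASK_DEAD task untouched, which is
why the lock-free path cannot revive a task that has already done
set_special_state(TASK_DEAD).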

Cc: Qian Cai <cai@lca.pw>
Cc: mingo@redhat.com
Cc: akpm@linux-foundation.org
Cc: hch@lst.de
Cc: gkohli@codeaurora.org
Cc: oleg@redhat.com
Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1991,6 +1991,28 @@ try_to_wake_up(struct task_struct *p, un
 	unsigned long flags;
 	int cpu, success = 0;
 
+	if (p == current) {
+		/*
+		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
+		 * == smp_processor_id()'. Together this means we can special
+		 * case the whole 'p->on_rq && ttwu_remote()' case below
+		 * without taking any locks.
+		 *
+		 * In particular:
+		 *  - we rely on Program-Order guarantees for all the ordering,
+		 *  - we're serialized against set_special_state() by virtue of
+		 *    it disabling IRQs (this allows not taking ->pi_lock).
+		 */
+		if (!(p->state & state))
+			return false;
+
+		success = 1;
+		trace_sched_waking(p);
+		p->state = TASK_RUNNING;
+		trace_sched_wakeup(p);
+		goto out;
+	}
+
 	/*
 	 * If we are going to wake up a thread waiting for CONDITION we
 	 * need to ensure that CONDITION=1 done by the caller can not be
@@ -2000,7 +2022,7 @@ try_to_wake_up(struct task_struct *p, un
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	smp_mb__after_spinlock();
 	if (!(p->state & state))
-		goto out;
+		goto unlock;
 
 	trace_sched_waking(p);
 
@@ -2030,7 +2052,7 @@ try_to_wake_up(struct task_struct *p, un
 	 */
 	smp_rmb();
 	if (p->on_rq && ttwu_remote(p, wake_flags))
-		goto stat;
+		goto unlock;
 
 #ifdef CONFIG_SMP
 	/*
@@ -2090,10 +2112,11 @@ try_to_wake_up(struct task_struct *p, un
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);
-stat:
-	ttwu_stat(p, cpu, wake_flags);
-out:
+unlock:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+out:
+	if (success)
+		ttwu_stat(p, cpu, wake_flags);
 
 	return success;
 }


* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-06-07 14:23           ` Peter Zijlstra
@ 2019-06-08  8:39             ` Jens Axboe
  2019-06-10 13:13             ` Gaurav Kohli
  1 sibling, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2019-06-08  8:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Qian Cai, akpm, hch, oleg, gkohli, mingo, linux-mm, linux-kernel

On 6/7/19 8:23 AM, Peter Zijlstra wrote:
> On Fri, Jun 07, 2019 at 03:35:41PM +0200, Peter Zijlstra wrote:
>> On Wed, Jun 05, 2019 at 09:04:02AM -0600, Jens Axboe wrote:
>>> How about the following plan - if folks are happy with this sched patch,
>>> we can queue it up for 5.3. Once that is in, I'll kill the block change
>>> that special cases the polled task wakeup. For 5.2, we go with Oleg's
>>> patch for the swap case.
>>
>> OK, works for me. I'll go write a proper patch.
> 
> I now have the below; I'll queue that after the long weekend and let
> 0-day chew on it for a while and then push it out to tip or something.

LGTM, thanks Peter!

-- 
Jens Axboe



* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-06-07 14:23           ` Peter Zijlstra
  2019-06-08  8:39             ` Jens Axboe
@ 2019-06-10 13:13             ` Gaurav Kohli
  2019-06-10 14:46               ` Oleg Nesterov
  1 sibling, 1 reply; 28+ messages in thread
From: Gaurav Kohli @ 2019-06-10 13:13 UTC (permalink / raw)
  To: Peter Zijlstra, Jens Axboe
  Cc: Qian Cai, akpm, hch, oleg, mingo, linux-mm, linux-kernel



On 6/7/2019 7:53 PM, Peter Zijlstra wrote:
> On Fri, Jun 07, 2019 at 03:35:41PM +0200, Peter Zijlstra wrote:
>> On Wed, Jun 05, 2019 at 09:04:02AM -0600, Jens Axboe wrote:
>>> How about the following plan - if folks are happy with this sched patch,
>>> we can queue it up for 5.3. Once that is in, I'll kill the block change
>>> that special cases the polled task wakeup. For 5.2, we go with Oleg's
>>> patch for the swap case.
>>
>> OK, works for me. I'll go write a proper patch.
> 
> I now have the below; I'll queue that after the long weekend and let
> 0-day chew on it for a while and then push it out to tip or something.
> 
> 
> ---
> Subject: sched: Optimize try_to_wake_up() for local wakeups
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Jun 7 15:39:49 CEST 2019
> 
> Jens reported that significant performance can be had on some block
> workloads (XXX numbers?) by special casing local wakeups. That is,
> wakeups on the current task before it schedules out. Given something
> like the normal wait pattern:
> 
> 	for (;;) {
> 		set_current_state(TASK_UNINTERRUPTIBLE);
> 
> 		if (cond)
> 			break;
> 
> 		schedule();
> 	}
> 	__set_current_state(TASK_RUNNING);
> 
> Any wakeup (on this CPU) after set_current_state() and before
> schedule() would benefit from this.
> 
> Normal wakeups take p->pi_lock, which serializes wakeups to the same
> task. By eliding that we gain concurrency on:
> 
>   - ttwu_stat(); we already had concurrency on rq stats, this now also
>     brings it to task stats. -ENOCARE
> 
>   - tracepoints; it is now possible to get multiple instances of
>     trace_sched_waking() (and possibly trace_sched_wakeup()) for the
>     same task. Tracers will have to learn to cope.
> 
> Furthermore, p->pi_lock is used by set_special_state(), to order
> against TASK_RUNNING stores from other CPUs. But since this is
> strictly CPU local, we don't need the lock, and set_special_state()'s
> disabling of IRQs is sufficient.
> 
> After the normal wakeup takes p->pi_lock it issues
> smp_mb__after_spinlock(), in order to ensure the woken task must
> observe prior stores before we observe the p->state. If this is CPU
> local, this will be satisfied with a compiler barrier, and we rely on
> try_to_wake_up() being a function call, which implies such.
> 
> Since, when 'p == current', 'p->on_rq' must be true, the normal wakeup
> would continue into the ttwu_remote() branch, which normally is
> concerned with exactly this wakeup scenario, except from a remote CPU.
> IOW we're waking a task that is still running. In this case, we can
> trivially avoid taking rq->lock, all that's left from this is to set
> p->state.
> 
> This then yields an extremely simple and fast path for 'p == current'.
> 
> Cc: Qian Cai <cai@lca.pw>
> Cc: mingo@redhat.com
> Cc: akpm@linux-foundation.org
> Cc: hch@lst.de
> Cc: gkohli@codeaurora.org
> Cc: oleg@redhat.com
> Reported-by: Jens Axboe <axboe@kernel.dk>
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>   kernel/sched/core.c |   33 ++++++++++++++++++++++++++++-----
>   1 file changed, 28 insertions(+), 5 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1991,6 +1991,28 @@ try_to_wake_up(struct task_struct *p, un
>   	unsigned long flags;
>   	int cpu, success = 0;
>   
> +	if (p == current) {
> +		/*
> +		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
> +		 * == smp_processor_id()'. Together this means we can special
> +		 * case the whole 'p->on_rq && ttwu_remote()' case below
> +		 * without taking any locks.
> +		 *
> +		 * In particular:
> +		 *  - we rely on Program-Order guarantees for all the ordering,
> +		 *  - we're serialized against set_special_state() by virtue of
> +		 *    it disabling IRQs (this allows not taking ->pi_lock).
> +		 */
> +		if (!(p->state & state))
> +			return false;
> +

Hi Peter, Jens,

As we are not taking pi_lock here, is there a possibility that a task-dead
call comes at this point in time for the current thread? Because of that we
have seen the earlier issue after commit 0619317ff8ba:
[T114538]  do_task_dead+0xf0/0xf8
[T114538]  do_exit+0xd5c/0x10fc
[T114538]  do_group_exit+0xf4/0x110
[T114538]  get_signal+0x280/0xdd8
[T114538]  do_notify_resume+0x720/0x968
[T114538]  work_pending+0x8/0x10

Is there a chance of TASK_DEAD being set at this point in time?


> +		success = 1;
> +		trace_sched_waking(p);
> +		p->state = TASK_RUNNING;
> +		trace_sched_wakeup(p);
> +		goto out;
> +	}
> +
>   	/*
>   	 * If we are going to wake up a thread waiting for CONDITION we
>   	 * need to ensure that CONDITION=1 done by the caller can not be
> @@ -2000,7 +2022,7 @@ try_to_wake_up(struct task_struct *p, un
>   	raw_spin_lock_irqsave(&p->pi_lock, flags);
>   	smp_mb__after_spinlock();
>   	if (!(p->state & state))
> -		goto out;
> +		goto unlock;
>   
>   	trace_sched_waking(p);
>   
> @@ -2030,7 +2052,7 @@ try_to_wake_up(struct task_struct *p, un
>   	 */
>   	smp_rmb();
>   	if (p->on_rq && ttwu_remote(p, wake_flags))
> -		goto stat;
> +		goto unlock;
>   
>   #ifdef CONFIG_SMP
>   	/*
> @@ -2090,10 +2112,11 @@ try_to_wake_up(struct task_struct *p, un
>   #endif /* CONFIG_SMP */
>   
>   	ttwu_queue(p, cpu, wake_flags);
> -stat:
> -	ttwu_stat(p, cpu, wake_flags);
> -out:
> +unlock:
>   	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> +out:
> +	if (success)
> +		ttwu_stat(p, cpu, wake_flags);
>   
>   	return success;
>   }
> 

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center,
Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-06-10 13:13             ` Gaurav Kohli
@ 2019-06-10 14:46               ` Oleg Nesterov
  2019-06-11  4:39                 ` Gaurav Kohli
  0 siblings, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2019-06-10 14:46 UTC (permalink / raw)
  To: Gaurav Kohli
  Cc: Peter Zijlstra, Jens Axboe, Qian Cai, akpm, hch, mingo, linux-mm,
	linux-kernel

On 06/10, Gaurav Kohli wrote:
>
> >@@ -1991,6 +1991,28 @@ try_to_wake_up(struct task_struct *p, un
> >  	unsigned long flags;
> >  	int cpu, success = 0;
> >+	if (p == current) {
> >+		/*
> >+		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
> >+		 * == smp_processor_id()'. Together this means we can special
> >+		 * case the whole 'p->on_rq && ttwu_remote()' case below
> >+		 * without taking any locks.
> >+		 *
> >+		 * In particular:
> >+		 *  - we rely on Program-Order guarantees for all the ordering,
> >+		 *  - we're serialized against set_special_state() by virtue of
> >+		 *    it disabling IRQs (this allows not taking ->pi_lock).
> >+		 */
> >+		if (!(p->state & state))
> >+			return false;
> >+
>
> Hi Peter, Jen,
>
> As we are not taking pi_lock here, is there a possibility that a task-dead
> call comes at this point in time for the current thread? Because of that we
> have seen the earlier issue after commit 0619317ff8ba:
> [T114538]  do_task_dead+0xf0/0xf8
> [T114538]  do_exit+0xd5c/0x10fc
> [T114538]  do_group_exit+0xf4/0x110
> [T114538]  get_signal+0x280/0xdd8
> [T114538]  do_notify_resume+0x720/0x968
> [T114538]  work_pending+0x8/0x10
>
> Is there a chance of TASK_DEAD being set at this point in time?

In this case try_to_wake_up(current, TASK_NORMAL) will do nothing, see the
if (!(p->state & state)) above.

See also the comment about set_special_state() above. It disables irqs and
this is enough to ensure that try_to_wake_up(current) from irq can't race
with set_special_state(TASK_DEAD).
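
As a toy model of this argument (the TASK_* values and helpers are
illustrative stand-ins, not the kernel's): because set_special_state() runs
with IRQs disabled, a same-CPU try_to_wake_up(current, TASK_NORMAL) from
interrupt context executes either entirely before or entirely after the
TASK_DEAD store, and in both orderings the task stays TASK_DEAD:

```c
#include <assert.h>

#define TASK_RUNNING		0x0000
#define TASK_INTERRUPTIBLE	0x0001
#define TASK_UNINTERRUPTIBLE	0x0002
#define TASK_NORMAL		(TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
#define TASK_DEAD		0x0080

struct toy_task { unsigned int state; };

/* try_to_wake_up(p, TASK_NORMAL) reduced to the state-mask check. */
static void toy_irq_wakeup(struct toy_task *p)
{
	if (p->state & TASK_NORMAL)
		p->state = TASK_RUNNING;
}

/* Ordering 1: the IRQ fires while the task is still TASK_RUNNING, before
 * set_special_state(TASK_DEAD). The wakeup is a no-op because
 * TASK_RUNNING & TASK_NORMAL == 0. */
static unsigned int toy_irq_before(void)
{
	struct toy_task t = { TASK_RUNNING };

	toy_irq_wakeup(&t);
	t.state = TASK_DEAD;	/* set_special_state(TASK_DEAD) */
	return t.state;
}

/* Ordering 2: the IRQ fires after set_special_state(TASK_DEAD). The wakeup
 * is again a no-op because TASK_DEAD & TASK_NORMAL == 0. */
static unsigned int toy_irq_after(void)
{
	struct toy_task t = { TASK_RUNNING };

	t.state = TASK_DEAD;	/* set_special_state(TASK_DEAD) */
	toy_irq_wakeup(&t);
	return t.state;
}
```

The interleaving that the IRQ disabling rules out is a TASK_RUNNING store
landing between set_special_state(TASK_DEAD) and __schedule(), which is
exactly the BUG() this thread started with.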

Oleg.



* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-06-10 14:46               ` Oleg Nesterov
@ 2019-06-11  4:39                 ` Gaurav Kohli
  0 siblings, 0 replies; 28+ messages in thread
From: Gaurav Kohli @ 2019-06-11  4:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Jens Axboe, Qian Cai, akpm, hch, mingo, linux-mm,
	linux-kernel


>>> +
>>
>> Hi Peter, Jens,
>>
>> As we are not taking pi_lock here, is there a possibility that a task-dead
>> call comes at this point in time for the current thread? Because of that we
>> have seen the earlier issue after commit 0619317ff8ba:
>> [T114538]  do_task_dead+0xf0/0xf8
>> [T114538]  do_exit+0xd5c/0x10fc
>> [T114538]  do_group_exit+0xf4/0x110
>> [T114538]  get_signal+0x280/0xdd8
>> [T114538]  do_notify_resume+0x720/0x968
>> [T114538]  work_pending+0x8/0x10
>>
>> Is there a chance of TASK_DEAD being set at this point in time?
> 
> In this case try_to_wake_up(current, TASK_NORMAL) will do nothing, see the
> if (!(p->state & state)) above.
> 
> See also the comment about set_special_state() above. It disables irqs and
> this is enough to ensure that try_to_wake_up(current) from irq can't race
> with set_special_state(TASK_DEAD).

Thanks Oleg,

I missed that part (both the thread and the interrupt are on the same
core), so that situation can never arise.
> 
> Oleg.
> 

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center,
Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-06-05 15:04       ` Jens Axboe
  2019-06-07 13:35         ` Peter Zijlstra
@ 2019-06-30 23:06         ` Hugh Dickins
  2019-07-01 14:22           ` Jens Axboe
  1 sibling, 1 reply; 28+ messages in thread
From: Hugh Dickins @ 2019-06-30 23:06 UTC (permalink / raw)
  To: Jens Axboe, Oleg Nesterov
  Cc: Peter Zijlstra, Qian Cai, akpm, hch, gkohli, mingo, linux-mm,
	linux-kernel

On Wed, 5 Jun 2019, Jens Axboe wrote:
> 
> How about the following plan - if folks are happy with this sched patch,
> we can queue it up for 5.3. Once that is in, I'll kill the block change
> that special cases the polled task wakeup. For 5.2, we go with Oleg's
> patch for the swap case.

I just hit the do_task_dead() kernel BUG at kernel/sched/core.c:3463!
while heavy swapping on 5.2-rc7: it looks like Oleg's patch intended
for 5.2 was not signed off, and got forgotten.

I did hit the do_task_dead() BUG (but not at all easily) on early -rcs
before seeing Oleg's patch, then folded it in and didn't hit the BUG
again; then just tried again without it, and luckily hit in a few hours.

So I can give it an enthusiastic
Acked-by: Hugh Dickins <hughd@google.com>
because it makes good sense to avoid the get/blk_wake/put overhead on
the asynch path anyway, even if it didn't work around a bug; but only
Half-Tested-by: Hugh Dickins <hughd@google.com>
since I have not been exercising the synchronous path at all.

Hugh, requoting Oleg:

> 
> I don't understand this code at all but I am just curious, can we do
> something like incomplete patch below ?
> 
> Oleg.
> 
> --- x/mm/page_io.c
> +++ x/mm/page_io.c
> @@ -140,8 +140,10 @@ int swap_readpage(struct page *page, bool synchronous)
>  	unlock_page(page);
>  	WRITE_ONCE(bio->bi_private, NULL);
>  	bio_put(bio);
> -	blk_wake_io_task(waiter);
> -	put_task_struct(waiter);
> +	if (waiter) {
> +		blk_wake_io_task(waiter);
> +		put_task_struct(waiter);
> +	}
>  }
>  
>  int generic_swapfile_activate(struct swap_info_struct *sis,
> @@ -398,11 +400,12 @@ int swap_readpage(struct page *page, boo
>  	 * Keep this task valid during swap readpage because the oom killer may
>  	 * attempt to access it in the page fault retry time check.
>  	 */
> -	get_task_struct(current);
> -	bio->bi_private = current;
>  	bio_set_op_attrs(bio, REQ_OP_READ, 0);
> -	if (synchronous)
> +	if (synchronous) {
>  		bio->bi_opf |= REQ_HIPRI;
> +		get_task_struct(current);
> +		bio->bi_private = current;
> +	}
>  	count_vm_event(PSWPIN);
>  	bio_get(bio);
>  	qc = submit_bio(bio);


* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-06-30 23:06         ` Hugh Dickins
@ 2019-07-01 14:22           ` Jens Axboe
  2019-07-02 22:06             ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2019-07-01 14:22 UTC (permalink / raw)
  To: Hugh Dickins, Oleg Nesterov
  Cc: Peter Zijlstra, Qian Cai, akpm, hch, gkohli, mingo, linux-mm,
	linux-kernel

On 6/30/19 5:06 PM, Hugh Dickins wrote:
> On Wed, 5 Jun 2019, Jens Axboe wrote:
>>
>> How about the following plan - if folks are happy with this sched patch,
>> we can queue it up for 5.3. Once that is in, I'll kill the block change
>> that special cases the polled task wakeup. For 5.2, we go with Oleg's
>> patch for the swap case.
> 
> I just hit the do_task_dead() kernel BUG at kernel/sched/core.c:3463!
> while heavy swapping on 5.2-rc7: it looks like Oleg's patch intended
> for 5.2 was not signed off, and got forgotten.
> 
> I did hit the do_task_dead() BUG (but not at all easily) on early -rcs
> before seeing Oleg's patch, then folded it in and and didn't hit the BUG
> again; then just tried again without it, and luckily hit in a few hours.
> 
> So I can give it an enthusiastic
> Acked-by: Hugh Dickins <hughd@google.com>
> because it makes good sense to avoid the get/blk_wake/put overhead on
> the asynch path anyway, even if it didn't work around a bug; but only
> Half-Tested-by: Hugh Dickins <hughd@google.com>
> since I have not been exercising the synchronous path at all.

I'll take the blame for that, went away on vacation for 3 weeks...
But yes, for 5.2, the patch from Oleg looks fine. Once Peter's other
change is in mainline, I'll go through and remove these special cases.

Andrew, can you queue Oleg's patch for 5.2? You can also add my:

Reviewed-by: Jens Axboe <axboe@kernel.dk>

to it.

> 
> Hugh, requoting Oleg:
> 
>>
>> I don't understand this code at all but I am just curious, can we do
>> something like incomplete patch below ?
>>
>> Oleg.
>>
>> --- x/mm/page_io.c
>> +++ x/mm/page_io.c
>> @@ -140,8 +140,10 @@ int swap_readpage(struct page *page, bool synchronous)
>>   	unlock_page(page);
>>   	WRITE_ONCE(bio->bi_private, NULL);
>>   	bio_put(bio);
>> -	blk_wake_io_task(waiter);
>> -	put_task_struct(waiter);
>> +	if (waiter) {
>> +		blk_wake_io_task(waiter);
>> +		put_task_struct(waiter);
>> +	}
>>   }
>>   
>>   int generic_swapfile_activate(struct swap_info_struct *sis,
>> @@ -398,11 +400,12 @@ int swap_readpage(struct page *page, boo
>>   	 * Keep this task valid during swap readpage because the oom killer may
>>   	 * attempt to access it in the page fault retry time check.
>>   	 */
>> -	get_task_struct(current);
>> -	bio->bi_private = current;
>>   	bio_set_op_attrs(bio, REQ_OP_READ, 0);
>> -	if (synchronous)
>> +	if (synchronous) {
>>   		bio->bi_opf |= REQ_HIPRI;
>> +		get_task_struct(current);
>> +		bio->bi_private = current;
>> +	}
>>   	count_vm_event(PSWPIN);
>>   	bio_get(bio);
>>   	qc = submit_bio(bio);


-- 
Jens Axboe



* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-07-01 14:22           ` Jens Axboe
@ 2019-07-02 22:06             ` Andrew Morton
  2019-07-03 17:35               ` Oleg Nesterov
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2019-07-02 22:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Hugh Dickins, Oleg Nesterov, Peter Zijlstra, Qian Cai, hch,
	gkohli, mingo, linux-mm, linux-kernel

On Mon, 1 Jul 2019 08:22:32 -0600 Jens Axboe <axboe@kernel.dk> wrote:

> Andrew, can you queue Oleg's patch for 5.2? You can also add my:
> 
> Reviewed-by: Jens Axboe <axboe@kernel.dk>

Sure.  Although things are a bit of a mess.  Oleg, can we please have a
clean resend with signoffs and acks, etc?



* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-07-02 22:06             ` Andrew Morton
@ 2019-07-03 17:35               ` Oleg Nesterov
  2019-07-03 17:44                 ` Hugh Dickins
  2019-07-03 17:52                 ` Jens Axboe
  0 siblings, 2 replies; 28+ messages in thread
From: Oleg Nesterov @ 2019-07-03 17:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Hugh Dickins, Peter Zijlstra, Qian Cai, hch, gkohli,
	mingo, linux-mm, linux-kernel

On 07/02, Andrew Morton wrote:

> On Mon, 1 Jul 2019 08:22:32 -0600 Jens Axboe <axboe@kernel.dk> wrote:
> 
> > Andrew, can you queue Oleg's patch for 5.2? You can also add my:
> > 
> > Reviewed-by: Jens Axboe <axboe@kernel.dk>
> 
> Sure.  Although things are a bit of a mess.  Oleg, can we please have a
> clean resend with signoffs and acks, etc?

OK, will do tomorrow. This cleanup can be improved, we can avoid get/put_task_struct
altogether, but need to recheck.

Oleg.



* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-07-03 17:35               ` Oleg Nesterov
@ 2019-07-03 17:44                 ` Hugh Dickins
  2019-07-04 16:00                   ` Oleg Nesterov
  2019-07-03 17:52                 ` Jens Axboe
  1 sibling, 1 reply; 28+ messages in thread
From: Hugh Dickins @ 2019-07-03 17:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Jens Axboe, Hugh Dickins, Peter Zijlstra,
	Qian Cai, hch, gkohli, mingo, linux-mm, linux-kernel

On Wed, 3 Jul 2019, Oleg Nesterov wrote:
> On 07/02, Andrew Morton wrote:
> > On Mon, 1 Jul 2019 08:22:32 -0600 Jens Axboe <axboe@kernel.dk> wrote:
> > 
> > > Andrew, can you queue Oleg's patch for 5.2? You can also add my:
> > > 
> > > Reviewed-by: Jens Axboe <axboe@kernel.dk>
> > 
> > Sure.  Although things are a bit of a mess.  Oleg, can we please have a
> > clean resend with signoffs and acks, etc?
> 
> OK, will do tomorrow. This cleanup can be improved, we can avoid get/put_task_struct
> altogether, but need to recheck.

Thank you, Oleg. But, with respect, I'd caution against making it cleverer
at the last minute: what you posted already is understandable, works, has
Jens' Reviewed-by and my Acked-by: it just lacks a description and signoff.

Hugh


* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-07-03 17:35               ` Oleg Nesterov
  2019-07-03 17:44                 ` Hugh Dickins
@ 2019-07-03 17:52                 ` Jens Axboe
  1 sibling, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2019-07-03 17:52 UTC (permalink / raw)
  To: Oleg Nesterov, Andrew Morton
  Cc: Hugh Dickins, Peter Zijlstra, Qian Cai, hch, gkohli, mingo,
	linux-mm, linux-kernel

On 7/3/19 11:35 AM, Oleg Nesterov wrote:
> On 07/02, Andrew Morton wrote:
> 
>> On Mon, 1 Jul 2019 08:22:32 -0600 Jens Axboe <axboe@kernel.dk> wrote:
>>
>>> Andrew, can you queue Oleg's patch for 5.2? You can also add my:
>>>
>>> Reviewed-by: Jens Axboe <axboe@kernel.dk>
>>
>> Sure.  Although things are a bit of a mess.  Oleg, can we please have a
>> clean resend with signoffs and acks, etc?
> 
> OK, will do tomorrow. This cleanup can be improved, we can avoid get/put_task_struct
> altogether, but need to recheck.

I'd just send it again as-is. We're removing the blk wakeup special case
anyway for 5.3.

-- 
Jens Axboe



* Re: [PATCH] block: fix a crash in do_task_dead()
  2019-07-03 17:44                 ` Hugh Dickins
@ 2019-07-04 16:00                   ` Oleg Nesterov
  0 siblings, 0 replies; 28+ messages in thread
From: Oleg Nesterov @ 2019-07-04 16:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Jens Axboe, Peter Zijlstra, Qian Cai, hch, gkohli,
	mingo, linux-mm, linux-kernel

On 07/03, Hugh Dickins wrote:
>
> Thank you, Oleg. But, with respect, I'd caution against making it cleverer
> at the last minute: what you posted already is understandable, works, has
> Jens' Reviewed-by and my Acked-by: it just lacks a description and signoff.

OK, agreed. I am sending that patch as-is with your and Jens' acks applied.

Oleg.



* [PATCH] swap_readpage: avoid blk_wake_io_task() if !synchronous
  2019-05-29 20:25 [PATCH] block: fix a crash in do_task_dead() Qian Cai
                   ` (2 preceding siblings ...)
  2019-05-30 11:15 ` Oleg Nesterov
@ 2019-07-04 16:03 ` Oleg Nesterov
  2019-07-04 19:32   ` Andrew Morton
  3 siblings, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2019-07-04 16:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Qian Cai, axboe, hch, peterz, gkohli, mingo, linux-mm,
	linux-kernel, Hugh Dickins

swap_readpage() sets waiter = bio->bi_private even if synchronous is false;
this means that the caller can get a spurious wakeup after return. This
can be fatal if blk_wake_io_task() does set_current_state(TASK_RUNNING)
after the caller does set_special_state(), in the worst case the kernel
can crash in do_task_dead().
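
A toy model of this race, with illustrative structs and TASK_* values
(toy_wake_io_task() mimics the waiter == current branch of
blk_wake_io_task(), which stores TASK_RUNNING unconditionally): when the
waiter is published even for asynchronous reads, a completion that runs
after set_special_state(TASK_DEAD) clobbers the state; leaving bi_private
NULL for the !synchronous case makes the completion a no-op:

```c
#include <assert.h>
#include <stddef.h>

#define TASK_RUNNING	0x0000
#define TASK_DEAD	0x0080

struct toy_task { unsigned int state; };
struct toy_bio { struct toy_task *bi_private; };

/* Model of blk_wake_io_task() on the polling task itself: the state is
 * overwritten without checking what it currently is. */
static void toy_wake_io_task(struct toy_task *waiter)
{
	waiter->state = TASK_RUNNING;
}

static void toy_end_swap_bio_read(struct toy_bio *bio)
{
	struct toy_task *waiter = bio->bi_private;

	if (waiter)	/* the fix: asynchronous reads leave this NULL */
		toy_wake_io_task(waiter);
}

/* The completion fires after the task has set TASK_DEAD; returns the
 * task's final state. */
static unsigned int toy_final_state(int publish_waiter)
{
	struct toy_task t = { TASK_RUNNING };
	struct toy_bio bio = { publish_waiter ? &t : NULL };

	t.state = TASK_DEAD;		/* set_special_state(TASK_DEAD) */
	toy_end_swap_bio_read(&bio);	/* async I/O completion from IRQ */
	return t.state;
}
```

With the waiter published (the old behaviour) TASK_DEAD is silently
replaced by TASK_RUNNING, letting __schedule() return into the BUG() in
do_task_dead(); with it left NULL the dead state survives.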

Reported-by: Qian Cai <cai@lca.pw>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/page_io.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 2e8019d..3098895 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -140,8 +140,10 @@ static void end_swap_bio_read(struct bio *bio)
 	unlock_page(page);
 	WRITE_ONCE(bio->bi_private, NULL);
 	bio_put(bio);
-	blk_wake_io_task(waiter);
-	put_task_struct(waiter);
+	if (waiter) {
+		blk_wake_io_task(waiter);
+		put_task_struct(waiter);
+	}
 }
 
 int generic_swapfile_activate(struct swap_info_struct *sis,
@@ -398,11 +400,12 @@ int swap_readpage(struct page *page, bool synchronous)
 	 * Keep this task valid during swap readpage because the oom killer may
 	 * attempt to access it in the page fault retry time check.
 	 */
-	get_task_struct(current);
-	bio->bi_private = current;
 	bio_set_op_attrs(bio, REQ_OP_READ, 0);
-	if (synchronous)
+	if (synchronous) {
 		bio->bi_opf |= REQ_HIPRI;
+		get_task_struct(current);
+		bio->bi_private = current;
+	}
 	count_vm_event(PSWPIN);
 	bio_get(bio);
 	qc = submit_bio(bio);
-- 
2.5.0



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] swap_readpage: avoid blk_wake_io_task() if !synchronous
  2019-07-04 16:03 ` [PATCH] swap_readpage: avoid blk_wake_io_task() if !synchronous Oleg Nesterov
@ 2019-07-04 19:32   ` Andrew Morton
  2019-07-04 21:15     ` Hugh Dickins
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2019-07-04 19:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Qian Cai, axboe, hch, peterz, gkohli, mingo, linux-mm,
	linux-kernel, Hugh Dickins

On Thu, 4 Jul 2019 18:03:01 +0200 Oleg Nesterov <oleg@redhat.com> wrote:

> swap_readpage() sets bio->bi_private = current even if synchronous is
> false, which means the caller can get a spurious wakeup after return.
> This can be fatal if blk_wake_io_task() does
> set_current_state(TASK_RUNNING) after the caller has done
> set_special_state(); in the worst case the kernel can crash in
> do_task_dead().

I think we need a Fixes: and a cc:stable here?

IIRC, we're fixing 0619317ff8baa2 ("block: add polled wakeup task helper").

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] swap_readpage: avoid blk_wake_io_task() if !synchronous
  2019-07-04 19:32   ` Andrew Morton
@ 2019-07-04 21:15     ` Hugh Dickins
  0 siblings, 0 replies; 28+ messages in thread
From: Hugh Dickins @ 2019-07-04 21:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Qian Cai, axboe, hch, peterz, gkohli, mingo,
	linux-mm, linux-kernel, Hugh Dickins

On Thu, 4 Jul 2019, Andrew Morton wrote:
> On Thu, 4 Jul 2019 18:03:01 +0200 Oleg Nesterov <oleg@redhat.com> wrote:
> 
> > swap_readpage() sets bio->bi_private = current even if synchronous is
> > false, which means the caller can get a spurious wakeup after return.
> > This can be fatal if blk_wake_io_task() does
> > set_current_state(TASK_RUNNING) after the caller has done
> > set_special_state(); in the worst case the kernel can crash in
> > do_task_dead().
> 
> I think we need a Fixes: and a cc:stable here?
> 
> IIRC, we're fixing 0619317ff8baa2 ("block: add polled wakeup task helper").

Yes, you are right.

But it caught me by surprise: I had been thinking this was a 5.2
regression. I guess something in 5.2 (doesn't matter what) has made it
significantly easier to hit: but now that I look at old records, I see
that I hit it once on 5.0-rc1, then never again until 5.2.

Thanks, and to Oleg,
Hugh

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-07-04 21:15 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-29 20:25 [PATCH] block: fix a crash in do_task_dead() Qian Cai
2019-05-29 20:31 ` Jens Axboe
2019-05-30  8:03 ` Peter Zijlstra
2019-05-31 21:12   ` Jens Axboe
2019-06-03 12:37     ` Peter Zijlstra
2019-06-03 12:44       ` Peter Zijlstra
2019-06-03 16:09         ` Oleg Nesterov
2019-06-03 16:19           ` Peter Zijlstra
2019-06-03 16:23       ` Jens Axboe
2019-06-05 15:04       ` Jens Axboe
2019-06-07 13:35         ` Peter Zijlstra
2019-06-07 14:23           ` Peter Zijlstra
2019-06-08  8:39             ` Jens Axboe
2019-06-10 13:13             ` Gaurav Kohli
2019-06-10 14:46               ` Oleg Nesterov
2019-06-11  4:39                 ` Gaurav Kohli
2019-06-30 23:06         ` Hugh Dickins
2019-07-01 14:22           ` Jens Axboe
2019-07-02 22:06             ` Andrew Morton
2019-07-03 17:35               ` Oleg Nesterov
2019-07-03 17:44                 ` Hugh Dickins
2019-07-04 16:00                   ` Oleg Nesterov
2019-07-03 17:52                 ` Jens Axboe
2019-05-30 11:15 ` Oleg Nesterov
2019-05-31 21:10   ` Jens Axboe
2019-07-04 16:03 ` [PATCH] swap_readpage: avoid blk_wake_io_task() if !synchronous Oleg Nesterov
2019-07-04 19:32   ` Andrew Morton
2019-07-04 21:15     ` Hugh Dickins
