LKML Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH] sched/fair: don't NUMA balance for kthreads
@ 2020-05-26 15:38 Jens Axboe
  2020-05-26 16:37 ` [tip: sched/urgent] sched/fair: Don't " tip-bot2 for Jens Axboe
  2020-05-26 16:40 ` [PATCH] sched/fair: don't " Valentin Schneider
  0 siblings, 2 replies; 5+ messages in thread
From: Jens Axboe @ 2020-05-26 15:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, Stefano Garzarella

Stefano reported a crash with using SQPOLL with io_uring:

BUG: kernel NULL pointer dereference, address: 00000000000003b0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 800000046c042067 P4D 800000046c042067 PUD 461fcf067 PMD 0 
Oops: 0000 [#1] SMP PTI
CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.2.6 06/08/2015
RIP: 0010:task_numa_work+0x4f/0x2c0
Code: 18 4c 8b 25 e3 f0 8e 01 49 8b 9f 00 08 00 00 4d 8b af c8 00 00 00 49 39 c7 0f 85 e8 01 00 00 48 89 6d 00 41 f6 47 24 04 75 67 <48> 8b ab b0 03 00 00 48 85 ed 75 16 8b 3d 6f 68 94 01 e8 aa fb 04
RSP: 0018:ffffaaa98415be10 EFLAGS: 00010246
RAX: ffff953ee36b8000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff953ee36b8000 RDI: ffff953ee36b8dc8
RBP: ffff953ee36b8dc8 R08: 00000000001200db R09: ffff9542e3ad2e08
R10: ffff9542ecd20070 R11: 0000000000000000 R12: 00000000fffca35b
R13: 000000012a06a949 R14: ffff9542e3ad2c00 R15: ffff953ee36b8000
FS:  0000000000000000(0000) GS:ffff953eefc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000003b0 CR3: 000000046bbd0002 CR4: 00000000001606e0
Call Trace:
 task_work_run+0x68/0xa0
 io_sq_thread+0x252/0x3d0
 ? finish_wait+0x80/0x80
 kthread+0xf9/0x130
 ? __ia32_sys_io_uring_enter+0x370/0x370
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40

which is task_numa_work() oopsing on current->mm being NULL. The task
work is queued by task_tick_numa(), which checks if current->mm is NULL
at the time of the call. But this state isn't necessarily persistent,
if the kthread is using use_mm() to temporarily adopt the mm of a task.

Change the task_tick_numa() check to exclude kernel threads in general,
as it doesn't make sense to attempt ot balance for kthreads anyway.

Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

This should go into 5.7 imho

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..403556fc16d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2908,7 +2908,7 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	/*
 	 * We don't care about NUMA placement if we don't have memory.
 	 */
-	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+	if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
 		return;
 
 	/*

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [tip: sched/urgent] sched/fair: Don't NUMA balance for kthreads
  2020-05-26 15:38 [PATCH] sched/fair: don't NUMA balance for kthreads Jens Axboe
@ 2020-05-26 16:37 ` tip-bot2 for Jens Axboe
  2020-05-26 16:40 ` [PATCH] sched/fair: don't " Valentin Schneider
  1 sibling, 0 replies; 5+ messages in thread
From: tip-bot2 for Jens Axboe @ 2020-05-26 16:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Stefano Garzarella, Jens Axboe, Ingo Molnar, Peter Zijlstra, x86, LKML

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     18f855e574d9799a0e7489f8ae6fd8447d0dd74a
Gitweb:        https://git.kernel.org/tip/18f855e574d9799a0e7489f8ae6fd8447d0dd74a
Author:        Jens Axboe <axboe@kernel.dk>
AuthorDate:    Tue, 26 May 2020 09:38:31 -06:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 26 May 2020 18:34:58 +02:00

sched/fair: Don't NUMA balance for kthreads

Stefano reported a crash with using SQPOLL with io_uring:

  BUG: kernel NULL pointer dereference, address: 00000000000003b0
  CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
  RIP: 0010:task_numa_work+0x4f/0x2c0
  Call Trace:
   task_work_run+0x68/0xa0
   io_sq_thread+0x252/0x3d0
   kthread+0xf9/0x130
   ret_from_fork+0x35/0x40

which is task_numa_work() oopsing on current->mm being NULL.

The task work is queued by task_tick_numa(), which checks if current->mm is
NULL at the time of the call. But this state isn't necessarily persistent,
if the kthread is using use_mm() to temporarily adopt the mm of a task.

Change the task_tick_numa() check to exclude kernel threads in general,
as it doesn't make sense to attempt ot balance for kthreads anyway.

Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/865de121-8190-5d30-ece5-3b097dc74431@kernel.dk
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 538ba5d..da3e5b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2908,7 +2908,7 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	/*
 	 * We don't care about NUMA placement if we don't have memory.
 	 */
-	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+	if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
 		return;
 
 	/*

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] sched/fair: don't NUMA balance for kthreads
  2020-05-26 15:38 [PATCH] sched/fair: don't NUMA balance for kthreads Jens Axboe
  2020-05-26 16:37 ` [tip: sched/urgent] sched/fair: Don't " tip-bot2 for Jens Axboe
@ 2020-05-26 16:40 ` Valentin Schneider
  2020-05-26 20:00   ` Peter Zijlstra
  1 sibling, 1 reply; 5+ messages in thread
From: Valentin Schneider @ 2020-05-26 16:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Stefano Garzarella


On 26/05/20 16:38, Jens Axboe wrote:
> Stefano reported a crash with using SQPOLL with io_uring:
>
> BUG: kernel NULL pointer dereference, address: 00000000000003b0
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 800000046c042067 P4D 800000046c042067 PUD 461fcf067 PMD 0
> Oops: 0000 [#1] SMP PTI
> CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
> Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.2.6 06/08/2015
> RIP: 0010:task_numa_work+0x4f/0x2c0
> Code: 18 4c 8b 25 e3 f0 8e 01 49 8b 9f 00 08 00 00 4d 8b af c8 00 00 00 49 39 c7 0f 85 e8 01 00 00 48 89 6d 00 41 f6 47 24 04 75 67 <48> 8b ab b0 03 00 00 48 85 ed 75 16 8b 3d 6f 68 94 01 e8 aa fb 04
> RSP: 0018:ffffaaa98415be10 EFLAGS: 00010246
> RAX: ffff953ee36b8000 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 0000000000000001 RSI: ffff953ee36b8000 RDI: ffff953ee36b8dc8
> RBP: ffff953ee36b8dc8 R08: 00000000001200db R09: ffff9542e3ad2e08
> R10: ffff9542ecd20070 R11: 0000000000000000 R12: 00000000fffca35b
> R13: 000000012a06a949 R14: ffff9542e3ad2c00 R15: ffff953ee36b8000
> FS:  0000000000000000(0000) GS:ffff953eefc40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000003b0 CR3: 000000046bbd0002 CR4: 00000000001606e0
> Call Trace:
>  task_work_run+0x68/0xa0
>  io_sq_thread+0x252/0x3d0
>  ? finish_wait+0x80/0x80
>  kthread+0xf9/0x130
>  ? __ia32_sys_io_uring_enter+0x370/0x370
>  ? kthread_park+0x90/0x90
>  ret_from_fork+0x35/0x40
>
> which is task_numa_work() oopsing on current->mm being NULL. The task
> work is queued by task_tick_numa(), which checks if current->mm is NULL
> at the time of the call. But this state isn't necessarily persistent,
> if the kthread is using use_mm() to temporarily adopt the mm of a task.
>
> Change the task_tick_numa() check to exclude kernel threads in general,
> as it doesn't make sense to attempt ot balance for kthreads anyway.
>

Does it? (this isn't a rethorical question)

Suppose a given kthread ends up doing more accesses to some pages
(via use_mm()) than the other threads that access them, wouldn't it make
sense to take that into account when it comes to NUMA balancing?

AFAIA that's not something that really happens in the wild, and as you
say we don't have a (guaranteed) persistent state to work with in those
cases, so it's more of a pipe dream. With that in mind, I think the
proposed change is fine.

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

> Reported-by: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> ---
>
> This should go into 5.7 imho
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 02f323b85b6d..403556fc16d4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2908,7 +2908,7 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
>       /*
>        * We don't care about NUMA placement if we don't have memory.
>        */
> -	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
> +	if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
>               return;
>
>       /*

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] sched/fair: don't NUMA balance for kthreads
  2020-05-26 16:40 ` [PATCH] sched/fair: don't " Valentin Schneider
@ 2020-05-26 20:00   ` Peter Zijlstra
  2020-05-26 23:42     ` Valentin Schneider
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2020-05-26 20:00 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Jens Axboe, Ingo Molnar, linux-kernel, Stefano Garzarella

On Tue, May 26, 2020 at 05:40:06PM +0100, Valentin Schneider wrote:

> > Change the task_tick_numa() check to exclude kernel threads in general,
> > as it doesn't make sense to attempt ot balance for kthreads anyway.
> >
> 
> Does it? (this isn't a rethorical question)
> 
> Suppose a given kthread ends up doing more accesses to some pages
> (via use_mm()) than the other threads that access them, wouldn't it make
> sense to take that into account when it comes to NUMA balancing?

Well, task_tick_numa() tries and farm off a bunch of actual work to
task_work_add(), and there's so very little userspace for a kernel
thread to return to... :-)



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] sched/fair: don't NUMA balance for kthreads
  2020-05-26 20:00   ` Peter Zijlstra
@ 2020-05-26 23:42     ` Valentin Schneider
  0 siblings, 0 replies; 5+ messages in thread
From: Valentin Schneider @ 2020-05-26 23:42 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Jens Axboe, Ingo Molnar, linux-kernel, Stefano Garzarella


On 26/05/20 21:00, Peter Zijlstra wrote:
> On Tue, May 26, 2020 at 05:40:06PM +0100, Valentin Schneider wrote:
>
>> > Change the task_tick_numa() check to exclude kernel threads in general,
>> > as it doesn't make sense to attempt ot balance for kthreads anyway.
>> >
>>
>> Does it? (this isn't a rethorical question)
>>
>> Suppose a given kthread ends up doing more accesses to some pages
>> (via use_mm()) than the other threads that access them, wouldn't it make
>> sense to take that into account when it comes to NUMA balancing?
>
> Well, task_tick_numa() tries and farm off a bunch of actual work to
> task_work_add(), and there's so very little userspace for a kernel
> thread to return to... :-)

Err, true... I did say pipe dreams!

I had only really taken note of the exit / return to userspace
callbacks, but I see io_uring has its own task_work_run() calls, which
(I think) explains how we can end up with a kthread actually running
task_numa_work().

I'm also thinking we really don't want that task_numa_work() to be left
hanging on the task_work list, because that self-looping thing will not
play nice to whatever else has been queued (which AFAICT shouldn't happen
under normal conditions, i.e. !kthreads).

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, back to index

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-26 15:38 [PATCH] sched/fair: don't NUMA balance for kthreads Jens Axboe
2020-05-26 16:37 ` [tip: sched/urgent] sched/fair: Don't " tip-bot2 for Jens Axboe
2020-05-26 16:40 ` [PATCH] sched/fair: don't " Valentin Schneider
2020-05-26 20:00   ` Peter Zijlstra
2020-05-26 23:42     ` Valentin Schneider

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git