LKML Archive on lore.kernel.org
 help / color / Atom feed
From: Valentin Schneider <valentin.schneider@arm.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	"linux-kernel\@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Stefano Garzarella <sgarzare@redhat.com>
Subject: Re: [PATCH] sched/fair: don't NUMA balance for kthreads
Date: Tue, 26 May 2020 17:40:06 +0100
Message-ID: <jhjpnaqtztl.mognet@arm.com> (raw)
In-Reply-To: <865de121-8190-5d30-ece5-3b097dc74431@kernel.dk>


On 26/05/20 16:38, Jens Axboe wrote:
> Stefano reported a crash with using SQPOLL with io_uring:
>
> BUG: kernel NULL pointer dereference, address: 00000000000003b0
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 800000046c042067 P4D 800000046c042067 PUD 461fcf067 PMD 0
> Oops: 0000 [#1] SMP PTI
> CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
> Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.2.6 06/08/2015
> RIP: 0010:task_numa_work+0x4f/0x2c0
> Code: 18 4c 8b 25 e3 f0 8e 01 49 8b 9f 00 08 00 00 4d 8b af c8 00 00 00 49 39 c7 0f 85 e8 01 00 00 48 89 6d 00 41 f6 47 24 04 75 67 <48> 8b ab b0 03 00 00 48 85 ed 75 16 8b 3d 6f 68 94 01 e8 aa fb 04
> RSP: 0018:ffffaaa98415be10 EFLAGS: 00010246
> RAX: ffff953ee36b8000 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 0000000000000001 RSI: ffff953ee36b8000 RDI: ffff953ee36b8dc8
> RBP: ffff953ee36b8dc8 R08: 00000000001200db R09: ffff9542e3ad2e08
> R10: ffff9542ecd20070 R11: 0000000000000000 R12: 00000000fffca35b
> R13: 000000012a06a949 R14: ffff9542e3ad2c00 R15: ffff953ee36b8000
> FS:  0000000000000000(0000) GS:ffff953eefc40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000003b0 CR3: 000000046bbd0002 CR4: 00000000001606e0
> Call Trace:
>  task_work_run+0x68/0xa0
>  io_sq_thread+0x252/0x3d0
>  ? finish_wait+0x80/0x80
>  kthread+0xf9/0x130
>  ? __ia32_sys_io_uring_enter+0x370/0x370
>  ? kthread_park+0x90/0x90
>  ret_from_fork+0x35/0x40
>
> which is task_numa_work() oopsing on current->mm being NULL. The task
> work is queued by task_tick_numa(), which checks if current->mm is NULL
> at the time of the call. But this state isn't necessarily persistent,
> if the kthread is using use_mm() to temporarily adopt the mm of a task.
>
> Change the task_tick_numa() check to exclude kernel threads in general,
> as it doesn't make sense to attempt ot balance for kthreads anyway.
>

Does it? (this isn't a rethorical question)

Suppose a given kthread ends up doing more accesses to some pages
(via use_mm()) than the other threads that access them, wouldn't it make
sense to take that into account when it comes to NUMA balancing?

AFAIA that's not something that really happens in the wild, and as you
say we don't have a (guaranteed) persistent state to work with in those
cases, so it's more of a pipe dream. With that in mind, I think the
proposed change is fine.

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

> Reported-by: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> ---
>
> This should go into 5.7 imho
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 02f323b85b6d..403556fc16d4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2908,7 +2908,7 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
>       /*
>        * We don't care about NUMA placement if we don't have memory.
>        */
> -	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
> +	if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
>               return;
>
>       /*

  parent reply index

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-26 15:38 Jens Axboe
2020-05-26 16:37 ` [tip: sched/urgent] sched/fair: Don't " tip-bot2 for Jens Axboe
2020-05-26 16:40 ` Valentin Schneider [this message]
2020-05-26 20:00   ` [PATCH] sched/fair: don't " Peter Zijlstra
2020-05-26 23:42     ` Valentin Schneider

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=jhjpnaqtztl.mognet@arm.com \
    --to=valentin.schneider@arm.com \
    --cc=axboe@kernel.dk \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=sgarzare@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git