linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] sched and membarrier probe_kernel_address fixes
@ 2019-09-03 16:00 Mathieu Desnoyers
  2019-09-03 16:00 ` [RFC PATCH 1/3] Fix: sched: task_rcu_dereference: check probe_kernel_address return value Mathieu Desnoyers
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Mathieu Desnoyers @ 2019-09-03 16:00 UTC (permalink / raw)
  To: Peter Zijlstra, Oleg Nesterov, Eric W. Biederman, Linus Torvalds,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar
  Cc: linux-kernel, Mathieu Desnoyers

There is an ongoing discussion [1] about the need to fix use of
probe_kernel_address in task_rcu_deference (or provide additional
existence guarantees), and add missing READ_ONCE and
probe_kernel_address when reading other cpu runqueue's
mm->membarrier_state.

This patch set simply adds the missing probe_kernel_address checks
and use, aiming to be easily backported to stable kernels. Changing
the existence guarantees of sighand and mm objects is expected to
deprecate those changes for future kernels, but it's unclear whether
those more intrusive changes will be acceptable for stable kernel
branches.

Thanks,

Mathieu

[1] https://lore.kernel.org/r/20190902162036.GS2369@hirez.programming.kicks-ass.net

Mathieu Desnoyers (3):
  Fix: sched: task_rcu_dereference: check probe_kernel_address return
    value
  Fix: sched/membarrier: READ_ONCE p->mm in membarrier_global_expedited
  Fix: sched/membarrier: use probe_kernel_address to read
    mm->membarrier_state

 kernel/exit.c             |  3 ++-
 kernel/sched/membarrier.c | 27 +++++++++++++++++++++++++--
 2 files changed, 27 insertions(+), 3 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 1/3] Fix: sched: task_rcu_dereference: check probe_kernel_address return value
  2019-09-03 16:00 [RFC PATCH 0/3] sched and membarrier probe_kernel_address fixes Mathieu Desnoyers
@ 2019-09-03 16:00 ` Mathieu Desnoyers
  2019-09-03 16:12   ` Linus Torvalds
  2019-09-03 16:00 ` [RFC PATCH 2/3] Fix: sched/membarrier: READ_ONCE p->mm in membarrier_global_expedited Mathieu Desnoyers
  2019-09-03 16:00 ` [RFC PATCH 3/3] Fix: sched/membarrier: use probe_kernel_address to read mm->membarrier_state Mathieu Desnoyers
  2 siblings, 1 reply; 10+ messages in thread
From: Mathieu Desnoyers @ 2019-09-03 16:00 UTC (permalink / raw)
  To: Peter Zijlstra, Oleg Nesterov, Eric W. Biederman, Linus Torvalds,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar
  Cc: linux-kernel, Mathieu Desnoyers

probe_kernel_address can return -EFAULT on error, which leads to use of
an uninitialized or partially initialized sighand variable.

There is ongoing discussion on removing task_rcu_dereference altogether,
which seems like a nice way forward. This patch is submitted as a fix
aiming to be backported to prior stable kernel releases.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/exit.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 5b4a5dcce8f8..b1c3e1ba501c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -249,7 +249,8 @@ struct task_struct *task_rcu_dereference(struct task_struct **ptask)
 	if (!task)
 		return NULL;
 
-	probe_kernel_address(&task->sighand, sighand);
+	if (probe_kernel_address(&task->sighand, sighand))
+		sighand = NULL;
 
 	/*
 	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 2/3] Fix: sched/membarrier: READ_ONCE p->mm in membarrier_global_expedited
  2019-09-03 16:00 [RFC PATCH 0/3] sched and membarrier probe_kernel_address fixes Mathieu Desnoyers
  2019-09-03 16:00 ` [RFC PATCH 1/3] Fix: sched: task_rcu_dereference: check probe_kernel_address return value Mathieu Desnoyers
@ 2019-09-03 16:00 ` Mathieu Desnoyers
  2019-09-03 16:23   ` Linus Torvalds
  2019-09-03 16:00 ` [RFC PATCH 3/3] Fix: sched/membarrier: use probe_kernel_address to read mm->membarrier_state Mathieu Desnoyers
  2 siblings, 1 reply; 10+ messages in thread
From: Mathieu Desnoyers @ 2019-09-03 16:00 UTC (permalink / raw)
  To: Peter Zijlstra, Oleg Nesterov, Eric W. Biederman, Linus Torvalds,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar
  Cc: linux-kernel, Mathieu Desnoyers

Due to the lack of READ_ONCE() on p->mm, this code can in fact turn into
a NULL deref when we hit do_exit() around exit_mm(). The first p->mm
read is before and sees !NULL, the second is after and does observe
NULL, which triggers a null pointer dereference.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/membarrier.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index aa8d75804108..02feb7c8da4f 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -72,12 +72,16 @@ static int membarrier_global_expedited(void)
 
 		rcu_read_lock();
 		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
-		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
+		if (p) {
+			struct mm_struct *mm = READ_ONCE(p->mm);
+
+			if (mm && (atomic_read(&mm->membarrier_state) &
 				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
-			if (!fallback)
-				__cpumask_set_cpu(cpu, tmpmask);
-			else
-				smp_call_function_single(cpu, ipi_mb, NULL, 1);
+				if (!fallback)
+					__cpumask_set_cpu(cpu, tmpmask);
+				else
+					smp_call_function_single(cpu, ipi_mb, NULL, 1);
+			}
 		}
 		rcu_read_unlock();
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 3/3] Fix: sched/membarrier: use probe_kernel_address to read mm->membarrier_state
  2019-09-03 16:00 [RFC PATCH 0/3] sched and membarrier probe_kernel_address fixes Mathieu Desnoyers
  2019-09-03 16:00 ` [RFC PATCH 1/3] Fix: sched: task_rcu_dereference: check probe_kernel_address return value Mathieu Desnoyers
  2019-09-03 16:00 ` [RFC PATCH 2/3] Fix: sched/membarrier: READ_ONCE p->mm in membarrier_global_expedited Mathieu Desnoyers
@ 2019-09-03 16:00 ` Mathieu Desnoyers
  2 siblings, 0 replies; 10+ messages in thread
From: Mathieu Desnoyers @ 2019-09-03 16:00 UTC (permalink / raw)
  To: Peter Zijlstra, Oleg Nesterov, Eric W. Biederman, Linus Torvalds,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar
  Cc: linux-kernel, Mathieu Desnoyers

membarrier_global_expedited reads each runqueue current task's
mm->membarrier_state to test a flag. A concurrent task exit can cause
the memory backing the struct mm to have been freed before that read.
Therefore, use probe_kernel_address to read that flag. If the read
fails, consider it as if the flag was unset: it means the scheduler code
provides the barriers required by membarrier, and sending an IPI to this
CPU is redundant.

There is ongoing discussion on removing the need to use
probe_kernel_address from this code by providing additional existence
guarantees for struct mm. This patch is submitted as a fix aiming to be
backported to prior stable kernel releases.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/membarrier.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 02feb7c8da4f..0312604d8315 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -58,6 +58,8 @@ static int membarrier_global_expedited(void)
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
 		struct task_struct *p;
+		struct mm_struct *mm;
+		int membarrier_state;
 
 		/*
 		 * Skipping the current CPU is OK even through we can be
@@ -72,17 +74,34 @@ static int membarrier_global_expedited(void)
 
 		rcu_read_lock();
 		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
-		if (p) {
-			struct mm_struct *mm = READ_ONCE(p->mm);
+		if (!p)
+			goto next;
+		mm = READ_ONCE(p->mm);
+		if (!mm)
+			goto next;
 
-			if (mm && (atomic_read(&mm->membarrier_state) &
-				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
-				if (!fallback)
-					__cpumask_set_cpu(cpu, tmpmask);
-				else
-					smp_call_function_single(cpu, ipi_mb, NULL, 1);
-			}
+		/*
+		 * Using probe_kernel_address() of membarrier_state instead of
+		 * an atomic_read() to deal with the fact that mm may have been
+		 * concurrently freed. If probe_kernel_address fails, it means
+		 * the mm struct was freed, so there is no need to issue a
+		 * barrier on that particular CPU, because the scheduler code
+		 * is taking care of it.
+		 *
+		 * It does not matter whether probe_kernel_address decides to
+		 * read membarrier_state piece-wise, given that we only care
+		 * about testing a single bit.
+		 */
+		if (probe_kernel_address(&mm->membarrier_state.counter,
+					 membarrier_state))
+			membarrier_state = 0;
+		if (membarrier_state & MEMBARRIER_STATE_GLOBAL_EXPEDITED) {
+			if (!fallback)
+				__cpumask_set_cpu(cpu, tmpmask);
+			else
+				smp_call_function_single(cpu, ipi_mb, NULL, 1);
 		}
+	next:
 		rcu_read_unlock();
 	}
 	if (!fallback) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/3] Fix: sched: task_rcu_dereference: check probe_kernel_address return value
  2019-09-03 16:00 ` [RFC PATCH 1/3] Fix: sched: task_rcu_dereference: check probe_kernel_address return value Mathieu Desnoyers
@ 2019-09-03 16:12   ` Linus Torvalds
  2019-09-03 16:56     ` Mathieu Desnoyers
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2019-09-03 16:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Oleg Nesterov, Eric W. Biederman,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing

On Tue, Sep 3, 2019 at 9:00 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> probe_kernel_address can return -EFAULT on error, which leads to use of
> an uninitialized or partially initialized sighand variable.

I think this comment and this code is actively misleading.

There is no "uninitialized or partially initialized sighand variable".
That's completely wrong.

The sighand variable is always completely initialized. It's just that
the check for "is it initialized" is _not_ the return value from
probe_kernel_address(), because that return value is simply not
sufficient.

So this is just wrong. Don't do it. You're just confusing the issue,
and you're making statments that aren't true in the commit message,
and making the code do a pointless and odd check.

If you want to change this code for legibility, you should just add a
comment above the probe_kernel_address() about why the return value is
ignored, and why the check _below_ that code verifies the value of
sighand with a different check.

                Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 2/3] Fix: sched/membarrier: READ_ONCE p->mm in membarrier_global_expedited
  2019-09-03 16:00 ` [RFC PATCH 2/3] Fix: sched/membarrier: READ_ONCE p->mm in membarrier_global_expedited Mathieu Desnoyers
@ 2019-09-03 16:23   ` Linus Torvalds
  2019-09-03 20:13     ` Mathieu Desnoyers
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2019-09-03 16:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Oleg Nesterov, Eric W. Biederman,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing

On Tue, Sep 3, 2019 at 9:00 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Due to the lack of READ_ONCE() on p->mm, this code can in fact turn into
> a NULL deref when we hit do_exit() around exit_mm(). The first p->mm
> read is before and sees !NULL, the second is after and does observe
> NULL, which triggers a null pointer dereference.

This is horribly ugly, and I don't think it is necessary.

The way to fix the problem you are addressing in patches 2-3 is to
move the MEMBARRIER_STATE_GLOBAL_EXPEDITED flag from the mm struct to
the task struct, and avoiding the whole issue with "mm may be released
at any point" that way.

Now, your reaction will be "but lots of threads can share an 'mm', so
we can't do that", but that doesn't seem to be true. Looking at the
place that _sets_ this, you already handle the single-thread cases
specially, and the multiple threads has to do a "synchronize_rcu()".
You might as well either walk the current CPU's and set it in all
threads where the thread->mm matches the mm. And then you make the
scheduler set the bit on newly scheduled entities.

NOTE! When you walk all current cpu's in
membarrier_register_global_expedited(), you only look at the
'task->mm' _value_, you don't dereference it. And that's ok, because
'task' itself is stable, it's just mm that can go away.

Wouldn't that solve the issue much more cleanly?

             Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/3] Fix: sched: task_rcu_dereference: check probe_kernel_address return value
  2019-09-03 16:12   ` Linus Torvalds
@ 2019-09-03 16:56     ` Mathieu Desnoyers
  2019-09-03 17:14       ` Linus Torvalds
  0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Desnoyers @ 2019-09-03 16:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Oleg Nesterov, Eric W. Biederman, Russell King,
	ARM Linux, Chris Metcalf, Chris Lameter, Kirill Tkhai,
	Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

----- On Sep 3, 2019, at 12:12 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Tue, Sep 3, 2019 at 9:00 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> probe_kernel_address can return -EFAULT on error, which leads to use of
>> an uninitialized or partially initialized sighand variable.
> 
> I think this comment and this code is actively misleading.
> 
> There is no "uninitialized or partially initialized sighand variable".
> That's completely wrong.
> 
> The sighand variable is always completely initialized. It's just that
> the check for "is it initialized" is _not_ the return value from
> probe_kernel_address(), because that return value is simply not
> sufficient.
> 
> So this is just wrong. Don't do it. You're just confusing the issue,
> and you're making statments that aren't true in the commit message,
> and making the code do a pointless and odd check.
> 
> If you want to change this code for legibility, you should just add a
> comment above the probe_kernel_address() about why the return value is
> ignored, and why the check _below_ that code verifies the value of
> sighand with a different check.

Then I must be misunderstanding something.

probe_kernel_address() is a macro wrapping probe_kernel_read().
mm/maccess.c:probe_kernel_read() calls probe_read_common()
mm/maccess.c:probe_read_common() calls __copy_from_user_inatomic()

include/linux/uaccess.h:__copy_from_user_inatomic() documents:

 * NOTE: only copy_from_user() zero-pads the destination in case of short copy.
 * Neither __copy_from_user() nor __copy_from_user_inatomic() zero anything
 * at all; their callers absolutely must check the return value.

So considering that comment, I suspect the on-stack sighand variable
within task_rcu_dereference() can be left either uninitialized or
(less likely) partially initialized if probe_kernel_address() returns
-EFAULT.

Is there anything else that prevents probe_kernel_address from failing ?
If so, why use probe_kernel_address in the first place ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/3] Fix: sched: task_rcu_dereference: check probe_kernel_address return value
  2019-09-03 16:56     ` Mathieu Desnoyers
@ 2019-09-03 17:14       ` Linus Torvalds
  2019-09-03 17:21         ` Mathieu Desnoyers
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2019-09-03 17:14 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Oleg Nesterov, Eric W. Biederman, Russell King,
	ARM Linux, Chris Metcalf, Chris Lameter, Kirill Tkhai,
	Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

On Tue, Sep 3, 2019 at 9:56 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Then I must be misunderstanding something.
>
> probe_kernel_address() is a macro wrapping probe_kernel_read().

Don't look at probe_kernel_address().

As long as you only look at that, you will be missing the big picture.

Instead, look at the code below it:

        /*
         * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
         * was already freed we can not miss the preceding update of this
         * pointer.
         */
        smp_rmb();
        if (unlikely(task != READ_ONCE(*ptask)))
                goto retry;


That code is the code that verifies "ok, the pointer was valid over
the whole sequence, so the probe_kernel_address() must have succeeded"

So the code *does* check for success, but it does so using a
*stronger* check than the return value of probe_kernel_address().

If the task on the runqueue hasn't changed, then the
probe_kernel_read() cannot have failed.

But the reverse test is not true: if the probe_kernel_read()
succeeded, that doesn't guarantee that the value we read was
consistent.

So the check for failure is there, and the check that does exist is
the correct and stronger check.

Which is why checking the return value of probe_kernel_read() is
immaterial and pointless.

But a comment about this above the probe_kernel_read() may indeed be
worth it, since it seems to be unclear to so many people.

The code basically just wants to do a kernel memory access, knowing
that it's speculative. And the _only_ reason for using
probe_kernel_read() is that with DEBUG_PAGEALLOC you might have a page
fault on the speculative access.

But we do the speculation verification check afterwards, and that's
the important part.

                      Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/3] Fix: sched: task_rcu_dereference: check probe_kernel_address return value
  2019-09-03 17:14       ` Linus Torvalds
@ 2019-09-03 17:21         ` Mathieu Desnoyers
  0 siblings, 0 replies; 10+ messages in thread
From: Mathieu Desnoyers @ 2019-09-03 17:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Oleg Nesterov, Eric W. Biederman, Russell King,
	ARM Linux, Chris Metcalf, Chris Lameter, Kirill Tkhai,
	Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

----- On Sep 3, 2019, at 1:14 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Tue, Sep 3, 2019 at 9:56 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> Then I must be misunderstanding something.
>>
>> probe_kernel_address() is a macro wrapping probe_kernel_read().
> 
> Don't look at probe_kernel_address().
> 
> As long as you only look at that, you will be missing the big picture.
> 
> Instead, look at the code below it:
> 
>        /*
>         * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
>         * was already freed we can not miss the preceding update of this
>         * pointer.
>         */
>        smp_rmb();
>        if (unlikely(task != READ_ONCE(*ptask)))
>                goto retry;
> 
> 
> That code is the code that verifies "ok, the pointer was valid over
> the whole sequence, so the probe_kernel_address() must have succeeded"
> 
> So the code *does* check for success, but it does so using a
> *stronger* check than the return value of probe_kernel_address().
> 
> If the task on the runqueue hasn't changed, then the
> probe_kernel_read() cannot have failed.
> 
> But the reverse test is not true: if the probe_kernel_read()
> succeeded, that doesn't guarantee that the value we read was
> consistent.
> 
> So the check for failure is there, and the check that does exist is
> the correct and stronger check.
> 
> Which is why checking the return value of probe_kernel_read() is
> immaterial and pointless.
> 
> But a comment about this above the probe_kernel_read() may indeed be
> worth it, since it seems to be unclear to so many people.
> 
> The code basically just wants to do a kernel memory access, knowing
> that it's speculative. And the _only_ reason for using
> probe_kernel_read() is that with DEBUG_PAGEALLOC you might have a page
> fault on the speculative access.
> 
> But we do the speculation verification check afterwards, and that's
> the important part.

Indeed, thanks for the explanation. Given that this code will likely be
changed by patchsets submitted by others which will possibly remove the
entire thing, and that it currently works as intended, I do not plan on
submitting any further patch to that function at this stage.

Thanks,

Mathieu



-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 2/3] Fix: sched/membarrier: READ_ONCE p->mm in membarrier_global_expedited
  2019-09-03 16:23   ` Linus Torvalds
@ 2019-09-03 20:13     ` Mathieu Desnoyers
  0 siblings, 0 replies; 10+ messages in thread
From: Mathieu Desnoyers @ 2019-09-03 20:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Oleg Nesterov, Eric W. Biederman, Russell King,
	ARM Linux, Chris Metcalf, Chris Lameter, Kirill Tkhai,
	Mike Galbraith, Thomas Gleixner, Ingo Molnar, linux-kernel

----- On Sep 3, 2019, at 12:23 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Tue, Sep 3, 2019 at 9:00 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> Due to the lack of READ_ONCE() on p->mm, this code can in fact turn into
>> a NULL deref when we hit do_exit() around exit_mm(). The first p->mm
>> read is before and sees !NULL, the second is after and does observe
>> NULL, which triggers a null pointer dereference.
> 
> This is horribly ugly, and I don't think it is necessary.
> 
> The way to fix the problem you are addressing in patches 2-3 is to
> move the MEMBARRIER_STATE_GLOBAL_EXPEDITED flag from the mm struct to
> the task struct, and avoiding the whole issue with "mm may be released
> at any point" that way.
> 
> Now, your reaction will be "but lots of threads can share an 'mm', so
> we can't do that", but that doesn't seem to be true. Looking at the
> place that _sets_ this, you already handle the single-thread cases
> specially, and the multiple threads has to do a "synchronize_rcu()".
> You might as well either walk the current CPU's and set it in all
> threads where the thread->mm matches the mm. And then you make the
> scheduler set the bit on newly scheduled entities.
> 
> NOTE! When you walk all current cpu's in
> membarrier_register_global_expedited(), you only look at the
> 'task->mm' _value_, you don't dereference it. And that's ok, because
> 'task' itself is stable, it's just mm that can go away.
> 
> Wouldn't that solve the issue much more cleanly?

Indeed it would! I just sent a patch as RFC implementing your idea.

Thanks!

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-09-03 20:13 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-03 16:00 [RFC PATCH 0/3] sched and membarrier probe_kernel_address fixes Mathieu Desnoyers
2019-09-03 16:00 ` [RFC PATCH 1/3] Fix: sched: task_rcu_dereference: check probe_kernel_address return value Mathieu Desnoyers
2019-09-03 16:12   ` Linus Torvalds
2019-09-03 16:56     ` Mathieu Desnoyers
2019-09-03 17:14       ` Linus Torvalds
2019-09-03 17:21         ` Mathieu Desnoyers
2019-09-03 16:00 ` [RFC PATCH 2/3] Fix: sched/membarrier: READ_ONCE p->mm in membarrier_global_expedited Mathieu Desnoyers
2019-09-03 16:23   ` Linus Torvalds
2019-09-03 20:13     ` Mathieu Desnoyers
2019-09-03 16:00 ` [RFC PATCH 3/3] Fix: sched/membarrier: use probe_kernel_address to read mm->membarrier_state Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).