linux-kernel.vger.kernel.org archive mirror
* [PATCH] ipc/mqueue: Avoid relying on a stack reference past its expiry
@ 2021-05-04 15:55 Varad Gautam
  2021-05-04 21:53 ` Davidlohr Bueso
  2021-05-08 17:12 ` Manfred Spraul
  0 siblings, 2 replies; 10+ messages in thread
From: Varad Gautam @ 2021-05-04 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: varad.gautam, Matthias von Faber, Christian Brauner,
	Oleg Nesterov, Eric W. Biederman, Manfred Spraul, Andrew Morton,
	Davidlohr Bueso

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call
pipelined_send.

This leads to a very hard-to-trigger race in which a do_mq_timedreceive
call can return and leave do_mq_timedsend relying on an invalid address,
causing the following crash:

[  240.739977] RIP: 0010:wake_q_add_safe+0x13/0x60
[  240.739991] Call Trace:
[  240.739999]  __x64_sys_mq_timedsend+0x2a9/0x490
[  240.740003]  ? auditd_test_task+0x38/0x40
[  240.740007]  ? auditd_test_task+0x38/0x40
[  240.740011]  do_syscall_64+0x80/0x680
[  240.740017]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  240.740019] RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of a
`struct ext_wait_queue` on its function stack (referred to as `ewq_addr`
here) - it remains a valid `struct ext_wait_queue *` as long as the
stack has not been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state, STATE_READY).
Here is where the race window begins. (`this` is `ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break. `ewq_addr` gets removed from
info->e_wait_q[RECV].list.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to point to a valid `struct ext_wait_queue` since it was on
do_mq_timedreceive's stack. (Although the address may not get
overwritten until another function happens to touch it, which means it
can persist for an indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass
to the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing
`this` which sits on the receiver's stack.

Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>

---
 ipc/mqueue.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 8031464ed4ae2..8f78057c6be53 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -1004,12 +1004,14 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
 				  struct mqueue_inode_info *info,
 				  struct ext_wait_queue *this)
 {
+	struct task_struct *t;
+
 	list_del(&this->list);
-	get_task_struct(this->task);
+	t = get_task_struct(this->task);
 
 	/* see MQ_BARRIER for purpose/pairing */
 	smp_store_release(&this->state, STATE_READY);
-	wake_q_add_safe(wake_q, this->task);
+	wake_q_add_safe(wake_q, t);
 }
 
 /* pipelined_send() - send a message directly to the task waiting in
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] ipc/mqueue: Avoid relying on a stack reference past its expiry
  2021-05-04 15:55 [PATCH] ipc/mqueue: Avoid relying on a stack reference past its expiry Varad Gautam
@ 2021-05-04 21:53 ` Davidlohr Bueso
  2021-05-05  7:49   ` Varad Gautam
  2021-05-08 17:21   ` Manfred Spraul
  2021-05-08 17:12 ` Manfred Spraul
  1 sibling, 2 replies; 10+ messages in thread
From: Davidlohr Bueso @ 2021-05-04 21:53 UTC (permalink / raw)
  To: Varad Gautam
  Cc: linux-kernel, Matthias von Faber, Christian Brauner,
	Oleg Nesterov, Eric W. Biederman, Manfred Spraul, Andrew Morton

On 2021-05-04 08:55, Varad Gautam wrote:
> do_mq_timedreceive calls wq_sleep with a stack local address. The
> sender (do_mq_timedsend) uses this address to later call
> pipelined_send.
> 
> This leads to a very hard to trigger race where a do_mq_timedreceive call
> might return and leave do_mq_timedsend to rely on an invalid address,
> causing the following crash:
> 
> [  240.739977] RIP: 0010:wake_q_add_safe+0x13/0x60
> [  240.739991] Call Trace:
> [  240.739999]  __x64_sys_mq_timedsend+0x2a9/0x490
> [  240.740003]  ? auditd_test_task+0x38/0x40
> [  240.740007]  ? auditd_test_task+0x38/0x40
> [  240.740011]  do_syscall_64+0x80/0x680
> [  240.740017]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  240.740019] RIP: 0033:0x7f5928e40343
> 
> The race occurs as:
> 
> 1. do_mq_timedreceive calls wq_sleep with the address of
> `struct ext_wait_queue` on function stack (aliased as `ewq_addr` here)
> - it holds a valid `struct ext_wait_queue *` as long as the stack has
> not been overwritten.
> 
> 2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
> do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
> __pipelined_op.
> 
> 3. Sender calls __pipelined_op::smp_store_release(&this->state, STATE_READY).
> Here is where the race window begins. (`this` is `ewq_addr`.)
> 
> 4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
> will see `state == STATE_READY` and break. `ewq_addr` gets removed from
> info->e_wait_q[RECV].list.
> 
> 5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
> to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
> stack. (Although the address may not get overwritten until another
> function happens to touch it, which means it can persist around for an
> indefinite time.)
> 
> 6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
> `struct ext_wait_queue *`, and uses it to find a task_struct to pass
> to the wake_q_add_safe call. In the lucky case where nothing has
> overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
> In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
> bogus address as the receiver's task_struct causing the crash.
> 
> do_mq_timedsend::__pipelined_op() should not dereference `this` after
> setting STATE_READY, as the receiver counterpart is now free to return.
> Change __pipelined_op to call wake_q_add_safe on the receiver's
> task_struct returned by get_task_struct, instead of dereferencing
> `this` which sits on the receiver's stack.
> 
> Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")

Right, historically we've always ensured that the waker does the state
ready write as the last operation, with or without lockless wake_qs. And
this commit broke this:

@@ -923,16 +988,11 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
                                   struct ext_wait_queue *this)
  {
         list_del(&this->list);
-       wake_q_add(wake_q, this->task);
-       /*
-        * Rely on the implicit cmpxchg barrier from wake_q_add such
-        * that we can ensure that updating receiver->state is the last
-        * write operation: As once set, the receiver can continue,
-        * and if we don't have the reference count from the wake_q,
-        * yet, at that point we can later have a use-after-free
-        * condition and bogus wakeup.
-        */
-       this->state = STATE_READY;
+       get_task_struct(this->task);
+
+       /* see MQ_BARRIER for purpose/pairing */
+       smp_store_release(&this->state, STATE_READY);
+       wake_q_add_safe(wake_q, this->task);
  }

.. so while addressing the race against get_task_struct() vs wakee exit
we ended up breaking the case where the task returns before the task is
added to the wake_q (which actually we explicitly re-ordered :). So at
this point we know that the ->state = STATE_READY must be done after the
whole of the wake_q addition operation.

Instead, how about the following which closes the race altogether and
simplifies the code. This basically goes back to a correct version of
fa6004ad4528 (ipc/mqueue: Implement lockless pipelined wakeups). And by
correct I mean keeping the smp_store_release() which ensures the proper
wakeup semantics.

Thanks,
Davidlohr

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 8031464ed4ae..43f0ae61c40b 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -76,14 +76,15 @@ struct posix_msg_tree_node {
   *   acquiring info->lock.
   *
   * MQ_BARRIER:
- * To achieve proper release/acquire memory barrier pairing, the state is set to
- * STATE_READY with smp_store_release(), and it is read with READ_ONCE followed
- * by smp_acquire__after_ctrl_dep(). In addition, wake_q_add_safe() is used.
+ * To achieve proper release/acquire memory barrier pairing, the state is
+ * set to STATE_READY with smp_store_release() such that it is the last write
+ * operation, in which afterwards the blocked task can immediately return and
+ * exit. It is read with READ_ONCE followed by smp_acquire__after_ctrl_dep().
   *
   * This prevents the following races:
   *
- * 1) With the simple wake_q_add(), the task could be gone already before
- *    the increase of the reference happens
+ * 1) With the wake_q_add(), the task could be gone already before
+ *    the increase of the reference happens:
   * Thread A
   *				Thread B
   * WRITE_ONCE(wait.state, STATE_NONE);
@@ -97,10 +98,27 @@ struct posix_msg_tree_node {
   * sys_exit()
   *				get_task_struct() // UaF
   *
- * Solution: Use wake_q_add_safe() and perform the get_task_struct() before
+ *
+ * 2) With the wake_q_add(), the waiter task could have returned from the
+ *    syscall and overwritten its task-allocated waiter before the sender
+ *    can be added to the wake_q:
+ * Thread A
+ *				Thread B
+ * WRITE_ONCE(wait.state, STATE_NONE);
+ * schedule_hrtimeout()
+ *                              ->state = STATE_READY
+ * <timeout returns>
+ * if (wait.state == STATE_READY) return;
+ * sysret to user space
+ * overwrite receiver's stack
+ *				wake_q_add(A)
+ *				if (cmpxchg()) // corrupted waiter
+ *
+ * Solution: Use wake_q_add() and queue the task for wakeup before
   * the smp_store_release() that does ->state = STATE_READY.
   *
- * 2) Without proper _release/_acquire barriers, the woken up task
+ *
+ * 3) Without proper _release/_acquire barriers, the woken up task
   *    could read stale data
   *
   * Thread A
@@ -116,7 +134,7 @@ struct posix_msg_tree_node {
   *
   * Solution: use _release and _acquire barriers.
   *
- * 3) There is intentionally no barrier when setting current->state
+ * 4) There is intentionally no barrier when setting current->state
   *    to TASK_INTERRUPTIBLE: spin_unlock(&info->lock) provides the
   *    release memory barrier, and the wakeup is triggered when holding
   *    info->lock, i.e. spin_lock(&info->lock) provided a pairing
@@ -1005,11 +1023,9 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
  				  struct ext_wait_queue *this)
  {
  	list_del(&this->list);
-	get_task_struct(this->task);
-
+	wake_q_add(wake_q, this->task);
  	/* see MQ_BARRIER for purpose/pairing */
  	smp_store_release(&this->state, STATE_READY);
-	wake_q_add_safe(wake_q, this->task);
  }

  /* pipelined_send() - send a message directly to the task waiting in


* Re: [PATCH] ipc/mqueue: Avoid relying on a stack reference past its expiry
  2021-05-04 21:53 ` Davidlohr Bueso
@ 2021-05-05  7:49   ` Varad Gautam
  2021-05-05 15:11     ` Davidlohr Bueso
  2021-05-08 17:21   ` Manfred Spraul
  1 sibling, 1 reply; 10+ messages in thread
From: Varad Gautam @ 2021-05-05  7:49 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, Matthias von Faber, Christian Brauner,
	Oleg Nesterov, Eric W. Biederman, Manfred Spraul, Andrew Morton

Hi Davidlohr,

On 5/4/21 11:53 PM, Davidlohr Bueso wrote:
> On 2021-05-04 08:55, Varad Gautam wrote:
>> do_mq_timedreceive calls wq_sleep with a stack local address. The
>> sender (do_mq_timedsend) uses this address to later call
>> pipelined_send.
>>
>> This leads to a very hard to trigger race where a do_mq_timedreceive call
>> might return and leave do_mq_timedsend to rely on an invalid address,
>> causing the following crash:
>>
>> [  240.739977] RIP: 0010:wake_q_add_safe+0x13/0x60
>> [  240.739991] Call Trace:
>> [  240.739999]  __x64_sys_mq_timedsend+0x2a9/0x490
>> [  240.740003]  ? auditd_test_task+0x38/0x40
>> [  240.740007]  ? auditd_test_task+0x38/0x40
>> [  240.740011]  do_syscall_64+0x80/0x680
>> [  240.740017]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  240.740019] RIP: 0033:0x7f5928e40343
>>
>> The race occurs as:
>>
>> 1. do_mq_timedreceive calls wq_sleep with the address of
>> `struct ext_wait_queue` on function stack (aliased as `ewq_addr` here)
>> - it holds a valid `struct ext_wait_queue *` as long as the stack has
>> not been overwritten.
>>
>> 2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
>> do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
>> __pipelined_op.
>>
>> 3. Sender calls __pipelined_op::smp_store_release(&this->state, STATE_READY).
>> Here is where the race window begins. (`this` is `ewq_addr`.)
>>
>> 4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
>> will see `state == STATE_READY` and break. `ewq_addr` gets removed from
>> info->e_wait_q[RECV].list.
>>
>> 5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
>> to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
>> stack. (Although the address may not get overwritten until another
>> function happens to touch it, which means it can persist around for an
>> indefinite time.)
>>
>> 6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
>> `struct ext_wait_queue *`, and uses it to find a task_struct to pass
>> to the wake_q_add_safe call. In the lucky case where nothing has
>> overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
>> In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
>> bogus address as the receiver's task_struct causing the crash.
>>
>> do_mq_timedsend::__pipelined_op() should not dereference `this` after
>> setting STATE_READY, as the receiver counterpart is now free to return.
>> Change __pipelined_op to call wake_q_add_safe on the receiver's
>> task_struct returned by get_task_struct, instead of dereferencing
>> `this` which sits on the receiver's stack.
>>
>> Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
> 
> Right, historically we've always ensured that the waker does the state ready
> write as the last operation, with or without lockless wake_qs. And this commit
> broke this:
> 
> @@ -923,16 +988,11 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
>                                   struct ext_wait_queue *this)
>  {
>         list_del(&this->list);
> -       wake_q_add(wake_q, this->task);
> -       /*
> -        * Rely on the implicit cmpxchg barrier from wake_q_add such
> -        * that we can ensure that updating receiver->state is the last
> -        * write operation: As once set, the receiver can continue,
> -        * and if we don't have the reference count from the wake_q,
> -        * yet, at that point we can later have a use-after-free
> -        * condition and bogus wakeup.
> -        */
> -       this->state = STATE_READY;
> +       get_task_struct(this->task);
> +
> +       /* see MQ_BARRIER for purpose/pairing */
> +       smp_store_release(&this->state, STATE_READY);
> +       wake_q_add_safe(wake_q, this->task);
>  }
> 
> .. so while addressing the race against get_task_struct() vs wakee exit we ended
> up breaking the case where the task returns before the task is added to the wake_q
> (which actually we explicitly re-orded :). So at this point we know that the
> ->state = STATE_READY must be done after the whole of the wake_q addition operation.
> 

The race here really is about the lifetime of __pipelined_op's `this` argument only
being guaranteed for some duration of the call (ie, until someone sets
->state = STATE_READY). It is not about when wake_q addition happens, as long as it is
being fed a valid task_struct.

Commit c5b2cbdbdac5 (ipc/mqueue.c: update/document memory barriers) aims at the right
spot wrt. reordering, except for relying on the `this` arg to find a task_struct for
wake_q addition.

> Instead, how about the following which closes the race altogether and simplifies the
> code. This basically goes back to a correct version of fa6004ad4528
> (ipc/mqueue: Implement lockless pipelined wakeups). And by correct I mean keeping the
> smp_store_release() which ensures the proper wakeup semantics.
> 

I considered that initially, but given that the race isn't connected with wakeup, I
preferred the current approach which makes this fact explicit by showing what's
valid/invalid during __pipelined_op.

Thanks,
Varad

> Thanks,
> Davidlohr
> 
> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> index 8031464ed4ae..43f0ae61c40b 100644
> --- a/ipc/mqueue.c
> +++ b/ipc/mqueue.c
> @@ -76,14 +76,15 @@ struct posix_msg_tree_node {
>   *   acquiring info->lock.
>   *
>   * MQ_BARRIER:
> - * To achieve proper release/acquire memory barrier pairing, the state is set to
> - * STATE_READY with smp_store_release(), and it is read with READ_ONCE followed
> - * by smp_acquire__after_ctrl_dep(). In addition, wake_q_add_safe() is used.
> + * To achieve proper release/acquire memory barrier pairing, the state is
> + * set to STATE_READY with smp_store_release() such that it is the last write
> + * operation, in which afterwards the blocked task can immediately return and
> + * exit. It is read with READ_ONCE followed by smp_acquire__after_ctrl_dep().
>   *
>   * This prevents the following races:
>   *
> - * 1) With the simple wake_q_add(), the task could be gone already before
> - *    the increase of the reference happens
> + * 1) With the wake_q_add(), the task could be gone already before
> + *    the increase of the reference happens:
>   * Thread A
>   *                Thread B
>   * WRITE_ONCE(wait.state, STATE_NONE);
> @@ -97,10 +98,27 @@ struct posix_msg_tree_node {
>   * sys_exit()
>   *                get_task_struct() // UaF
>   *
> - * Solution: Use wake_q_add_safe() and perform the get_task_struct() before
> + *
> + * 2) With the wake_q_add(), the waiter task could have returned from the
> + *    syscall and overwritten its task-allocated waiter before the sender
> + *    can be added to the wake_q:
> + * Thread A
> + *                Thread B
> + * WRITE_ONCE(wait.state, STATE_NONE);
> + * schedule_hrtimeout()
> + *                              ->state = STATE_READY
> + * <timeout returns>
> + * if (wait.state == STATE_READY) return;
> + * sysret to user space
> + * overwrite receiver's stack
> + *                wake_q_add(A)
> + *                if (cmpxchg()) // corrupted waiter
> + *
> + * Solution: Use wake_q_add() and queue the task for wakeup before
>   * the smp_store_release() that does ->state = STATE_READY.
>   *
> - * 2) Without proper _release/_acquire barriers, the woken up task
> + *
> + * 3) Without proper _release/_acquire barriers, the woken up task
>   *    could read stale data
>   *
>   * Thread A
> @@ -116,7 +134,7 @@ struct posix_msg_tree_node {
>   *
>   * Solution: use _release and _acquire barriers.
>   *
> - * 3) There is intentionally no barrier when setting current->state
> + * 4) There is intentionally no barrier when setting current->state
>   *    to TASK_INTERRUPTIBLE: spin_unlock(&info->lock) provides the
>   *    release memory barrier, and the wakeup is triggered when holding
>   *    info->lock, i.e. spin_lock(&info->lock) provided a pairing
> @@ -1005,11 +1023,9 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
>                    struct ext_wait_queue *this)
>  {
>      list_del(&this->list);
> -    get_task_struct(this->task);
> -
> +    wake_q_add(wake_q, this->task);
>      /* see MQ_BARRIER for purpose/pairing */
>      smp_store_release(&this->state, STATE_READY);
> -    wake_q_add_safe(wake_q, this->task);
>  }
> 
>  /* pipelined_send() - send a message directly to the task waiting in
> 

-- 
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany

HRB 36809, AG Nürnberg
Geschäftsführer: Felix Imendörffer



* Re: [PATCH] ipc/mqueue: Avoid relying on a stack reference past its expiry
  2021-05-05  7:49   ` Varad Gautam
@ 2021-05-05 15:11     ` Davidlohr Bueso
  2021-05-05 15:36       ` Varad Gautam
  0 siblings, 1 reply; 10+ messages in thread
From: Davidlohr Bueso @ 2021-05-05 15:11 UTC (permalink / raw)
  To: Varad Gautam
  Cc: linux-kernel, Matthias von Faber, Christian Brauner,
	Oleg Nesterov, Eric W. Biederman, Manfred Spraul, Andrew Morton

On 2021-05-05 00:49, Varad Gautam wrote:
> The race here really is about the lifetime of __pipelined_op's `this`
> argument only being guaranteed for some duration of the call (ie, until
> someone sets ->state = STATE_READY). It is not about when wake_q
> addition happens, as long as it is being fed a valid task_struct.

Again, it's all about ensuring that STATE_READY is set last: the blocked
task has no business returning in the first place, which makes both
races (exit and the one reported here) similar in that both end up using
bogus memory.

...

> I considered that initially, but given that the race isn't connected
> with wakeup, I preferred the current approach which makes this fact
> explicit by showing what's valid/invalid during __pipelined_op.

I understand your point, but this is why I updated the ordering
comments. Furthermore there is no reason to decouple the task's refcount
from the wake_q_add operation; it just makes the code weird and harder
to follow.

Thanks,
Davidlohr


* Re: [PATCH] ipc/mqueue: Avoid relying on a stack reference past its expiry
  2021-05-05 15:11     ` Davidlohr Bueso
@ 2021-05-05 15:36       ` Varad Gautam
  2021-05-05 16:24         ` Davidlohr Bueso
  0 siblings, 1 reply; 10+ messages in thread
From: Varad Gautam @ 2021-05-05 15:36 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, Matthias von Faber, Christian Brauner,
	Oleg Nesterov, Eric W. Biederman, Manfred Spraul, Andrew Morton

On 5/5/21 5:11 PM, Davidlohr Bueso wrote:
> On 2021-05-05 00:49, Varad Gautam wrote:
>> The race here really is about the lifetime of __pipelined_op's `this`
>> argument only being guaranteed for some duration of the call (ie, until
>> someone sets ->state = STATE_READY). It is not about when wake_q
>> addition happens, as long as it is being fed a valid task_struct.
> 
> Again, it's all about ensuring that STATE_READY is set last, the blocked
> task has no business returning in the first place, making both races (exit and
> the one reported here) similar by ending up using bogus memory.
> 
> ...
> 
>> I considered that initially, but given that the race isn't connected
>> with wakeup, I preferred the current approach which makes this fact
>> explicit by showing what's valid/invalid during __pipelined_op.
> 
> I understand your point, but this is why I updated the ordering comments. Furthermore
> there is no reason to decouple the task's refcount from the wake_q_add operation; it
> just makes the code weird and harder to follow.
> 

Different tastes I guess - I'd avoid an additional case to account for in the
MQ_BARRIER comment block and have the __pipelined_op code describe itself,
which avoids some back and forth while reading.

If you're still unconvinced, I'll send out a v2 w/ wake_q_add called before
smp_store_release.

Thanks,
Varad

> Thanks,
> Davidlohr
> 




* Re: [PATCH] ipc/mqueue: Avoid relying on a stack reference past its expiry
  2021-05-05 15:36       ` Varad Gautam
@ 2021-05-05 16:24         ` Davidlohr Bueso
  2021-05-05 17:09           ` Davidlohr Bueso
  0 siblings, 1 reply; 10+ messages in thread
From: Davidlohr Bueso @ 2021-05-05 16:24 UTC (permalink / raw)
  To: Varad Gautam
  Cc: linux-kernel, Matthias von Faber, Christian Brauner,
	Oleg Nesterov, Eric W. Biederman, Manfred Spraul, Andrew Morton

On 2021-05-05 08:36, Varad Gautam wrote:
> If you're still unconvinced, I'll send out a v2 w/ wake_q_add called
> before smp_store_release.

Yeah, please send a v2, I believe this is the right way to fix the issue.

Thanks,
Davidlohr


* Re: [PATCH] ipc/mqueue: Avoid relying on a stack reference past its expiry
  2021-05-05 16:24         ` Davidlohr Bueso
@ 2021-05-05 17:09           ` Davidlohr Bueso
  0 siblings, 0 replies; 10+ messages in thread
From: Davidlohr Bueso @ 2021-05-05 17:09 UTC (permalink / raw)
  To: Varad Gautam
  Cc: linux-kernel, Matthias von Faber, Christian Brauner,
	Oleg Nesterov, Eric W. Biederman, Manfred Spraul, Andrew Morton

On 2021-05-05 09:24, Davidlohr Bueso wrote:
> On 2021-05-05 08:36, Varad Gautam wrote:
>> If you're still unconvinced, I'll send out a v2 w/ wake_q_add called
>> before smp_store_release.
> 
> Yeah, please send a v2, I believe this is the right way to fix the issue.

Also, it would be good to add:

Cc: <stable@vger.kernel.org> # 5.6

Thanks,
Davidlohr


* Re: [PATCH] ipc/mqueue: Avoid relying on a stack reference past its expiry
  2021-05-04 15:55 [PATCH] ipc/mqueue: Avoid relying on a stack reference past its expiry Varad Gautam
  2021-05-04 21:53 ` Davidlohr Bueso
@ 2021-05-08 17:12 ` Manfred Spraul
  2021-05-10 10:38   ` Varad Gautam
  1 sibling, 1 reply; 10+ messages in thread
From: Manfred Spraul @ 2021-05-08 17:12 UTC (permalink / raw)
  To: Varad Gautam, linux-kernel
  Cc: Matthias von Faber, Christian Brauner, Oleg Nesterov,
	Eric W. Biederman, Andrew Morton, Davidlohr Bueso

[-- Attachment #1: Type: text/plain, Size: 1977 bytes --]

Hi Varad,

On 5/4/21 5:55 PM, Varad Gautam wrote:
> do_mq_timedsend::__pipelined_op() should not dereference `this` after
> setting STATE_READY, as the receiver counterpart is now free to return.
> Change __pipelined_op to call wake_q_add_safe on the receiver's
> task_struct returned by get_task_struct, instead of dereferencing
> `this` which sits on the receiver's stack.
Correct. I was so concentrated on the risks of reordered memory that I
overlooked the simple bug.
> Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Actually, sem.c and msg.c contain the same bug. Thus all three must be 
fixed.
> Signed-off-by: Varad Gautam <varad.gautam@suse.com>
> Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Manfred Spraul <manfred@colorfullife.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Davidlohr Bueso <dbueso@suse.de>
>
> ---
>   ipc/mqueue.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> index 8031464ed4ae2..8f78057c6be53 100644
> --- a/ipc/mqueue.c
> +++ b/ipc/mqueue.c
> @@ -1004,12 +1004,14 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
>   				  struct mqueue_inode_info *info,
>   				  struct ext_wait_queue *this)
>   {
> +	struct task_struct *t;
> +
>   	list_del(&this->list);
> -	get_task_struct(this->task);
> +	t = get_task_struct(this->task);
>   
>   	/* see MQ_BARRIER for purpose/pairing */
>   	smp_store_release(&this->state, STATE_READY);
> -	wake_q_add_safe(wake_q, this->task);
> +	wake_q_add_safe(wake_q, t);
>   }

The change fixes the issue, but I would prefer to use t = this->task
instead of using the return value of get_task_struct(); then all
wake_q_add_safe() users are identical.

Ok for you?

Slightly tested patch attached.

--

     Manfred


[-- Attachment #2: 0001-ipc-sem.c-mqueue.c-msg.c-Fix-incorrect-wake_q_add_sa.patch --]
[-- Type: text/x-patch, Size: 4005 bytes --]

From 06f91b4bbc440e9509c85acb7be1b15388b7bc0f Mon Sep 17 00:00:00 2001
From: Manfred Spraul <manfred@colorfullife.com>
Date: Sat, 8 May 2021 18:41:30 +0200
Subject: [PATCH] ipc/sem.c, mqueue.c, msg.c: Fix incorrect wake_q_add_safe()

The wakeup code used by ipc contains a potential use-after-free:
when the code was modified to use wake_q_add_safe(), the task struct
pointer was not copied into a local variable before the
smp_store_release().

Solution: Add local variables to the affected functions.

Result: ipc is now using the same approach as kernel/futex.c
and kernel/locking/rwsem.c.

Note: No need to use READ_ONCE(), as smp_store_release() contains
a barrier(), thus the compiler cannot reread ptr->task.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Cc: Varad Gautam <varad.gautam@suse.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
---
 ipc/mqueue.c | 12 +++++++++---
 ipc/msg.c    |  6 ++++--
 ipc/sem.c    |  6 ++++--
 3 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 8031464ed4ae..838c4f24a337 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -78,7 +78,11 @@ struct posix_msg_tree_node {
  * MQ_BARRIER:
  * To achieve proper release/acquire memory barrier pairing, the state is set to
  * STATE_READY with smp_store_release(), and it is read with READ_ONCE followed
- * by smp_acquire__after_ctrl_dep(). In addition, wake_q_add_safe() is used.
+ * by smp_acquire__after_ctrl_dep(). Immediately after the smp_store_release(),
+ * the struct ext_wait_queue can go out of scope. Thus the task struct pointer
+ * is copied into a local variable. The wakeup is performed using
+ * wake_q_add_safe().
+ *
  *
  * This prevents the following races:
  *
@@ -1004,12 +1008,14 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
 				  struct mqueue_inode_info *info,
 				  struct ext_wait_queue *this)
 {
+	struct task_struct *task = this->task;
+
 	list_del(&this->list);
-	get_task_struct(this->task);
+	get_task_struct(task);
 
 	/* see MQ_BARRIER for purpose/pairing */
 	smp_store_release(&this->state, STATE_READY);
-	wake_q_add_safe(wake_q, this->task);
+	wake_q_add_safe(wake_q, task);
 }
 
 /* pipelined_send() - send a message directly to the task waiting in
diff --git a/ipc/msg.c b/ipc/msg.c
index acd1bc7af55a..d273482b71ea 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -251,11 +251,13 @@ static void expunge_all(struct msg_queue *msq, int res,
 	struct msg_receiver *msr, *t;
 
 	list_for_each_entry_safe(msr, t, &msq->q_receivers, r_list) {
-		get_task_struct(msr->r_tsk);
+		struct task_struct *task = msr->r_tsk;
+
+		get_task_struct(task);
 
 		/* see MSG_BARRIER for purpose/pairing */
 		smp_store_release(&msr->r_msg, ERR_PTR(res));
-		wake_q_add_safe(wake_q, msr->r_tsk);
+		wake_q_add_safe(wake_q, task);
 	}
 }
 
diff --git a/ipc/sem.c b/ipc/sem.c
index e0ec239680cb..04700a823e79 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -784,12 +784,14 @@ static int perform_atomic_semop(struct sem_array *sma, struct sem_queue *q)
 static inline void wake_up_sem_queue_prepare(struct sem_queue *q, int error,
 					     struct wake_q_head *wake_q)
 {
-	get_task_struct(q->sleeper);
+	struct task_struct *task = q->sleeper;
+
+	get_task_struct(task);
 
 	/* see SEM_BARRIER_2 for purpose/pairing */
 	smp_store_release(&q->status, error);
 
-	wake_q_add_safe(wake_q, q->sleeper);
+	wake_q_add_safe(wake_q, task);
 }
 
 static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
-- 
2.30.2



* Re: [PATCH] ipc/mqueue: Avoid relying on a stack reference past its expiry
  2021-05-04 21:53 ` Davidlohr Bueso
  2021-05-05  7:49   ` Varad Gautam
@ 2021-05-08 17:21   ` Manfred Spraul
  1 sibling, 0 replies; 10+ messages in thread
From: Manfred Spraul @ 2021-05-08 17:21 UTC (permalink / raw)
  To: Davidlohr Bueso, Varad Gautam
  Cc: linux-kernel, Matthias von Faber, Christian Brauner,
	Oleg Nesterov, Eric W. Biederman, Andrew Morton

Hi Davidlohr,

On 5/4/21 11:53 PM, Davidlohr Bueso wrote:
> @@ -1005,11 +1023,9 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
>                    struct ext_wait_queue *this)
>  {
>      list_del(&this->list);
> -    get_task_struct(this->task);
> -
> +    wake_q_add(wake_q, this->task);
>      /* see MQ_BARRIER for purpose/pairing */
>      smp_store_release(&this->state, STATE_READY);
> -    wake_q_add_safe(wake_q, this->task);
>  }
>
>  /* pipelined_send() - send a message directly to the task waiting in

This can result in lost wakeups:
wake_q_add() wakes up this->task, and the task can then read this->state
before the smp_store_release() writes STATE_READY.

I would really prefer that we keep kernel/futex.c and the ipc code as
similar as possible:
The futex slow path, ipc/sem.c, ipc/msg.c and ipc/mqueue.c all follow
the same scheme, thus the code should be the same as well.
[a task goes to sleep, another task wakes it up, the woken task doesn't
acquire any locks, and the wake_q framework is used]

--

     Manfred



* Re: [PATCH] ipc/mqueue: Avoid relying on a stack reference past its expiry
  2021-05-08 17:12 ` Manfred Spraul
@ 2021-05-10 10:38   ` Varad Gautam
  0 siblings, 0 replies; 10+ messages in thread
From: Varad Gautam @ 2021-05-10 10:38 UTC (permalink / raw)
  To: Manfred Spraul, linux-kernel
  Cc: Matthias von Faber, Christian Brauner, Oleg Nesterov,
	Eric W. Biederman, Andrew Morton, Davidlohr Bueso

Hey Manfred,

On 5/8/21 7:12 PM, Manfred Spraul wrote:
> Hi Varad,
> 
> On 5/4/21 5:55 PM, Varad Gautam wrote:
>> do_mq_timedsend::__pipelined_op() should not dereference `this` after
>> setting STATE_READY, as the receiver counterpart is now free to return.
>> Change __pipelined_op to call wake_q_add_safe on the receiver's
>> task_struct returned by get_task_struct, instead of dereferencing
>> `this` which sits on the receiver's stack.
> Correct. I was so focused on the risks of memory reordering that I overlooked the simple bug.
>> Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
> Actually, sem.c and msg.c contain the same bug. Thus all three must be fixed.

You're right, it's the same usage pattern.

>> Signed-off-by: Varad Gautam <varad.gautam@suse.com>
>> Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
>> Cc: Christian Brauner <christian.brauner@ubuntu.com>
>> Cc: Oleg Nesterov <oleg@redhat.com>
>> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
>> Cc: Manfred Spraul <manfred@colorfullife.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Davidlohr Bueso <dbueso@suse.de>
>>
>> ---
>>   ipc/mqueue.c | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
>> index 8031464ed4ae2..8f78057c6be53 100644
>> --- a/ipc/mqueue.c
>> +++ b/ipc/mqueue.c
>> @@ -1004,12 +1004,14 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
>>                     struct mqueue_inode_info *info,
>>                     struct ext_wait_queue *this)
>>   {
>> +    struct task_struct *t;
>> +
>>       list_del(&this->list);
>> -    get_task_struct(this->task);
>> +    t = get_task_struct(this->task);
>>         /* see MQ_BARRIER for purpose/pairing */
>>       smp_store_release(&this->state, STATE_READY);
>> -    wake_q_add_safe(wake_q, this->task);
>> +    wake_q_add_safe(wake_q, t);
>>   }
> 
> The change fixes the issue, but I would prefer to use t = this->task instead of using the return value of get_task_struct():
> Then all wake_q_add_safe() users are identical.
> 
> Ok for you?
> 
> Slightly tested patch attached.

Thanks, I've sent out a v4 at [1] integrating sem.c/msg.c. Note that I
went with context-local naming and used the return value of
get_task_struct(), as I don't see much difference either way.

[1] https://lore.kernel.org/lkml/20210510102950.12551-1-varad.gautam@suse.com/

Thanks,
Varad

> 
> -- 
> 
>     Manfred
> 

-- 
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany

HRB 36809, AG Nürnberg
Geschäftsführer: Felix Imendörffer



end of thread, other threads:[~2021-05-10 10:46 UTC | newest]

Thread overview: 10+ messages
2021-05-04 15:55 [PATCH] ipc/mqueue: Avoid relying on a stack reference past its expiry Varad Gautam
2021-05-04 21:53 ` Davidlohr Bueso
2021-05-05  7:49   ` Varad Gautam
2021-05-05 15:11     ` Davidlohr Bueso
2021-05-05 15:36       ` Varad Gautam
2021-05-05 16:24         ` Davidlohr Bueso
2021-05-05 17:09           ` Davidlohr Bueso
2021-05-08 17:21   ` Manfred Spraul
2021-05-08 17:12 ` Manfred Spraul
2021-05-10 10:38   ` Varad Gautam
