linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [GIT PULL] sigqueue cache fix
@ 2021-06-24  7:13 Ingo Molnar
  2021-06-24 16:29 ` Linus Torvalds
  2021-06-24 16:34 ` pr-tracker-bot
  0 siblings, 2 replies; 12+ messages in thread
From: Ingo Molnar @ 2021-06-24  7:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Thomas Gleixner, Peter Zijlstra, Andrew Morton

Linus,

Please pull the latest core/urgent git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core-urgent-2021-06-24

   # HEAD: 399f8dd9a866e107639eabd3c1979cd526ca3a98 signal: Prevent sigqueue caching after task got released

Fix a memory leak in the recently introduced sigqueue cache.

 Thanks,

	Ingo

------------------>
Thomas Gleixner (1):
      signal: Prevent sigqueue caching after task got released


 kernel/signal.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index f7c6ffcbd044..f1ecd8f0c11d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -435,6 +435,12 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
 		 * Preallocation does not hold sighand::siglock so it can't
 		 * use the cache. The lockless caching requires that only
 		 * one consumer and only one producer run at a time.
+		 *
+		 * For the regular allocation case it is sufficient to
+		 * check @q for NULL because this code can only be called
+		 * if the target task @t has not been reaped yet; which
+		 * means this code can never observe the error pointer which is
+		 * written to @t->sigqueue_cache in exit_task_sigqueue_cache().
 		 */
 		q = READ_ONCE(t->sigqueue_cache);
 		if (!q || sigqueue_flags)
@@ -463,13 +469,18 @@ void exit_task_sigqueue_cache(struct task_struct *tsk)
 	struct sigqueue *q = tsk->sigqueue_cache;
 
 	if (q) {
-		tsk->sigqueue_cache = NULL;
 		/*
 		 * Hand it back to the cache as the task might
 		 * be self reaping which would leak the object.
 		 */
 		 kmem_cache_free(sigqueue_cachep, q);
 	}
+
+	/*
+	 * Set an error pointer to ensure that @tsk will not cache a
+	 * sigqueue when it is reaping it's child tasks
+	 */
+	tsk->sigqueue_cache = ERR_PTR(-1);
 }
 
 static void sigqueue_cache_or_free(struct sigqueue *q)
@@ -481,6 +492,10 @@ static void sigqueue_cache_or_free(struct sigqueue *q)
 	 * is intentional when run without holding current->sighand->siglock,
 	 * which is fine as current obviously cannot run __sigqueue_free()
 	 * concurrently.
+	 *
+	 * The NULL check is safe even if current has been reaped already,
+	 * in which case exit_task_sigqueue_cache() wrote an error pointer
+	 * into current->sigqueue_cache.
 	 */
 	if (!READ_ONCE(current->sigqueue_cache))
 		WRITE_ONCE(current->sigqueue_cache, q);

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] sigqueue cache fix
  2021-06-24  7:13 [GIT PULL] sigqueue cache fix Ingo Molnar
@ 2021-06-24 16:29 ` Linus Torvalds
  2021-06-27 18:52   ` Linus Torvalds
  2021-06-24 16:34 ` pr-tracker-bot
  1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2021-06-24 16:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton

On Thu, Jun 24, 2021 at 12:13 AM Ingo Molnar <mingo@kernel.org> wrote:
>
> Fix a memory leak in the recently introduced sigqueue cache.

Side note: I detest the READ_ONCE/WRITE_ONCE games for the sigqueue
cache thing. I also think it's actively buggy.

The sigqueue cache is only safe when it's serialized by the sighand
lock, so all those games with "ONCE" accesses are entirely bogus and
misleading.

It looks like this is due to __sigqueue_alloc() being called without
the lock held (in which case sigqueue_flags will be SIGQUEUE_PREALLOC)
and then it does this:

                q = READ_ONCE(t->sigqueue_cache);
                if (!q || sigqueue_flags)
                        q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);

where the point is that it reads that sigqueue_cache thing even if
sigqueue_flags is non-zero (no lock), so presumably KCSAN complained
about it.

However, hiding it from KCSAN (instead of just test sigqueue_flags
*first* - which admittedly would presumably need some kind of
serialization) is not only ugly, I think it's hiding the *real* bug,
which is

  release_posix_timer ->
    sigqueue_free ->
      __sigqueue_free ->
      sigqueue_cache_or_free

where that sigqueue_free() thing explicitly calls __sigqueue_free()
*outside* the sighand lock. End result: I don't think that cache is
serialized at all.

So I think the sigqueue cache is still potentially quite buggy, and I
think the bug is hidden by the READ/WRITE_ONCE games that are
misleading and not actually valid.

But maybe there is something I'm missing that makes that
release_posix_timer case ok. Or maybe I'm misunderstanding that code
entirely. But it does look bad to me. And the ONCE accesses look
positively wrong in every single possible way: they aren't logical,
and they are actively hindering KCSAN rather than fixing anything at
all.

              Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] sigqueue cache fix
  2021-06-24  7:13 [GIT PULL] sigqueue cache fix Ingo Molnar
  2021-06-24 16:29 ` Linus Torvalds
@ 2021-06-24 16:34 ` pr-tracker-bot
  1 sibling, 0 replies; 12+ messages in thread
From: pr-tracker-bot @ 2021-06-24 16:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton

The pull request you sent on Thu, 24 Jun 2021 09:13:05 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core-urgent-2021-06-24

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/7749b0337b4e92d83f7e04b86434dcf4fe531377

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] sigqueue cache fix
  2021-06-24 16:29 ` Linus Torvalds
@ 2021-06-27 18:52   ` Linus Torvalds
  2021-06-27 20:40     ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2021-06-27 18:52 UTC (permalink / raw)
  To: Ingo Molnar, Christian Brauner, Oleg Nesterov
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton

[ Adding Christian and Oleg to participants ]

On Thu, Jun 24, 2021 at 9:29 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I think the sigqueue cache is still potentially quite buggy, and I
> think the bug is hidden by the READ/WRITE_ONCE games that are
> misleading and not actually valid.

Guys, I haven't heard any reaction to this. Any comments?

Because the more I look at it, the stranger it looks.

In particular: the code in sigqueue_cache_or_free() is a simple

        if (!READ_ONCE(current->sigqueue_cache))
                WRITE_ONCE(current->sigqueue_cache, q);

and it is documented to be safe because "current" is obviously single-threaded.

Except that documented "obviously" is not so obvious at all. Yes,
"current" is single-threaded, but only in task context. You can still
have interrupts etc that see that same "current" that happen
concurrently.

So it's not at all obviously safe. It *may* be safe, but it worries me.

It worries me _particularly_ with exactly this commit 399f8dd9a866
("signal: Prevent sigqueue caching after task got released").

Why? Because the alleged path is release_task() -> __exit_signal() ->
exit_task_sigqueue_cache(). And by the time "release_task()" runs,
that task it releases shouldn't be running. So how can release_task()
race with this logic in sigqueue_cache_or_free()?

IOW how can that change in commit 399f8dd9a866 _possibly_ fix
anything? That would seem to be a serious problem if it's actually the
case..

So I think

 (a) sigqueue_cache_or_free() is fine only if no signal is ever
released from interrupt/bh context.

 (b) commit 399f8dd9a866 looks dodgy to me - could we really ever do
"release_task(current)" without it being a huge bug?

Anyway, trying to really distill the logic of the sigqueue_cache, I've
come up with

 - sigqueue_cache_or_free() only does something if saw NULL (and will
turn it non-NULL)

 - __sigqueue_alloc() only does something if it saw a non-NULL value
(and will turn it NULL)

so they can't race with each other, because their initial values are disjoint.

So then we have

 (a) sigqueue_cache_or_free() allegedly cannot race with itself
because of "current".

 (b) __sigqueue_alloc() allegedly cannot race with itself because of
sighand->siglock

Now (b) I will actually believe, because __sigqueue_alloc() has only
two callers, and the first one actually has

        assert_spin_locked(&t->sighand->siglock);

and the second one passes SIGQUEUE_PREALLOC as sigqueue_flags, and
that will force it to not touch sigqueue_cache at all.

So I think __sigqueue_alloc() is ok.

Which makes me really suspect that (a) is the problem here.

Looking at what calls __sigqueue_free() -> sigqueue_cache_or_free(), we have:

 - flush_sigqueue

 - flush_itimer_signals() -> __flush_itimer_signals()

 - dequeue_signal() -> __dequeue_signal -> collect_signal()

 - get_signal() -> dequeue_synchronous_signal() (and dequeue_signal())

 - send_signal() -> __send_signal() -> prepare_signal() -> flush_sigqueue_mask()

 - kill_pid_usb_asyncio() -> __send_signal() -> ..

 - do_notify_parent() -> __send_signal() -> ..

 - send_sigqueue() -> prepare_signal() -> flush_sigqueue_mask()

 - kernel_sigaction() -> flush_sigqueue_mask()

 - sigqueue_free()

so there's a lot of things that can get into sigqueue_cache_or_free(),
and it's worth noting that that path does *NOT* serialize on the
sighand->siglock, but expressly purely on "current" being
single-threaded (and 'current' has nothing to do with sighand->siglock
anyway, the sighand lock is for the target of the signal, not the
source of it).

At at least that send_signal() path is very much called from
interrupts (ie timers etc).

Hmm?

Ok, I may have confused myself looking at all this, but it does all
make me think this is dodgy.

                  Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] sigqueue cache fix
  2021-06-27 18:52   ` Linus Torvalds
@ 2021-06-27 20:40     ` Linus Torvalds
  2021-06-28  5:14       ` Ingo Molnar
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2021-06-27 20:40 UTC (permalink / raw)
  To: Ingo Molnar, Christian Brauner, Oleg Nesterov
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton

On Sun, Jun 27, 2021 at 11:52 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok, I may have confused myself looking at all this, but it does all
> make me think this is dodgy.

I also couldn't convince myself that the memory ordering is correct
for the _contents_ of the sigqueue entry that had its pointer cached,
although I suspect that is purely a theoretical concern (certainly a
non-issue on x86).

So I've reverted the sigqueue cache code, in that I haven't heard
anything back and I'm not going to delay 5.13 over something small and
easily undone like this.

It could be that my worries about this code are all entirely wrong,
and I'm missing something, in which case you can call me names and we
can reinstate the code with my apologies.

Anyway, I tried to distill my thoughts about what is going on - and
what the solutions might be - in the revert commit b4b27b9eed8e.

Again, maybe it's just me that is confused. But reverting seemed to be
the safest thing to do considering the timing of this all.

           Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] sigqueue cache fix
  2021-06-27 20:40     ` Linus Torvalds
@ 2021-06-28  5:14       ` Ingo Molnar
  2021-06-28  5:22         ` Ingo Molnar
  2021-06-28 17:06         ` Linus Torvalds
  0 siblings, 2 replies; 12+ messages in thread
From: Ingo Molnar @ 2021-06-28  5:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, Oleg Nesterov, Linux Kernel Mailing List,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, Jun 27, 2021 at 11:52 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Ok, I may have confused myself looking at all this, but it does all
> > make me think this is dodgy.
> 
> I also couldn't convince myself that the memory ordering is correct
> for the _contents_ of the sigqueue entry that had its pointer cached,
> although I suspect that is purely a theoretical concern (certainly a
> non-issue on x86).
> 
> So I've reverted the sigqueue cache code, in that I haven't heard
> anything back and I'm not going to delay 5.13 over something small and
> easily undone like this.

I concur that it was the safest to revert this, because it was close to the 
final release.

I think the code is safe, but only by accident. The most critical data race 
isn't well-documented, unless I missed something.

The most fundamental race we can have is this:

      CPU#0

      __sigqueue_alloc()

      [ holds sighand->siglock ]
      [ IRQs off. ]

      q = READ_ONCE(t->sigqueue_cache);
      if (!q || sigqueue_flags)
            q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
      else
            WRITE_ONCE(t->sigqueue_cache, NULL);


                                CPU#1  

                                __sigqueue_free()

                                [ IRQs off. ]

                                if (!READ_ONCE(current->sigqueue_cache))
                                      WRITE_ONCE(current->sigqueue_cache, q);
                                else
                                      kmem_cache_free(sigqueue_cachep, q);

( Let's assume exit_task_sigqueue_cache() happens while there's no new 
  signal sending going on, so that angle is safe. )

Someone confusingly, *alloc() is the consumer and *free() is the producer 
of the sigqueue_cache.

Here's how I see the 3 fundamental races these two pieces of code may have:

 - Producer <-> producer: The producer cannot race with itself, because it 
   only ever produces into current->sigqueue_cache and has interrupts 
   disabled. We don't send signals from NMI context.

 - Consumer <-> consumer: multiple consumers cannot race with themselves, 
   because they serialize on sighand->siglock.

 - Producer <-> consumer: this is the most interesting race, and I think 
   it's unsafe in theory, because the producer doesn't make sure that any 
   previous writes to the actual queue entry (struct sigqueue *q) have 
   reached storage before the new 'free' entry is advertised to consumers.

   So in principle CPU#0 could see a new sigqueue entry and use it, before 
   it's fully freed.

   In *practice* it's probably safe by accident (or by undocumented 
   intent), because there's an atomic op we have shortly before putting the 
   queue entry into the sigqueue_cache, in __sigqueue_free():

         if (atomic_dec_and_test(&q->user->sigpending))
                free_uid(q->user);

   And atomic_dec_and_test() implies a full barrier - although I haven't 
   found the place where we document it and 
   Documentation/memory-ordering.txt is silent on it. We should probably 
   fix that too.

At minimum the patch adding the ->sigqueue_cache should include a 
well-documented race analysis firmly documenting the implicit barrier after 
the atomic_dec_and_test().

Anyway, I agree with the revert.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] sigqueue cache fix
  2021-06-28  5:14       ` Ingo Molnar
@ 2021-06-28  5:22         ` Ingo Molnar
  2021-06-28  5:30           ` Ingo Molnar
  2021-06-28 17:06         ` Linus Torvalds
  1 sibling, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2021-06-28  5:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, Oleg Nesterov, Linux Kernel Mailing List,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton


* Ingo Molnar <mingo@kernel.org> wrote:

>  - Producer <-> consumer: this is the most interesting race, and I think 
>    it's unsafe in theory, because the producer doesn't make sure that any 
>    previous writes to the actual queue entry (struct sigqueue *q) have 
>    reached storage before the new 'free' entry is advertised to consumers.
> 
>    So in principle CPU#0 could see a new sigqueue entry and use it, before 
>    it's fully freed.
> 
>    In *practice* it's probably safe by accident (or by undocumented 
>    intent), because there's an atomic op we have shortly before putting the 
>    queue entry into the sigqueue_cache, in __sigqueue_free():
> 
>          if (atomic_dec_and_test(&q->user->sigpending))
>                 free_uid(q->user);
> 
>    And atomic_dec_and_test() implies a full barrier - although I haven't 
>    found the place where we document it and 
>    Documentation/memory-ordering.txt is silent on it. We should probably 
>    fix that too.
> 
> At minimum the patch adding the ->sigqueue_cache should include a 
> well-documented race analysis firmly documenting the implicit barrier after 
> the atomic_dec_and_test().

I just realized that even with that implicit full barrier it's not safe: 
the producer uses q->user after the atomic_dec_and_test(). That access is 
not serialized with the later write to ->sigqueue_cache - and another CPU 
might see that entry and use the ->sigqueue_cache and corrupt q->user ...

So I think this code might have a real race on LL/SC platforms and we'll 
need an smp_mb() in sigqueue_cache_or_free()?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] sigqueue cache fix
  2021-06-28  5:22         ` Ingo Molnar
@ 2021-06-28  5:30           ` Ingo Molnar
  0 siblings, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2021-06-28  5:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, Oleg Nesterov, Linux Kernel Mailing List,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
> >  - Producer <-> consumer: this is the most interesting race, and I think 
> >    it's unsafe in theory, because the producer doesn't make sure that any 
> >    previous writes to the actual queue entry (struct sigqueue *q) have 
> >    reached storage before the new 'free' entry is advertised to consumers.
> > 
> >    So in principle CPU#0 could see a new sigqueue entry and use it, before 
> >    it's fully freed.
> > 
> >    In *practice* it's probably safe by accident (or by undocumented 
> >    intent), because there's an atomic op we have shortly before putting the 
> >    queue entry into the sigqueue_cache, in __sigqueue_free():
> > 
> >          if (atomic_dec_and_test(&q->user->sigpending))
> >                 free_uid(q->user);
> > 
> >    And atomic_dec_and_test() implies a full barrier - although I haven't 
> >    found the place where we document it and 
> >    Documentation/memory-ordering.txt is silent on it. We should probably 
> >    fix that too.
> > 
> > At minimum the patch adding the ->sigqueue_cache should include a 
> > well-documented race analysis firmly documenting the implicit barrier after 
> > the atomic_dec_and_test().
> 
> I just realized that even with that implicit full barrier it's not safe: 
> the producer uses q->user after the atomic_dec_and_test(). That access is 
> not serialized with the later write to ->sigqueue_cache - and another CPU 
> might see that entry and use the ->sigqueue_cache and corrupt q->user ...
> 
> So I think this code might have a real race on LL/SC platforms and we'll 
> need an smp_mb() in sigqueue_cache_or_free()?

pps. free_uid() happens to have an implicit barrier via 
refcount_dec_and_lock_irqsave():

  void free_uid(struct user_struct *up)
  {
        unsigned long flags;

        if (!up)
                return;

        if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
                free_user(up, flags);


So the q->user read in __sigqueue_free() appears to be implicitly 
serialized by free_uid() with the later write of 'q' to ->sigqueue_cache.

This needs to be robustly documented though.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] sigqueue cache fix
  2021-06-28  5:14       ` Ingo Molnar
  2021-06-28  5:22         ` Ingo Molnar
@ 2021-06-28 17:06         ` Linus Torvalds
  2021-06-28 18:46           ` Ingo Molnar
  1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2021-06-28 17:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christian Brauner, Oleg Nesterov, Linux Kernel Mailing List,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton

On Sun, Jun 27, 2021 at 10:14 PM Ingo Molnar <mingo@kernel.org> wrote:
>
> The most fundamental race we can have is this:

No. It's this (all on the same CPU):

   sigqueue_cache_or_free():

       if (!READ_ONCE(current->sigqueue_cache))
                     <-- Interrupt happens here
               WRITE_ONCE(current->sigqueue_cache, q);

and then the interrupt sends a SIGCONT, which ends up flushing
previous process control signals, which ends up freeing them, which
ends up in sigqueue_cache_or_free() again, at which point you have

       if (!READ_ONCE(current->sigqueue_cache))
               WRITE_ONCE(current->sigqueue_cache, q);

again.

And both the original and the interrupting one sees a NULL
current->sigqueue_cache, so both of them will do that WRITE_ONCE(),
and when the interrupt returns, the original case will overwrite the
value that the interrupt free'd.

Boom - memory leak.

It does seem to be very small race window, and it's "only" a memory
leak, but it's a very simple example of how this cache was broken even
on UP.

              Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] sigqueue cache fix
  2021-06-28 17:06         ` Linus Torvalds
@ 2021-06-28 18:46           ` Ingo Molnar
  2021-06-28 19:02             ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2021-06-28 18:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, Oleg Nesterov, Linux Kernel Mailing List,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, Jun 27, 2021 at 10:14 PM Ingo Molnar <mingo@kernel.org> wrote:
> >
> > The most fundamental race we can have is this:
> 
> No. It's this (all on the same CPU):
> 
>    sigqueue_cache_or_free():
> 
>        if (!READ_ONCE(current->sigqueue_cache))
>                      <-- Interrupt happens here
>                WRITE_ONCE(current->sigqueue_cache, q);

Indeed - I was under the impression that this cannot happen, because 
interrupts are disabled - but I was wrong:

__sigqueue_free() is the only user of sigqueue_cache_or_free().

Callers of __sigqueue_free():

 - flush_sigqueue():
    # flush_signals() is holding the siglock & disables IRQs
    # __exit_signal() isn't holding the siglock but has IRQs disabled
    # selinux_bprm_committed_creds() is holding the siglock & disables IRQs

 - __flush_itimer_signals()
    # Its single caller is holding the siglock & disables IRQs

 - collect_signal()
    # Its single caller is holding the siglock & disables IRQs

 - dequeue_synchronous_signal()
    # Its single caller is holding the siglock & disables IRQs

 - flush_sigqueue_mask():
    # All callers are holding the siglock & disable IRQs

 - sigqueue_free()
    ...    

Boom, the last one on the list, sigqueue_free(), does __sigqueue_free() 
while not holding the siglock and not disabling interrupts. :-/

It does it in various syscall paths in the POSIX timers code through 
release_posix_timer(), with interrupts clearly enabled.

> and then the interrupt sends a SIGCONT, which ends up flushing
> previous process control signals, which ends up freeing them, which
> ends up in sigqueue_cache_or_free() again, at which point you have
> 
>        if (!READ_ONCE(current->sigqueue_cache))
>                WRITE_ONCE(current->sigqueue_cache, q);
> 
> again.
> 
> And both the original and the interrupting one sees a NULL
> current->sigqueue_cache, so both of them will do that WRITE_ONCE(),
> and when the interrupt returns, the original case will overwrite the
> value that the interrupt free'd.
> 
> Boom - memory leak.
> 
> It does seem to be very small race window, and it's "only" a memory
> leak, but it's a very simple example of how this cache was broken even
> on UP.

Yeah - a clear Producer <-> Producer IRQ preemptability race that can leak 
freed sigqueue structures.

Thanks for catching this ...

But even if release_posix_timer() is changed to call sigqueue_free() with 
IRQs disabled, or sigqueue_free() disables interrupts itself, I think we 
need to be mindful of the Consumer <-> Producer SMP races, which only 
appear to be safe due to an accidental barrier by free_uid().

Plus a lockdep_assert_irqs_disabled() would have helped a lot in catching 
this sooner.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] sigqueue cache fix
  2021-06-28 18:46           ` Ingo Molnar
@ 2021-06-28 19:02             ` Linus Torvalds
  2021-07-07  9:47               ` Thomas Gleixner
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2021-06-28 19:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christian Brauner, Oleg Nesterov, Linux Kernel Mailing List,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton

On Mon, Jun 28, 2021 at 11:47 AM Ingo Molnar <mingo@kernel.org> wrote:
>
> But even if release_posix_timer() is changed to call sigqueue_free() with
> IRQs disabled, or sigqueue_free() disables interrupts itself, I think we
> need to be mindful of the Consumer <-> Producer SMP races, which only
> appear to be safe due to an accidental barrier by free_uid().

Oh, I agree. The SMP memory ordering issues are subtle and while I
suspect they aren't something that can be triggered in reality, it's
an example of how broken some WRITE_ONCE -> READ_ONCE serialization
is.

I don't mind READ_ONCE/WRITE_ONCE in general, but I absolutely detest
them when they are used to hide KCSAN things, and when they are used
randomly for pointers.

I think they are much better suited for one-time flag things, and
pretty much every time you see a READ_ONCE/WRITE_ONCE you should
likely see a barrier somewhere.

And no, we don't want them to be some subtle barriers that are just
"in the context of this code, we can depend on the barrier in that
other function".

In general, if you have a WRITE_ONCE -> READ_ONCE chain an dnot some
obvious required barrier for other reasons right *there*, it should
almost certainly not be one of those ONCE things at all. It should be
a proper smp_store_release() -> smp_load_acquire(). Those have real -
and on sane hardware cheap - memory orderings.

And as the whole - and only - point of the sigqueue cache was a very
subtle performance and latency issue, I don't think we want to use
locks or atomics. It's why my revert commit suggests re-purposing the
"percpu_cmpxchg()" functionality: that would likely be a good model at
least for x86 and arm.

But while we have "percpu_cmpxchg()", I don't think we currently don't
really have that kind of operation where it

 (a) works on a non-percpu variable (but we can force casts, I guess)

 (b) has "acquire" semantics

We do have the *atomic* cmpxchg with acquire semantics, but atomics
are rather more expensive than we'd probably really want.

                 Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] sigqueue cache fix
  2021-06-28 19:02             ` Linus Torvalds
@ 2021-07-07  9:47               ` Thomas Gleixner
  0 siblings, 0 replies; 12+ messages in thread
From: Thomas Gleixner @ 2021-07-07  9:47 UTC (permalink / raw)
  To: Linus Torvalds, Ingo Molnar
  Cc: Christian Brauner, Oleg Nesterov, Linux Kernel Mailing List,
	Peter Zijlstra, Andrew Morton

Linus,

On Mon, Jun 28 2021 at 12:02, Linus Torvalds wrote:
> And as the whole - and only - point of the sigqueue cache was a very
> subtle performance and latency issue, I don't think we want to use
> locks or atomics. It's why my revert commit suggests re-purposing the
> "percpu_cmpxchg()" functionality: that would likely be a good model at
> least for x86 and arm.
>
> But while we have "percpu_cmpxchg()", I don't think we currently don't
> really have that kind of operation where it
>
>  (a) works on a non-percpu variable (but we can force casts, I guess)
>
>  (b) has "acquire" semantics
>
> We do have the *atomic* cmpxchg with acquire semantics, but atomics
> are rather more expensive than we'd probably really want.

first of all, sorry for being non-responsive on this. I'll have a fresh
look at this with your comments in mind and be more careful.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-07-07  9:47 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-24  7:13 [GIT PULL] sigqueue cache fix Ingo Molnar
2021-06-24 16:29 ` Linus Torvalds
2021-06-27 18:52   ` Linus Torvalds
2021-06-27 20:40     ` Linus Torvalds
2021-06-28  5:14       ` Ingo Molnar
2021-06-28  5:22         ` Ingo Molnar
2021-06-28  5:30           ` Ingo Molnar
2021-06-28 17:06         ` Linus Torvalds
2021-06-28 18:46           ` Ingo Molnar
2021-06-28 19:02             ` Linus Torvalds
2021-07-07  9:47               ` Thomas Gleixner
2021-06-24 16:34 ` pr-tracker-bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).