kernel-hardening.lists.openwall.com archive mirror
 help / color / mirror / Atom feed
* Re: [RESEND PATCH v4 05/10] lib/refcount: Improve performance of generic REFCOUNT_FULL code
       [not found] ` <20191121115902.2551-6-will@kernel.org>
@ 2020-02-26  4:10   ` Jann Horn
  2020-02-28 10:43     ` Will Deacon
  0 siblings, 1 reply; 3+ messages in thread
From: Jann Horn @ 2020-02-26  4:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: kernel list, Kees Cook, Ingo Molnar, Elena Reshetova,
	Peter Zijlstra, Ard Biesheuvel, Hanjun Guo, Jan Glauber,
	Kernel Hardening

On Thu, Nov 21, 2019 at 12:58 PM Will Deacon <will@kernel.org> wrote:
> Rewrite the generic REFCOUNT_FULL implementation so that the saturation
> point is moved to INT_MIN / 2. This allows us to defer the sanity checks
> until after the atomic operation, which removes many uses of cmpxchg()
> in favour of atomic_fetch_{add,sub}().

Oh, I never saw this, this is really neat! CCing the kernel-hardening
list on this might've been a good idea.

> + * Saturation semantics
> + * ====================
> + *
> + * refcount_t differs from atomic_t in that the counter saturates at
> + * REFCOUNT_SATURATED and will not move once there. This avoids wrapping the
> + * counter and causing 'spurious' use-after-free issues. In order to avoid the
> + * cost associated with introducing cmpxchg() loops into all of the saturating
> + * operations, we temporarily allow the counter to take on an unchecked value
> + * and then explicitly set it to REFCOUNT_SATURATED on detecting that underflow
> + * or overflow has occurred. Although this is racy when multiple threads
> + * access the refcount concurrently, by placing REFCOUNT_SATURATED roughly
> + * equidistant from 0 and INT_MAX we minimise the scope for error:
> + *
> + *                                INT_MAX     REFCOUNT_SATURATED   UINT_MAX
> + *   0                          (0x7fff_ffff)    (0xc000_0000)    (0xffff_ffff)
> + *   +--------------------------------+----------------+----------------+
> + *                                     <---------- bad value! ---------->
> + *
> + * (in a signed view of the world, the "bad value" range corresponds to
> + * a negative counter value).
[...]
> + * If another thread also performs a refcount_inc() operation between the two
> + * atomic operations, then the count will continue to edge closer to 0. If it
> + * reaches a value of 1 before /any/ of the threads reset it to the saturated
> + * value, then a concurrent refcount_dec_and_test() may erroneously free the
> + * underlying object. Given the precise timing details involved with the
> + * round-robin scheduling of each thread manipulating the refcount and the need
> + * to hit the race multiple times in succession, there doesn't appear to be a
> + * practical avenue of attack even if using refcount_add() operations with
> + * larger increments.

On top of that, the number of threads that can actually be running at
a given time is capped. See include/linux/threads.h, where it is
capped to pow(2, 22):

    /*
     * A maximum of 4 million PIDs should be enough for a while.
     * [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
     */
    #define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
            (sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))

And in the futex UAPI header, we have this, baking a TID limit into
the userspace API (note that this is pow(2,30), not pow(2,29) as the
comment in threads.h claims - I'm not sure where that difference comes
from):

    /*
     * The rest of the robust-futex field is for the TID:
     */
    #define FUTEX_TID_MASK 0x3fffffff

So AFAICS, with the current PID_MAX_LIMIT, if you assume that all
participating refcount operations are non-batched (delta 1) and the
attacker can't cause the threads to oops in the middle of the refcount
operation (maybe that would be possible if you managed to find
something like a NULL pointer dereference in perf software event code
and had perf paranoia at <=1 , or something like that? - I'm not
sure), then even in a theoretical scenario where an attacker spawns
the maximum number of tasks possible and manages to get all of them to
sequentially preempt while being in the middle of increment operations
in several nested contexts (I'm not sure whether that can even happen
- you're not going to take typical sleeping exceptions like page
faults in the middle of a refcount op), the attacker will stay
comfortably inside the saturated range. Even if the PID_MAX_LIMIT is
at some point raised to the maximum permitted by the futex UAPI, this
still holds as long as you assume no nesting. Hm, should I send a
patch to add something like this to the explanatory comment?


Of course, if someone uses refcount batching with sufficiently large
values, those guarantees go out of the window - if we wanted to be
perfectionist about this, we could make the batched operations do slow
cmpxchg stuff while letting the much more performance-critical
single-reference case continue to use the fast saturation scheme.
OTOH, the networking folks would probably hate that, since they're
using the batched ops for ->sk_wmem_alloc stuff, where they count
bytes as references? So I guess maybe we should leave it as-is.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [RESEND PATCH v4 05/10] lib/refcount: Improve performance of generic REFCOUNT_FULL code
  2020-02-26  4:10   ` [RESEND PATCH v4 05/10] lib/refcount: Improve performance of generic REFCOUNT_FULL code Jann Horn
@ 2020-02-28 10:43     ` Will Deacon
  2020-03-02 13:06       ` Jann Horn
  0 siblings, 1 reply; 3+ messages in thread
From: Will Deacon @ 2020-02-28 10:43 UTC (permalink / raw)
  To: Jann Horn
  Cc: kernel list, Kees Cook, Ingo Molnar, Elena Reshetova,
	Peter Zijlstra, Ard Biesheuvel, Hanjun Guo, Jan Glauber,
	Kernel Hardening

Hi Jann,

On Wed, Feb 26, 2020 at 05:10:46AM +0100, Jann Horn wrote:
> On Thu, Nov 21, 2019 at 12:58 PM Will Deacon <will@kernel.org> wrote:
> > Rewrite the generic REFCOUNT_FULL implementation so that the saturation
> > point is moved to INT_MIN / 2. This allows us to defer the sanity checks
> > until after the atomic operation, which removes many uses of cmpxchg()
> > in favour of atomic_fetch_{add,sub}().
> 
> Oh, I never saw this, this is really neat! CCing the kernel-hardening
> list on this might've been a good idea.

Glad you like it! I'll try to remember to cc them in future but
get_maintainer.pl doesn't tend to mention them very often.

> > + * If another thread also performs a refcount_inc() operation between the two
> > + * atomic operations, then the count will continue to edge closer to 0. If it
> > + * reaches a value of 1 before /any/ of the threads reset it to the saturated
> > + * value, then a concurrent refcount_dec_and_test() may erroneously free the
> > + * underlying object. Given the precise timing details involved with the
> > + * round-robin scheduling of each thread manipulating the refcount and the need
> > + * to hit the race multiple times in succession, there doesn't appear to be a
> > + * practical avenue of attack even if using refcount_add() operations with
> > + * larger increments.
> 
> On top of that, the number of threads that can actually be running at
> a given time is capped. See include/linux/threads.h, where it is
> capped to pow(2, 22):
> 
>     /*
>      * A maximum of 4 million PIDs should be enough for a while.
>      * [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
>      */
>     #define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
>             (sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))
> 
> And in the futex UAPI header, we have this, baking a TID limit into
> the userspace API (note that this is pow(2,30), not pow(2,29) as the
> comment in threads.h claims - I'm not sure where that difference comes
> from):
> 
>     /*
>      * The rest of the robust-futex field is for the TID:
>      */
>     #define FUTEX_TID_MASK 0x3fffffff
> 
> So AFAICS, with the current PID_MAX_LIMIT, if you assume that all
> participating refcount operations are non-batched (delta 1) and the
> attacker can't cause the threads to oops in the middle of the refcount
> operation (maybe that would be possible if you managed to find
> something like a NULL pointer dereference in perf software event code
> and had perf paranoia at <=1 , or something like that? - I'm not
> sure), then even in a theoretical scenario where an attacker spawns
> the maximum number of tasks possible and manages to get all of them to
> sequentially preempt while being in the middle of increment operations
> in several nested contexts (I'm not sure whether that can even happen
> - you're not going to take typical sleeping exceptions like page
> faults in the middle of a refcount op), the attacker will stay
> comfortably inside the saturated range. Even if the PID_MAX_LIMIT is
> at some point raised to the maximum permitted by the futex UAPI, this
> still holds as long as you assume no nesting. Hm, should I send a
> patch to add something like this to the explanatory comment?

Sure, I'd be happy to improve the document by adding this -- please send
out a patch for review. It's probably also worth mentioning the batching
use-cases, although I struggle to reason about the window between the
{under,over}flow occuring and saturation. The idea of oopsing during that
period is interesting, although at least that's not silent (and I think
Android usually (always?) sets panic_on_oops to 1.

> Of course, if someone uses refcount batching with sufficiently large
> values, those guarantees go out of the window - if we wanted to be
> perfectionist about this, we could make the batched operations do slow
> cmpxchg stuff while letting the much more performance-critical
> single-reference case continue to use the fast saturation scheme.
> OTOH, the networking folks would probably hate that, since they're
> using the batched ops for ->sk_wmem_alloc stuff, where they count
> bytes as references? So I guess maybe we should leave it as-is.

Agreed. If we hamper the performance here, then people will either look
to disable the checking or they will switch to atomic_t, which puts us
back to square one. Perfection is the enemy of the good and all that.

Having said that, I'd still be keen to learn about any practical attacks
on this in case there is something smart we can do to mitigate them
without cmpxchg(). For example, one silly approach might be to bound the
maximum increment and split larger ones up using a loop.

Will

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [RESEND PATCH v4 05/10] lib/refcount: Improve performance of generic REFCOUNT_FULL code
  2020-02-28 10:43     ` Will Deacon
@ 2020-03-02 13:06       ` Jann Horn
  0 siblings, 0 replies; 3+ messages in thread
From: Jann Horn @ 2020-03-02 13:06 UTC (permalink / raw)
  To: Will Deacon
  Cc: kernel list, Kees Cook, Ingo Molnar, Elena Reshetova,
	Peter Zijlstra, Ard Biesheuvel, Hanjun Guo, Jan Glauber,
	Kernel Hardening

On Fri, Feb 28, 2020 at 11:43 AM Will Deacon <will@kernel.org> wrote:
> On Wed, Feb 26, 2020 at 05:10:46AM +0100, Jann Horn wrote:
> > On Thu, Nov 21, 2019 at 12:58 PM Will Deacon <will@kernel.org> wrote:
> > > + * If another thread also performs a refcount_inc() operation between the two
> > > + * atomic operations, then the count will continue to edge closer to 0. If it
> > > + * reaches a value of 1 before /any/ of the threads reset it to the saturated
> > > + * value, then a concurrent refcount_dec_and_test() may erroneously free the
> > > + * underlying object. Given the precise timing details involved with the
> > > + * round-robin scheduling of each thread manipulating the refcount and the need
> > > + * to hit the race multiple times in succession, there doesn't appear to be a
> > > + * practical avenue of attack even if using refcount_add() operations with
> > > + * larger increments.
> >
> > On top of that, the number of threads that can actually be running at
> > a given time is capped. See include/linux/threads.h, where it is
> > capped to pow(2, 22):
> >
> >     /*
> >      * A maximum of 4 million PIDs should be enough for a while.
> >      * [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
> >      */
> >     #define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
> >             (sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))
> >
> > And in the futex UAPI header, we have this, baking a TID limit into
> > the userspace API (note that this is pow(2,30), not pow(2,29) as the
> > comment in threads.h claims - I'm not sure where that difference comes
> > from):
> >
> >     /*
> >      * The rest of the robust-futex field is for the TID:
> >      */
> >     #define FUTEX_TID_MASK 0x3fffffff
> >
> > So AFAICS, with the current PID_MAX_LIMIT, if you assume that all
> > participating refcount operations are non-batched (delta 1) and the
> > attacker can't cause the threads to oops in the middle of the refcount
> > operation (maybe that would be possible if you managed to find
> > something like a NULL pointer dereference in perf software event code
> > and had perf paranoia at <=1 , or something like that? - I'm not
> > sure), then even in a theoretical scenario where an attacker spawns
> > the maximum number of tasks possible and manages to get all of them to
> > sequentially preempt while being in the middle of increment operations
> > in several nested contexts (I'm not sure whether that can even happen
> > - you're not going to take typical sleeping exceptions like page
> > faults in the middle of a refcount op), the attacker will stay
> > comfortably inside the saturated range. Even if the PID_MAX_LIMIT is
> > at some point raised to the maximum permitted by the futex UAPI, this
> > still holds as long as you assume no nesting. Hm, should I send a
> > patch to add something like this to the explanatory comment?
>
> Sure, I'd be happy to improve the document by adding this -- please send
> out a patch for review. It's probably also worth mentioning the batching
> use-cases, although I struggle to reason about the window between the
> {under,over}flow occuring and saturation.

In the overflow case, it's fine, right? If you briefly crossed into
the saturation range and then went back down, using some tasks with
half-completed refcounting operations, then the refcount is still
behaving as a correct non-saturating refcount. (And it can't cross
over to the other end of the saturation range, because that's twice as
much distance as you'd need to unpin a saturated refcount.)

And in the underflow case, we can't deterministically protect anyway
without some external mechanism to protect the object's lifetime while
someone is already freeing it - so that's pretty much just a
best-effort thing anyway.

> > Of course, if someone uses refcount batching with sufficiently large
> > values, those guarantees go out of the window - if we wanted to be
> > perfectionist about this, we could make the batched operations do slow
> > cmpxchg stuff while letting the much more performance-critical
> > single-reference case continue to use the fast saturation scheme.
> > OTOH, the networking folks would probably hate that, since they're
> > using the batched ops for ->sk_wmem_alloc stuff, where they count
> > bytes as references? So I guess maybe we should leave it as-is.
>
> Agreed. If we hamper the performance here, then people will either look
> to disable the checking or they will switch to atomic_t, which puts us
> back to square one. Perfection is the enemy of the good and all that.
>
> Having said that, I'd still be keen to learn about any practical attacks
> on this in case there is something smart we can do to mitigate them
> without cmpxchg(). For example, one silly approach might be to bound the
> maximum increment and split larger ones up using a loop.

I guess that'd do the job. I don't know whether it's worth the trouble
in practice though.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-03-02 13:06 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20191121115902.2551-1-will@kernel.org>
     [not found] ` <20191121115902.2551-6-will@kernel.org>
2020-02-26  4:10   ` [RESEND PATCH v4 05/10] lib/refcount: Improve performance of generic REFCOUNT_FULL code Jann Horn
2020-02-28 10:43     ` Will Deacon
2020-03-02 13:06       ` Jann Horn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).