containers.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [GIT PULL]  ucounts: Count rlimits in each user namespace
@ 2021-06-28 22:35 Eric W. Biederman
  2021-06-29  3:47 ` Linus Torvalds
  2021-06-29  3:50 ` [GIT PULL] ucounts: Count rlimits in each user namespace pr-tracker-bot
  0 siblings, 2 replies; 22+ messages in thread
From: Eric W. Biederman @ 2021-06-28 22:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Linux Containers


Linus,

Please pull the for-linus branch from the git tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

  HEAD: 5e6b8a50a7cec5686ee2c4bda1d49899c79a7eae cred: add missing return error code when set_cred_ucounts() failed

This is the work mainly by Alexey Gladkov to limit rlimits to the
rlimits of the user that created a user namespace, and to allow users to
have stricter limits on the resources created within a user namespace.

There is a more detailed changelog from Alexey in the merge commit
9b624988221b ("ucounts: Count rlimits in each user namespace")

---

Alexey Gladkov (9):
      Increase size of ucounts to atomic_long_t
      Add a reference to ucounts for each cred
      Use atomic_t for ucounts reference counting
      Reimplement RLIMIT_NPROC on top of ucounts
      Reimplement RLIMIT_MSGQUEUE on top of ucounts
      Reimplement RLIMIT_SIGPENDING on top of ucounts
      Reimplement RLIMIT_MEMLOCK on top of ucounts
      kselftests: Add test to check for rlimit changes in different user namespaces
      ucounts: Set ucount_max to the largest positive value the type can hold

Eric W. Biederman (2):
      ucounts: Count rlimits in each user namespace
      ucounts: Silence warning in dec_rlimit_ucounts

Yang Yingliang (1):
      cred: add missing return error code when set_cred_ucounts() failed

 fs/exec.c                                          |   6 +-
 fs/hugetlbfs/inode.c                               |  16 +-
 fs/proc/array.c                                    |   2 +-
 include/linux/cred.h                               |   4 +
 include/linux/hugetlb.h                            |   4 +-
 include/linux/mm.h                                 |   4 +-
 include/linux/sched/user.h                         |   7 -
 include/linux/shmem_fs.h                           |   2 +-
 include/linux/signal_types.h                       |   4 +-
 include/linux/user_namespace.h                     |  31 +++-
 ipc/mqueue.c                                       |  40 ++---
 ipc/shm.c                                          |  26 ++--
 kernel/cred.c                                      |  51 ++++++-
 kernel/exit.c                                      |   2 +-
 kernel/fork.c                                      |  18 ++-
 kernel/signal.c                                    |  25 ++--
 kernel/sys.c                                       |  14 +-
 kernel/ucount.c                                    | 116 +++++++++++----
 kernel/user.c                                      |   3 -
 kernel/user_namespace.c                            |   9 +-
 mm/memfd.c                                         |   4 +-
 mm/mlock.c                                         |  22 ++-
 mm/mmap.c                                          |   4 +-
 mm/shmem.c                                         |  10 +-
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/rlimits/.gitignore         |   2 +
 tools/testing/selftests/rlimits/Makefile           |   6 +
 tools/testing/selftests/rlimits/config             |   1 +
 .../testing/selftests/rlimits/rlimits-per-userns.c | 161 +++++++++++++++++++++
 29 files changed, 468 insertions(+), 127 deletions(-)



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] ucounts: Count rlimits in each user namespace
  2021-06-28 22:35 [GIT PULL] ucounts: Count rlimits in each user namespace Eric W. Biederman
@ 2021-06-29  3:47 ` Linus Torvalds
  2021-06-29 15:04   ` Eric W. Biederman
                     ` (2 more replies)
  2021-06-29  3:50 ` [GIT PULL] ucounts: Count rlimits in each user namespace pr-tracker-bot
  1 sibling, 3 replies; 22+ messages in thread
From: Linus Torvalds @ 2021-06-29  3:47 UTC (permalink / raw)
  To: Eric W. Biederman, Alexey Gladkov
  Cc: Linux Kernel Mailing List, Linux Containers

On Mon, Jun 28, 2021 at 3:35 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> This is the work mainly by Alexey Gladkov to limit rlimits to the
> rlimits of the user that created a user namespace, and to allow users to
> have stricter limits on the resources created within a user namespace.

I guess all the performance issues got sorted, since I haven't seen
any reports from the test robots.

I do end up with two questions, mainly because of looking at the
result of the conflict resolution.

In particular, in __sigqueue_alloc(), two oddities..

Why the "sigpending < LONG_MAX" test in that

        if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
task_rlimit(t, RLIMIT_SIGPENDING))) {

thing?

And why test for "ucounts" being non-NULL in

                if (ucounts && dec_rlimit_ucounts(ucounts,
UCOUNT_RLIMIT_SIGPENDING, 1))
                        put_ucounts(ucounts);

when afaik both of those should be happy with a NULL 'ucounts' pointer
(if it was NULL, we certainly already used it for the reverse
operations for get_ucounts() and inc_rlimit_ucounts()..)

Hmm?

And somebody should verify that I didn't screw anything up in my merge
resolution. It all looked very straightforward, but mistakes happen..

                   Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL]  ucounts: Count rlimits in each user namespace
  2021-06-28 22:35 [GIT PULL] ucounts: Count rlimits in each user namespace Eric W. Biederman
  2021-06-29  3:47 ` Linus Torvalds
@ 2021-06-29  3:50 ` pr-tracker-bot
  1 sibling, 0 replies; 22+ messages in thread
From: pr-tracker-bot @ 2021-06-29  3:50 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linus Torvalds, linux-kernel, Linux Containers

The pull request you sent on Mon, 28 Jun 2021 17:35:22 -0500:

> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/c54b245d011855ea91c5beff07f1db74143ce614

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] ucounts: Count rlimits in each user namespace
  2021-06-29  3:47 ` Linus Torvalds
@ 2021-06-29 15:04   ` Eric W. Biederman
  2021-06-29 15:51   ` Eric W. Biederman
  2021-06-29 17:17   ` Alexey Gladkov
  2 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2021-06-29 15:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Linux Containers

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jun 28, 2021 at 3:35 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> This is the work mainly by Alexey Gladkov to limit rlimits to the
>> rlimits of the user that created a user namespace, and to allow users to
>> have stricter limits on the resources created within a user namespace.
>
> I guess all the performance issues got sorted, since I haven't seen
> any reports from the test robots.

Yes.  The structure was made to not change anything unnecessarily
(such as the ordering in sigqueue_alloc) and the performances
differences went away.

With the code in linux-next the entire cycle I think that is a reliable
result.  There are probably some things we could do to further optimize
things but we did not need them to avoid regressions.

> I do end up with two questions, mainly because of looking at the
> result of the conflict resolution.
>
> In particular, in __sigqueue_alloc(), two oddities..
>
> Why the "sigpending < LONG_MAX" test in that
>
>         if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
> task_rlimit(t, RLIMIT_SIGPENDING))) {
>
> thing?

> And why test for "ucounts" being non-NULL in
>
>                 if (ucounts && dec_rlimit_ucounts(ucounts,
> UCOUNT_RLIMIT_SIGPENDING, 1))
>                         put_ucounts(ucounts);
>
> when afaik both of those should be happy with a NULL 'ucounts' pointer
> (if it was NULL, we certainly already used it for the reverse
> operations for get_ucounts() and inc_rlimit_ucounts()..)
>
> Hmm?

Yes.  I suspect that those tests are left over from a previous version
of the change.  Alex do you remember why those tests are there?

> And somebody should verify that I didn't screw anything up in my merge
> resolution. It all looked very straightforward, but mistakes happen..

Just reading through the resolution looks correct.

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] ucounts: Count rlimits in each user namespace
  2021-06-29  3:47 ` Linus Torvalds
  2021-06-29 15:04   ` Eric W. Biederman
@ 2021-06-29 15:51   ` Eric W. Biederman
  2021-06-29 16:34     ` Linus Torvalds
  2021-06-29 17:17   ` Alexey Gladkov
  2 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2021-06-29 15:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Linux Containers

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Why the "sigpending < LONG_MAX" test in that
>
>         if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
> task_rlimit(t, RLIMIT_SIGPENDING))) {
> thing?

On second look that sigpending < LONG_MAX check is necessary.  When
inc_rlimit_ucounts detects a problem it returns LONG_MAX.

So the check for LONG_MAX is the condensed check to see if there is a
problem in any other levels of the ucount hierarchy.

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] ucounts: Count rlimits in each user namespace
  2021-06-29 15:51   ` Eric W. Biederman
@ 2021-06-29 16:34     ` Linus Torvalds
  2021-06-29 16:42       ` Eric W. Biederman
  0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2021-06-29 16:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Linux Containers

On Tue, Jun 29, 2021 at 8:52 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> > Why the "sigpending < LONG_MAX" test in that
> >
> >         if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
> > task_rlimit(t, RLIMIT_SIGPENDING))) {
> > thing?
>
> On second look that sigpending < LONG_MAX check is necessary.  When
> inc_rlimit_ucounts detects a problem it returns LONG_MAX.

I saw that, but _without_ that test you'd be left with just that

    sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)

and if task_rlimit() is LONG_MAX, then that means "no limits", so it is all ok.

IOW, afaik even _if_ sigpending ends up being LONG_MAX, the
conditional does the right thing without that "sigpending < LONG_MAX"
test.

                Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] ucounts: Count rlimits in each user namespace
  2021-06-29 16:34     ` Linus Torvalds
@ 2021-06-29 16:42       ` Eric W. Biederman
  2021-06-29 17:09         ` Eric W. Biederman
  0 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2021-06-29 16:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Linux Containers

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jun 29, 2021 at 8:52 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>>
>> > Why the "sigpending < LONG_MAX" test in that
>> >
>> >         if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
>> > task_rlimit(t, RLIMIT_SIGPENDING))) {
>> > thing?
>>
>> On second look that sigpending < LONG_MAX check is necessary.  When
>> inc_rlimit_ucounts detects a problem it returns LONG_MAX.
>
> I saw that, but _without_ that test you'd be left with just that
>
>     sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)
>
> and if task_rlimit() is LONG_MAX, then that means "no limits", so it is all ok.

It means no limits locally.  The creator of your user namespace might
have had a limit which you are also bound by.

The other possibility is that inc_rlimits_ucounts caused a sigpending
counter to overflow.  In which case we need to fail and run
dec_rlimit_ucounts to keep the counter from staying overflowed.

So I don't see a clever way to avoid the sigpending < LONG_MAX  test.

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] ucounts: Count rlimits in each user namespace
  2021-06-29 16:42       ` Eric W. Biederman
@ 2021-06-29 17:09         ` Eric W. Biederman
  2021-07-01 16:41           ` Alexey Gladkov
  0 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2021-06-29 17:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Linux Containers

ebiederm@xmission.com (Eric W. Biederman) writes:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
>> On Tue, Jun 29, 2021 at 8:52 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>
>>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>>>
>>> > Why the "sigpending < LONG_MAX" test in that
>>> >
>>> >         if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
>>> > task_rlimit(t, RLIMIT_SIGPENDING))) {
>>> > thing?
>>>
>>> On second look that sigpending < LONG_MAX check is necessary.  When
>>> inc_rlimit_ucounts detects a problem it returns LONG_MAX.
>>
>> I saw that, but _without_ that test you'd be left with just that
>>
>>     sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)
>>
>> and if task_rlimit() is LONG_MAX, then that means "no limits", so it is all ok.
>
> It means no limits locally.  The creator of your user namespace might
> have had a limit which you are also bound by.
>
> The other possibility is that inc_rlimits_ucounts caused a sigpending
> counter to overflow.  In which case we need to fail and run
> dec_rlimit_ucounts to keep the counter from staying overflowed.
>
> So I don't see a clever way to avoid the sigpending < LONG_MAX  test.

Hmm.  I take that back.  There is a simple clever way to satisfy all of
the tests.

- sigpending < LONG_MAX && sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)
+ sigpending < task_rlimit(t, RLIMIT_SIGPENDING)

That would just need a small comment to explain the subtleties.  

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] ucounts: Count rlimits in each user namespace
  2021-06-29  3:47 ` Linus Torvalds
  2021-06-29 15:04   ` Eric W. Biederman
  2021-06-29 15:51   ` Eric W. Biederman
@ 2021-06-29 17:17   ` Alexey Gladkov
  2021-06-29 18:07     ` Linus Torvalds
  2 siblings, 1 reply; 22+ messages in thread
From: Alexey Gladkov @ 2021-06-29 17:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Linux Containers

On Mon, Jun 28, 2021 at 08:47:12PM -0700, Linus Torvalds wrote:
> On Mon, Jun 28, 2021 at 3:35 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > This is the work mainly by Alexey Gladkov to limit rlimits to the
> > rlimits of the user that created a user namespace, and to allow users to
> > have stricter limits on the resources created within a user namespace.
> 
> I guess all the performance issues got sorted, since I haven't seen
> any reports from the test robots.
> 
> I do end up with two questions, mainly because of looking at the
> result of the conflict resolution.
> 
> In particular, in __sigqueue_alloc(), two oddities..
> 
> Why the "sigpending < LONG_MAX" test in that
> 
>         if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
> task_rlimit(t, RLIMIT_SIGPENDING))) {
> 
> thing?

inc_rlimit_ucounts() returns long and uses LONG_MAX as an overflow flag.
At the same time, we have increased the size of sigpending from int to
long.

> And why test for "ucounts" being non-NULL in
> 
>                 if (ucounts && dec_rlimit_ucounts(ucounts,
> UCOUNT_RLIMIT_SIGPENDING, 1))
>                         put_ucounts(ucounts);
> 
> when afaik both of those should be happy with a NULL 'ucounts' pointer
> (if it was NULL, we certainly already used it for the reverse
> operations for get_ucounts() and inc_rlimit_ucounts()..)

The get_ucount() can theoretically return NULL. It increments the
reference counter and if it overflows, the function will return NULL.

> Hmm?
> 
> And somebody should verify that I didn't screw anything up in my merge
> resolution. It all looked very straightforward, but mistakes happen..

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] ucounts: Count rlimits in each user namespace
  2021-06-29 17:17   ` Alexey Gladkov
@ 2021-06-29 18:07     ` Linus Torvalds
  2021-06-29 20:20       ` Alexey Gladkov
  0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2021-06-29 18:07 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Linux Containers

On Tue, Jun 29, 2021 at 10:18 AM Alexey Gladkov <legion@kernel.org> wrote:
>
>
> > And why test for "ucounts" being non-NULL in
> >
> >                 if (ucounts && dec_rlimit_ucounts(ucounts,
> > UCOUNT_RLIMIT_SIGPENDING, 1))
> >                         put_ucounts(ucounts);
> >
> > when afaik both of those should be happy with a NULL 'ucounts' pointer
> > (if it was NULL, we certainly already used it for the reverse
> > operations for get_ucounts() and inc_rlimit_ucounts()..)
>
> The get_ucount() can theoretically return NULL. It increments the
> reference counter and if it overflows, the function will return NULL.

.. but my point is that dec_rlimit_ucounts() and put_ucounts() should
be fine with whatever get_ucounts() returned. No

It looks like put_ucounts() is unhappy with a NULL ucounts argument,
but I think _that_ is what should get fixed.

I think that conceptually we should have two clear alternatives:

 (a) either "get_ucounts()" returning NULL should be an error, and we
would have returned long before

or

 (b) a NULL uncounts is usable, and a sequence like
put_ucounts(get_ucounts()) should just always work.

And honestly, a lot of the other ucounts funcrtions seem to take that
(b) approach. Example in that very function:

        ucounts = task_ucounts(t);
        sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);

which at no point tested for NULL or returned an error.

(And that also implies that the comment in dec_rlimit_ucounts() about
"Silence compiler warning" should just go away, because it's not just
a compiler warning, it's a required initialization).

              Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] ucounts: Count rlimits in each user namespace
  2021-06-29 18:07     ` Linus Torvalds
@ 2021-06-29 20:20       ` Alexey Gladkov
  2021-06-29 20:33         ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Alexey Gladkov @ 2021-06-29 20:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Linux Containers

On Tue, Jun 29, 2021 at 11:07:11AM -0700, Linus Torvalds wrote:
> On Tue, Jun 29, 2021 at 10:18 AM Alexey Gladkov <legion@kernel.org> wrote:
> >
> >
> > > And why test for "ucounts" being non-NULL in
> > >
> > >                 if (ucounts && dec_rlimit_ucounts(ucounts,
> > > UCOUNT_RLIMIT_SIGPENDING, 1))
> > >                         put_ucounts(ucounts);
> > >
> > > when afaik both of those should be happy with a NULL 'ucounts' pointer
> > > (if it was NULL, we certainly already used it for the reverse
> > > operations for get_ucounts() and inc_rlimit_ucounts()..)
> >
> > The get_ucount() can theoretically return NULL. It increments the
> > reference counter and if it overflows, the function will return NULL.
> 
> .. but my point is that dec_rlimit_ucounts() and put_ucounts() should
> be fine with whatever get_ucounts() returned. No
> 
> It looks like put_ucounts() is unhappy with a NULL ucounts argument,
> but I think _that_ is what should get fixed.
> 
> I think that conceptually we should have two clear alternatives:
> 
>  (a) either "get_ucounts()" returning NULL should be an error, and we
> would have returned long before

get_ucounts() in the __sigqueue_alloc() performs the get_uid() function
but does not ignore the counter overflow. Basically get_uid() can fail in
same way as get_ucounts(), but we just ignore it.

> or
> 
>  (b) a NULL uncounts is usable, and a sequence like
> put_ucounts(get_ucounts()) should just always work.
> 
> And honestly, a lot of the other ucounts funcrtions seem to take that
> (b) approach. Example in that very function:
> 
>         ucounts = task_ucounts(t);
>         sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> 
> which at no point tested for NULL or returned an error.

Waaaait. task_ucounts() is a different thing. This function only gets a
field from the task structure without any reference counting. But the
get_ucounts() is more like get_user_ns() or get_uid(), but does not ignore
counter overflow.

Earlier I tried to use refcount_t which never returns errors [1]. We
talked and you said that ignoring counter overflow errors is bad
design for this case.

> (And that also implies that the comment in dec_rlimit_ucounts() about
> "Silence compiler warning" should just go away, because it's not just
> a compiler warning, it's a required initialization).
> 
>               Linus

[1] https://lore.kernel.org/lkml/CAHk-%3dwjYOCgM%2bmKzwTZwkDDg12DdYjFFkmoFKYLim7NFmR9HBg@mail.gmail.com/

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] ucounts: Count rlimits in each user namespace
  2021-06-29 20:20       ` Alexey Gladkov
@ 2021-06-29 20:33         ` Linus Torvalds
  2021-06-29 21:22           ` Alexey Gladkov
  2021-07-02 17:54           ` [PATCH] ucounts: Fix UCOUNT_RLIMIT_SIGPENDING counter leak Alexey Gladkov
  0 siblings, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2021-06-29 20:33 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Linux Containers

On Tue, Jun 29, 2021 at 1:20 PM Alexey Gladkov <legion@kernel.org> wrote:
>
> Waaaait. task_ucounts() is a different thing. This function only gets a
> field from the task structure without any reference counting. But the
> get_ucounts() is more like get_user_ns() or get_uid(), but does not ignore
> counter overflow.

Alexey, that code cannot be right.

Look here:

        rcu_read_lock();
        ucounts = task_ucounts(t);
        sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
        if (sigpending == 1)
                ucounts = get_ucounts(ucounts);
        rcu_read_unlock();

so now we've done that inc_rlimit_ucounts() unconditionally on that
task_ucounts() thing.

And then if the allocation fails (or the limit is hit) the code does

        if (ucounts && dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
                put_ucounts(ucounts);

ie now it does the dec_rlimit_ucounts _conditionally_.

See what I'm complaining about? This is not logical, AND IT CANNOT
POSSIBLY BE RIGHT.

My argument is that

 (a) the dec_rlimit_ucounts() has to pair up with
inc_rlimit_ucounts(), or you're leaking counts

 (b) get_ucounts() has to pair up with put_ucounts().

Note that (a) has to be REGARDLESS of whether get_ucounts() was
successful or not.

> Earlier I tried to use refcount_t which never returns errors [1]. We
> talked and you said that ignoring counter overflow errors is bad
> design for this case.

You can't ignore counter overflow errors, no. But that's exactly what
that code is doing.

If get_ucount() fails due to overflow, you don't return an error. You
just miscount the end result!

So yeah, its' "testing" the overflow condition, but that's not an
argument, when it then DOES EXPLICITLY THE WRONG THING.

At that point, the test is actively harmful and wrong. See?

             Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] ucounts: Count rlimits in each user namespace
  2021-06-29 20:33         ` Linus Torvalds
@ 2021-06-29 21:22           ` Alexey Gladkov
  2021-07-02 17:54           ` [PATCH] ucounts: Fix UCOUNT_RLIMIT_SIGPENDING counter leak Alexey Gladkov
  1 sibling, 0 replies; 22+ messages in thread
From: Alexey Gladkov @ 2021-06-29 21:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Linux Containers

On Tue, Jun 29, 2021 at 01:33:39PM -0700, Linus Torvalds wrote:
> On Tue, Jun 29, 2021 at 1:20 PM Alexey Gladkov <legion@kernel.org> wrote:
> >
> > Waaaait. task_ucounts() is a different thing. This function only gets a
> > field from the task structure without any reference counting. But the
> > get_ucounts() is more like get_user_ns() or get_uid(), but does not ignore
> > counter overflow.
> 
> Alexey, that code cannot be right.
> 
> Look here:
> 
>         rcu_read_lock();
>         ucounts = task_ucounts(t);
>         sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
>         if (sigpending == 1)
>                 ucounts = get_ucounts(ucounts);
>         rcu_read_unlock();
> 
> so now we've done that inc_rlimit_ucounts() unconditionally on that
> task_ucounts() thing.
> 
> And then if the allocation fails (or the limit is hit) the code does
> 
>         if (ucounts && dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
>                 put_ucounts(ucounts);
> 
> ie now it does the dec_rlimit_ucounts _conditionally_.
> 
> See what I'm complaining about? This is not logical, AND IT CANNOT
> POSSIBLY BE RIGHT.
> 
> My argument is that
> 
>  (a) the dec_rlimit_ucounts() has to pair up with
> inc_rlimit_ucounts(), or you're leaking counts
> 
>  (b) get_ucounts() has to pair up with put_ucounts().
> 
> Note that (a) has to be REGARDLESS of whether get_ucounts() was
> successful or not.
> 
> > Earlier I tried to use refcount_t which never returns errors [1]. We
> > talked and you said that ignoring counter overflow errors is bad
> > design for this case.
> 
> You can't ignore counter overflow errors, no. But that's exactly what
> that code is doing.
> 
> If get_ucount() fails due to overflow, you don't return an error. You
> just miscount the end result!
> 
> So yeah, its' "testing" the overflow condition, but that's not an
> argument, when it then DOES EXPLICITLY THE WRONG THING.
> 
> At that point, the test is actively harmful and wrong. See?

Yes. Please, give me some time to fix it.

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] ucounts: Count rlimits in each user namespace
  2021-06-29 17:09         ` Eric W. Biederman
@ 2021-07-01 16:41           ` Alexey Gladkov
  2021-07-01 20:05             ` Eric W. Biederman
  0 siblings, 1 reply; 22+ messages in thread
From: Alexey Gladkov @ 2021-07-01 16:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Linux Kernel Mailing List, Linux Containers

On Tue, Jun 29, 2021 at 12:09:01PM -0500, Eric W. Biederman wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
> 
> > Linus Torvalds <torvalds@linux-foundation.org> writes:
> >
> >> On Tue, Jun 29, 2021 at 8:52 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >>>
> >>> Linus Torvalds <torvalds@linux-foundation.org> writes:
> >>>
> >>> > Why the "sigpending < LONG_MAX" test in that
> >>> >
> >>> >         if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
> >>> > task_rlimit(t, RLIMIT_SIGPENDING))) {
> >>> > thing?
> >>>
> >>> On second look that sigpending < LONG_MAX check is necessary.  When
> >>> inc_rlimit_ucounts detects a problem it returns LONG_MAX.
> >>
> >> I saw that, but _without_ that test you'd be left with just that
> >>
> >>     sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)
> >>
> >> and if task_rlimit() is LONG_MAX, then that means "no limits", so it is all ok.
> >
> > It means no limits locally.  The creator of your user namespace might
> > have had a limit which you are also bound by.
> >
> > The other possibility is that inc_rlimits_ucounts caused a sigpending
> > counter to overflow.  In which case we need to fail and run
> > dec_rlimit_ucounts to keep the counter from staying overflowed.
> >
> > So I don't see a clever way to avoid the sigpending < LONG_MAX  test.
> 
> Hmm.  I take that back.  There is a simple clever way to satisfy all of
> the tests.
> 
> - sigpending < LONG_MAX && sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)
> + sigpending < task_rlimit(t, RLIMIT_SIGPENDING)
> 
> That would just need a small comment to explain the subtleties.  

Is it because user.sigpending was atomic_t before this patch ?

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [GIT PULL] ucounts: Count rlimits in each user namespace
  2021-07-01 16:41           ` Alexey Gladkov
@ 2021-07-01 20:05             ` Eric W. Biederman
  0 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2021-07-01 20:05 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linus Torvalds, Linux Kernel Mailing List, Linux Containers

Alexey Gladkov <legion@kernel.org> writes:

> On Tue, Jun 29, 2021 at 12:09:01PM -0500, Eric W. Biederman wrote:
>> ebiederm@xmission.com (Eric W. Biederman) writes:
>> 
>> > Linus Torvalds <torvalds@linux-foundation.org> writes:
>> >
>> >> On Tue, Jun 29, 2021 at 8:52 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> >>>
>> >>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>> >>>
>> >>> > Why the "sigpending < LONG_MAX" test in that
>> >>> >
>> >>> >         if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
>> >>> > task_rlimit(t, RLIMIT_SIGPENDING))) {
>> >>> > thing?
>> >>>
>> >>> On second look that sigpending < LONG_MAX check is necessary.  When
>> >>> inc_rlimit_ucounts detects a problem it returns LONG_MAX.
>> >>
>> >> I saw that, but _without_ that test you'd be left with just that
>> >>
>> >>     sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)
>> >>
>> >> and if task_rlimit() is LONG_MAX, then that means "no limits", so it is all ok.
>> >
>> > It means no limits locally.  The creator of your user namespace might
>> > have had a limit which you are also bound by.
>> >
>> > The other possibility is that inc_rlimits_ucounts caused a sigpending
>> > counter to overflow.  In which case we need to fail and run
>> > dec_rlimit_ucounts to keep the counter from staying overflowed.
>> >
>> > So I don't see a clever way to avoid the sigpending < LONG_MAX  test.
>> 
>> Hmm.  I take that back.  There is a simple clever way to satisfy all of
>> the tests.
>> 
>> - sigpending < LONG_MAX && sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)
>> + sigpending < task_rlimit(t, RLIMIT_SIGPENDING)
>> 
>> That would just need a small comment to explain the subtleties.  
>
> Is it because user.sigpending was atomic_t before this patch ?

Apologies I was wrong.

The replacement of "<=" with "<" is correct for the case where
"task_rlimit(t, RLIMIT_SIGPENDING) == LONG_MAX".

Unfortunately off by one for all other values of
"task_rlimit(t, RLIMIT_SIGPENDING)".

It completely breaks things for the case where RLIMIT_SIGPENDING == 1,
where no signals are allowed to be queued.  Today allowing 1 queued
signal with a single task and a sender that does not send a second
signal until the first is consumed will work reliably.

That was just a brain fart on my part.

Eric


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] ucounts: Fix UCOUNT_RLIMIT_SIGPENDING counter leak
  2021-06-29 20:33         ` Linus Torvalds
  2021-06-29 21:22           ` Alexey Gladkov
@ 2021-07-02 17:54           ` Alexey Gladkov
  2021-07-02 22:13             ` Linus Torvalds
  1 sibling, 1 reply; 22+ messages in thread
From: Alexey Gladkov @ 2021-07-02 17:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Linux Containers,
	Alexey Gladkov

If get_ucounts() could not increase the reference counter, then in case
of an error it will be impossible to decrease the counter.

In case of an error, the ucounts variable cannot be set to NULL. Also
dec_rlimit_ucounts() has to be REGARDLESS of whether get_ucounts() was
successful or not.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 kernel/signal.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index de0920353d30..87a64b3307a8 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -412,7 +412,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
 		 int override_rlimit, const unsigned int sigqueue_flags)
 {
 	struct sigqueue *q = NULL;
-	struct ucounts *ucounts = NULL;
+	struct ucounts *ucounts, *ucounts_new;
 	long sigpending;
 
 	/*
@@ -424,10 +424,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
 	 * changes from/to zero.
 	 */
 	rcu_read_lock();
-	ucounts = task_ucounts(t);
+	ucounts = ucounts_new = task_ucounts(t);
 	sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
 	if (sigpending == 1)
-		ucounts = get_ucounts(ucounts);
+		ucounts_new = get_ucounts(ucounts);
 	rcu_read_unlock();
 
 	if (override_rlimit || (sigpending < LONG_MAX && sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
@@ -437,13 +437,24 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
 	}
 
 	if (unlikely(q == NULL)) {
-		if (ucounts && dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
-			put_ucounts(ucounts);
+		ucounts_new = NULL;
 	} else {
 		INIT_LIST_HEAD(&q->list);
 		q->flags = sigqueue_flags;
-		q->ucounts = ucounts;
+		q->ucounts = ucounts_new;
 	}
+
+	/*
+	 * In case it failed to allocate sigqueue or ucounts reference counter
+	 * overflow, we decrement UCOUNT_RLIMIT_SIGPENDING to avoid counter
+	 * leaks.
+	 */
+	if (unlikely(ucounts_new == NULL)) {
+		dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
+		if (sigpending == 1)
+			put_ucounts(ucounts);
+	}
+
 	return q;
 }
 
@@ -451,7 +462,7 @@ static void __sigqueue_free(struct sigqueue *q)
 {
 	if (q->flags & SIGQUEUE_PREALLOC)
 		return;
-	if (q->ucounts && dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) {
+	if (dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) {
 		put_ucounts(q->ucounts);
 		q->ucounts = NULL;
 	}
-- 
2.29.3


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ucounts: Fix UCOUNT_RLIMIT_SIGPENDING counter leak
  2021-07-02 17:54           ` [PATCH] ucounts: Fix UCOUNT_RLIMIT_SIGPENDING counter leak Alexey Gladkov
@ 2021-07-02 22:13             ` Linus Torvalds
  2021-07-07 16:50               ` Alexey Gladkov
  0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2021-07-02 22:13 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Linux Containers

On Fri, Jul 2, 2021 at 10:55 AM Alexey Gladkov <legion@kernel.org> wrote:
>
> @@ -424,10 +424,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
>          * changes from/to zero.
>          */
>         rcu_read_lock();
> -       ucounts = task_ucounts(t);
> +       ucounts = ucounts_new = task_ucounts(t);
>         sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
>         if (sigpending == 1)
> -               ucounts = get_ucounts(ucounts);
> +               ucounts_new = get_ucounts(ucounts);
>         rcu_read_unlock();

I think this is still problematic.

If get_ucounts() fails, we can't just drop the RCU lock and (later)
use "ucounts" that we hold no reference to.

Or am I missing something? I'm not entirely sure about the lifetime of
that RCU protection, and I do note that "task_ucounts()" uses
"task_cred_xxx()", which already does
rcu_read_lock()/rcu_read_unlock() in the actual access.

So I'm thinking the code could/should be written something like this instead:

  diff --git a/kernel/signal.c b/kernel/signal.c
  index f6371dfa1f89..40781b197227 100644
  --- a/kernel/signal.c
  +++ b/kernel/signal.c
  @@ -422,22 +422,33 @@ __sigqueue_alloc(int sig, struct task_struct
*t, gfp_t gfp_flags,
         * NOTE! A pending signal will hold on to the user refcount,
         * and we get/put the refcount only when the sigpending count
         * changes from/to zero.
  +      *
  +      * And if the ucount rlimit overflowed, we do not get to use it at all.
         */
        rcu_read_lock();
        ucounts = task_ucounts(t);
        sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
  -     if (sigpending == 1)
  -             ucounts = get_ucounts(ucounts);
  +     switch (sigpending) {
  +     case 1:
  +             if (likely(get_ucounts(ucounts)))
  +                     break;
  +
  +             dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
  +             fallthrough;
  +     case LONG_MAX:
  +             rcu_read_unlock();
  +             return NULL;
  +     }
        rcu_read_unlock();

  -     if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
task_rlimit(t, RLIMIT_SIGPENDING))) {
  +     if (override_rlimit || sigpending <= task_rlimit(t,
RLIMIT_SIGPENDING)) {
                q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
        } else {
                print_dropped_signal(sig);
        }

        if (unlikely(q == NULL)) {
  -             if (ucounts && dec_rlimit_ucounts(ucounts,
UCOUNT_RLIMIT_SIGPENDING, 1))
  +             if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
                        put_ucounts(ucounts);
        } else {
                INIT_LIST_HEAD(&q->list);

(and no, I'm not sure it's a good idea to make that use a "switch()" -
maybe the LONG_MAX case should be a "if (unlikely())" thing after the
rcu_read_ulock() instead?

Hmm?

The alternate thing is to say "No, Linus, you're a nincompoop and
wrong, that RCU protection is a non-issue because we hold a reference
to the task, and task_ucounts() will not change, so the RCU read lock
doesn't do anything".

Although then I would think the rcu_read_lock/rcu_read_unlock here is
entirely pointless.

               Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ucounts: Fix UCOUNT_RLIMIT_SIGPENDING counter leak
  2021-07-02 22:13             ` Linus Torvalds
@ 2021-07-07 16:50               ` Alexey Gladkov
  2021-07-07 17:23                 ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Alexey Gladkov @ 2021-07-07 16:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Linux Containers

On Fri, Jul 02, 2021 at 03:13:18PM -0700, Linus Torvalds wrote:
> On Fri, Jul 2, 2021 at 10:55 AM Alexey Gladkov <legion@kernel.org> wrote:
> >
> > @@ -424,10 +424,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
> >          * changes from/to zero.
> >          */
> >         rcu_read_lock();
> > -       ucounts = task_ucounts(t);
> > +       ucounts = ucounts_new = task_ucounts(t);
> >         sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> >         if (sigpending == 1)
> > -               ucounts = get_ucounts(ucounts);
> > +               ucounts_new = get_ucounts(ucounts);
> >         rcu_read_unlock();
> 
> I think this is still problematic.
> 
> If get_ucounts() fails, we can't just drop the RCU lock and (later)
> use "ucounts" that we hold no reference to.
> 
> Or am I missing something? I'm not entirely sure about the lifetime of
> that RCU protection, and I do note that "task_ucounts()" uses
> "task_cred_xxx()", which already does
> rcu_read_lock()/rcu_read_unlock() in the actual access.
> 
> So I'm thinking the code could/should be written something like this instead:
> 
>   diff --git a/kernel/signal.c b/kernel/signal.c
>   index f6371dfa1f89..40781b197227 100644
>   --- a/kernel/signal.c
>   +++ b/kernel/signal.c
>   @@ -422,22 +422,33 @@ __sigqueue_alloc(int sig, struct task_struct
> *t, gfp_t gfp_flags,
>          * NOTE! A pending signal will hold on to the user refcount,
>          * and we get/put the refcount only when the sigpending count
>          * changes from/to zero.
>   +      *
>   +      * And if the ucount rlimit overflowed, we do not get to use it at all.
>          */
>         rcu_read_lock();
>         ucounts = task_ucounts(t);
>         sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
>   -     if (sigpending == 1)
>   -             ucounts = get_ucounts(ucounts);
>   +     switch (sigpending) {
>   +     case 1:
>   +             if (likely(get_ucounts(ucounts)))
>   +                     break;
>   +
>   +             dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
>   +             fallthrough;
>   +     case LONG_MAX:

I think that the counter should be decreased in this case too.
inc_rlimit_ucounts() increments the counter in all parent userns. If we
don't decrease the counter then the parent userns will have a counter
leak.

>   +             rcu_read_unlock();
>   +             return NULL;
>   +     }
>         rcu_read_unlock();
> 
>   -     if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
> task_rlimit(t, RLIMIT_SIGPENDING))) {
>   +     if (override_rlimit || sigpending <= task_rlimit(t,
> RLIMIT_SIGPENDING)) {
>                 q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
>         } else {
>                 print_dropped_signal(sig);
>         }
> 
>         if (unlikely(q == NULL)) {
>   -             if (ucounts && dec_rlimit_ucounts(ucounts,
> UCOUNT_RLIMIT_SIGPENDING, 1))
>   +             if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
>                         put_ucounts(ucounts);
>         } else {
>                 INIT_LIST_HEAD(&q->list);
> 
> (and no, I'm not sure it's a good idea to make that use a "switch()" -
> maybe the LONG_MAX case should be a "if (unlikely())" thing after the
> rcu_read_ulock() instead?
> 
> Hmm?

You are absolutely right. This fixes a BUG I just got during internal
testing.

> The alternate thing is to say "No, Linus, you're a nincompoop and
> wrong, that RCU protection is a non-issue because we hold a reference
> to the task, and task_ucounts() will not change, so the RCU read lock
> doesn't do anything".
> 
> Although then I would think the rcu_read_lock/rcu_read_unlock here is
> entirely pointless.
> 
>                Linus
> 

[   84.670919] ==================================================================
[   84.673326] BUG: KASAN: use-after-free in put_ucounts+0x17/0xa0
[   84.674983] Write of size 4 at addr ffff8880045f031c by task swapper/2/0
[   84.676702]
[   84.677119] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.13.0+ #19
[   84.678920] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-alt4 04/01/2014
[   84.683128] Call Trace:
[   84.683950]  <IRQ>
[   84.684635]  dump_stack+0x8a/0xb5
[   84.685751]  print_address_description.constprop.0+0x18/0x130
[   84.687633]  ? put_ucounts+0x17/0xa0
[   84.688831]  ? put_ucounts+0x17/0xa0
[   84.690050]  kasan_report.cold+0x7f/0x111
[   84.691413]  ? put_ucounts+0x17/0xa0
[   84.692618]  kasan_check_range+0xf9/0x1e0
[   84.694039]  put_ucounts+0x17/0xa0
[   84.695237]  put_cred_rcu+0xd5/0x190
[   84.696998]  rcu_core+0x3bf/0xcb0
[   84.699692]  ? note_gp_changes+0x90/0x90
[   84.700938]  ? recalibrate_cpu_khz+0x10/0x10
[   84.702575]  ? lapic_timer_set_periodic+0x30/0x30
[   84.704152]  ? clockevents_program_event+0xd3/0x130
[   84.705754]  ? hrtimer_interrupt+0x418/0x440
[   84.707198]  __do_softirq+0xe3/0x341
[   84.708412]  irq_exit_rcu+0xbe/0xe0
[   84.709572]  sysvec_apic_timer_interrupt+0x6a/0x90
[   84.711235]  </IRQ>
[   84.711972]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[   84.713661] RIP: 0010:default_idle+0xb/0x10
[   84.715070] Code: ff f0 80 63 02 df 5b 41 5c c3 0f ae f0 0f ae 3b 0f ae f0 eb 90 66 2e 0f 1f 84 00 00 00 00 00 eb 07 0f 00 2d 37 7d 5a 5
[   84.721579] RSP: 0018:ffffc900000dfe80 EFLAGS: 00000202
[   84.723330] RAX: ffffffffad279280 RBX: ffff8880013f3e00 RCX: ffffffffad269c56
[   84.725724] RDX: 000000000001a8c6 RSI: 0000000000000004 RDI: ffff8880361322a0
[   84.728085] RBP: 0000000000000002 R08: 0000000000000001 R09: ffff8880361322a3
[   84.730485] R10: ffffed1006c26454 R11: 0000000000000001 R12: 0000000000000002
[   84.732814] R13: 0000000000000000 R14: 0000000000000000 R15: 1ffff9200001bfd6
[   84.736325]  ? mwait_idle+0xc0/0xc0
[   84.737488]  ? rcu_eqs_enter.constprop.0+0x86/0xa0
[   84.739085]  default_idle_call+0x53/0x130
[   84.740410]  do_idle+0x311/0x3c0
[   84.741512]  ? _raw_spin_lock_irqsave+0x7b/0xd0
[   84.743060]  ? arch_cpu_idle_exit+0x30/0x30
[   84.744442]  ? swake_up_locked+0x6d/0x80
[   84.745673]  cpu_startup_entry+0x14/0x20
[   84.746657]  secondary_startup_64_no_verify+0xc2/0xcb
[   84.747746]
[   84.748073] Allocated by task 127:
[   84.748742]  kasan_save_stack+0x1b/0x40
[   84.750483]  __kasan_kmalloc+0x7c/0x90
[   84.751359]  alloc_ucounts+0x169/0x2b0
[   84.752270]  set_cred_ucounts+0xbb/0x170
[   84.753266]  ksys_unshare+0x24c/0x4e0
[   84.754194]  __x64_sys_unshare+0x16/0x20
[   84.755332]  do_syscall_64+0x37/0x70
[   84.756503]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   84.758106]
[   84.758625] Freed by task 0:
[   84.759589]  kasan_save_stack+0x1b/0x40
[   84.760841]  kasan_set_track+0x1c/0x30
[   84.762082]  kasan_set_free_info+0x20/0x30
[   84.763421]  __kasan_slab_free+0xeb/0x120
[   84.764736]  kfree+0xaa/0x460
[   84.765722]  put_cred_rcu+0xd5/0x190
[   84.766925]  rcu_core+0x3bf/0xcb0
[   84.768074]  __do_softirq+0xe3/0x341
[   84.769262]
[   84.769765] The buggy address belongs to the object at ffff8880045f0300
[   84.769765]  which belongs to the cache kmalloc-192 of size 192
[   84.773675] The buggy address is located 28 bytes inside of
[   84.773675]  192-byte region [ffff8880045f0300, ffff8880045f03c0)
[   84.778876] The buggy address belongs to the page:
[   84.780338] page:000000008de0a388 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8880045f0000 pfn:0x45f0
[   84.782844] flags: 0x100000000000200(slab|node=0|zone=1)
[   84.784640] raw: 0100000000000200 ffffea00000f4640 0000000a0000000a ffff888001042a00
[   84.787229] raw: ffff8880045f0000 000000008010000d 00000001ffffffff 0000000000000000
[   84.789685] page dumped because: kasan: bad access detected
[   84.791511]
[   84.792037] Memory state around the buggy address:
[   84.793590]  ffff8880045f0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   84.795920]  ffff8880045f0280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[   84.798252] >ffff8880045f0300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   84.800636]                             ^
[   84.801953]  ffff8880045f0380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[   84.804261]  ffff8880045f0400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   84.806622] ==================================================================
[   84.808909] Disabling lock debugging due to kernel taint

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ucounts: Fix UCOUNT_RLIMIT_SIGPENDING counter leak
  2021-07-07 16:50               ` Alexey Gladkov
@ 2021-07-07 17:23                 ` Linus Torvalds
  2021-07-08 10:33                   ` [PATCH v2] " Alexey Gladkov
  2021-07-08 11:00                   ` [PATCH] ucounts: " Alexey Gladkov
  0 siblings, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2021-07-07 17:23 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Linux Containers

On Wed, Jul 7, 2021 at 9:50 AM Alexey Gladkov <legion@kernel.org> wrote:
>
> >   +             dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> >   +             fallthrough;
> >   +     case LONG_MAX:
>
> I think that the counter should be decreased in this case too.
> inc_rlimit_ucounts() increments the counter in all parent userns. If we
> don't decrease the counter then the parent userns will have a counter
> leak.

Ack. So basically that patch, but move the dec_rlimit_ucounts() into
the LONG_MAX case?

Would you mind making a real patch with a commit message, and trying
whatever test-case you had for that KASAN use-after-free report?

              Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2] Fix UCOUNT_RLIMIT_SIGPENDING counter leak
  2021-07-07 17:23                 ` Linus Torvalds
@ 2021-07-08 10:33                   ` Alexey Gladkov
  2021-07-08 18:44                     ` Linus Torvalds
  2021-07-08 11:00                   ` [PATCH] ucounts: " Alexey Gladkov
  1 sibling, 1 reply; 22+ messages in thread
From: Alexey Gladkov @ 2021-07-08 10:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Linux Containers, Alexey Gladkov,
	Eric W . Biederman, Oleg Nesterov

We must properly handle an errors when we increase the rlimit counter
and the ucounts reference counter. We have to this with RCU protection
to prevent possible use-after-free that could occur due to concurrent
put_cred_rcu().

The following reproducer triggers the problem:

$ cat testcase.sh
case "${STEP:-0}" in
0)
	ulimit -Si 1
	ulimit -Hi 1
	STEP=1 unshare -rU "$0"
	killall sleep
	;;
1)
	for i in 1 2 3 4 5; do unshare -rU sleep 5 & done
	;;
esac

[   84.670919] ==================================================================
[   84.673326] BUG: KASAN: use-after-free in put_ucounts+0x17/0xa0
[   84.674983] Write of size 4 at addr ffff8880045f031c by task swapper/2/0
[   84.676702]
[   84.677119] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.13.0+ #19
[   84.678920] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-alt4 04/01/2014
[   84.683128] Call Trace:
[   84.683950]  <IRQ>
[   84.684635]  dump_stack+0x8a/0xb5
[   84.685751]  print_address_description.constprop.0+0x18/0x130
[   84.687633]  ? put_ucounts+0x17/0xa0
[   84.688831]  ? put_ucounts+0x17/0xa0
[   84.690050]  kasan_report.cold+0x7f/0x111
[   84.691413]  ? put_ucounts+0x17/0xa0
[   84.692618]  kasan_check_range+0xf9/0x1e0
[   84.694039]  put_ucounts+0x17/0xa0
[   84.695237]  put_cred_rcu+0xd5/0x190
[   84.696998]  rcu_core+0x3bf/0xcb0
[   84.699692]  ? note_gp_changes+0x90/0x90
[   84.700938]  ? recalibrate_cpu_khz+0x10/0x10
[   84.702575]  ? lapic_timer_set_periodic+0x30/0x30
[   84.704152]  ? clockevents_program_event+0xd3/0x130
[   84.705754]  ? hrtimer_interrupt+0x418/0x440
[   84.707198]  __do_softirq+0xe3/0x341
[   84.708412]  irq_exit_rcu+0xbe/0xe0
[   84.709572]  sysvec_apic_timer_interrupt+0x6a/0x90
[   84.711235]  </IRQ>
[   84.711972]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[   84.713661] RIP: 0010:default_idle+0xb/0x10
[   84.715070] Code: ff f0 80 63 02 df 5b 41 5c c3 0f ae f0 0f ae 3b 0f ae f0 eb 90 66 2e 0f 1f 84 00 00 00 00 00 eb 07 0f 00 2d 37 7d 5a 5
[   84.721579] RSP: 0018:ffffc900000dfe80 EFLAGS: 00000202
[   84.723330] RAX: ffffffffad279280 RBX: ffff8880013f3e00 RCX: ffffffffad269c56
[   84.725724] RDX: 000000000001a8c6 RSI: 0000000000000004 RDI: ffff8880361322a0
[   84.728085] RBP: 0000000000000002 R08: 0000000000000001 R09: ffff8880361322a3
[   84.730485] R10: ffffed1006c26454 R11: 0000000000000001 R12: 0000000000000002
[   84.732814] R13: 0000000000000000 R14: 0000000000000000 R15: 1ffff9200001bfd6
[   84.736325]  ? mwait_idle+0xc0/0xc0
[   84.737488]  ? rcu_eqs_enter.constprop.0+0x86/0xa0
[   84.739085]  default_idle_call+0x53/0x130
[   84.740410]  do_idle+0x311/0x3c0
[   84.741512]  ? _raw_spin_lock_irqsave+0x7b/0xd0
[   84.743060]  ? arch_cpu_idle_exit+0x30/0x30
[   84.744442]  ? swake_up_locked+0x6d/0x80
[   84.745673]  cpu_startup_entry+0x14/0x20
[   84.746657]  secondary_startup_64_no_verify+0xc2/0xcb
[   84.747746]
[   84.748073] Allocated by task 127:
[   84.748742]  kasan_save_stack+0x1b/0x40
[   84.750483]  __kasan_kmalloc+0x7c/0x90
[   84.751359]  alloc_ucounts+0x169/0x2b0
[   84.752270]  set_cred_ucounts+0xbb/0x170
[   84.753266]  ksys_unshare+0x24c/0x4e0
[   84.754194]  __x64_sys_unshare+0x16/0x20
[   84.755332]  do_syscall_64+0x37/0x70
[   84.756503]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   84.758106]
[   84.758625] Freed by task 0:
[   84.759589]  kasan_save_stack+0x1b/0x40
[   84.760841]  kasan_set_track+0x1c/0x30
[   84.762082]  kasan_set_free_info+0x20/0x30
[   84.763421]  __kasan_slab_free+0xeb/0x120
[   84.764736]  kfree+0xaa/0x460
[   84.765722]  put_cred_rcu+0xd5/0x190
[   84.766925]  rcu_core+0x3bf/0xcb0
[   84.768074]  __do_softirq+0xe3/0x341
[   84.769262]
[   84.769765] The buggy address belongs to the object at ffff8880045f0300
[   84.769765]  which belongs to the cache kmalloc-192 of size 192
[   84.773675] The buggy address is located 28 bytes inside of
[   84.773675]  192-byte region [ffff8880045f0300, ffff8880045f03c0)
[   84.778876] The buggy address belongs to the page:
[   84.780338] page:000000008de0a388 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8880045f0000 pfn:0x45f0
[   84.782844] flags: 0x100000000000200(slab|node=0|zone=1)
[   84.784640] raw: 0100000000000200 ffffea00000f4640 0000000a0000000a ffff888001042a00
[   84.787229] raw: ffff8880045f0000 000000008010000d 00000001ffffffff 0000000000000000
[   84.789685] page dumped because: kasan: bad access detected
[   84.791511]
[   84.792037] Memory state around the buggy address:
[   84.793590]  ffff8880045f0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   84.795920]  ffff8880045f0280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[   84.798252] >ffff8880045f0300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   84.800636]                             ^
[   84.801953]  ffff8880045f0380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[   84.804261]  ffff8880045f0400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   84.806622] ==================================================================
[   84.808909] Disabling lock debugging due to kernel taint

Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 kernel/signal.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index de0920353d30..e1e4d6d07f08 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -426,18 +426,30 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
 	rcu_read_lock();
 	ucounts = task_ucounts(t);
 	sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
-	if (sigpending == 1)
-		ucounts = get_ucounts(ucounts);
+	switch (sigpending) {
+	case 1:
+		if (likely(get_ucounts(ucounts)))
+			break;
+		fallthrough;
+	case LONG_MAX:
+		/*
+		 * we need to decrease the ucount in the userns tree on any
+		 * failure to avoid counts leaking.
+		 */
+		dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
+		rcu_read_unlock();
+		return NULL;
+	}
 	rcu_read_unlock();
 
-	if (override_rlimit || (sigpending < LONG_MAX && sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
+	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
 		q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
 	} else {
 		print_dropped_signal(sig);
 	}
 
 	if (unlikely(q == NULL)) {
-		if (ucounts && dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
+		if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
 			put_ucounts(ucounts);
 	} else {
 		INIT_LIST_HEAD(&q->list);
-- 
2.29.3


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ucounts: Fix UCOUNT_RLIMIT_SIGPENDING counter leak
  2021-07-07 17:23                 ` Linus Torvalds
  2021-07-08 10:33                   ` [PATCH v2] " Alexey Gladkov
@ 2021-07-08 11:00                   ` Alexey Gladkov
  1 sibling, 0 replies; 22+ messages in thread
From: Alexey Gladkov @ 2021-07-08 11:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Linux Containers

On Wed, Jul 07, 2021 at 10:23:35AM -0700, Linus Torvalds wrote:
> On Wed, Jul 7, 2021 at 9:50 AM Alexey Gladkov <legion@kernel.org> wrote:
> >
> > >   +             dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> > >   +             fallthrough;
> > >   +     case LONG_MAX:
> >
> > I think that the counter should be decreased in this case too.
> > inc_rlimit_ucounts() increments the counter in all parent userns. If we
> > don't decrease the counter then the parent userns will have a counter
> > leak.
> 
> Ack. So basically that patch, but move the dec_rlimit_ucounts() into
> the LONG_MAX case?

Yep.

> Would you mind making a real patch with a commit message, and trying
> whatever test-case you had for that KASAN use-after-free report?

Sure. I will.

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] Fix UCOUNT_RLIMIT_SIGPENDING counter leak
  2021-07-08 10:33                   ` [PATCH v2] " Alexey Gladkov
@ 2021-07-08 18:44                     ` Linus Torvalds
  0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2021-07-08 18:44 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Linux Containers, Eric W . Biederman,
	Oleg Nesterov

On Thu, Jul 8, 2021 at 3:34 AM Alexey Gladkov <legion@kernel.org> wrote:
>
> We must properly handle an errors when we increase the rlimit counter
> and the ucounts reference counter. We have to this with RCU protection
> to prevent possible use-after-free that could occur due to concurrent
> put_cred_rcu().

Thanks, LGTM. Applied,

            Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-07-08 18:44 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-28 22:35 [GIT PULL] ucounts: Count rlimits in each user namespace Eric W. Biederman
2021-06-29  3:47 ` Linus Torvalds
2021-06-29 15:04   ` Eric W. Biederman
2021-06-29 15:51   ` Eric W. Biederman
2021-06-29 16:34     ` Linus Torvalds
2021-06-29 16:42       ` Eric W. Biederman
2021-06-29 17:09         ` Eric W. Biederman
2021-07-01 16:41           ` Alexey Gladkov
2021-07-01 20:05             ` Eric W. Biederman
2021-06-29 17:17   ` Alexey Gladkov
2021-06-29 18:07     ` Linus Torvalds
2021-06-29 20:20       ` Alexey Gladkov
2021-06-29 20:33         ` Linus Torvalds
2021-06-29 21:22           ` Alexey Gladkov
2021-07-02 17:54           ` [PATCH] ucounts: Fix UCOUNT_RLIMIT_SIGPENDING counter leak Alexey Gladkov
2021-07-02 22:13             ` Linus Torvalds
2021-07-07 16:50               ` Alexey Gladkov
2021-07-07 17:23                 ` Linus Torvalds
2021-07-08 10:33                   ` [PATCH v2] " Alexey Gladkov
2021-07-08 18:44                     ` Linus Torvalds
2021-07-08 11:00                   ` [PATCH] ucounts: " Alexey Gladkov
2021-06-29  3:50 ` [GIT PULL] ucounts: Count rlimits in each user namespace pr-tracker-bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).