All of lore.kernel.org
 help / color / mirror / Atom feed
From: Neel Natu <neelnatu@google.com>
To: Yury Norov <yury.norov@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched, cpumask: don't leak impossible cpus via for_each_cpu_wrap().
Date: Wed, 3 Aug 2022 14:21:46 -0700	[thread overview]
Message-ID: <CAJDe-O+7mqqrGvWimztYgEqDDqJGH=EA1NgJyiJjnurLASOQ8w@mail.gmail.com> (raw)
In-Reply-To: <YurEAU5SlQaZ85vZ@yury-laptop>

On Wed, Aug 3, 2022 at 11:52 AM Yury Norov <yury.norov@gmail.com> wrote:
>
> On Wed, Aug 03, 2022 at 10:49:57AM -0700, Neel Natu wrote:
> > On Tue, Aug 2, 2022 at 6:22 PM Yury Norov <yury.norov@gmail.com> wrote:
> > >
> > > On Tue, Aug 2, 2022 at 2:41 PM Neel Natu <neelnatu@google.com> wrote:
> > > >
> > > > The value of 'nr_cpumask_bits' is dependent on CONFIG_CPUMASK_OFFSTACK.
> > > > This in turn can change the set of cpus visited by for_each_cpu_wrap()
> > > > with a mask that has bits set in the range [nr_cpu_ids, NR_CPUS).
> > > >
> > > > Specifically on !CONFIG_CPUMASK_OFFSTACK kernels the API can iterate
> > > > over cpus outside the 'cpu_possible_mask'.
> > > >
> > > > Fix this to make its behavior match for_each_cpu() which always limits
> > > > the iteration to the range [0, nr_cpu_ids).
> > > >
> > > > Signed-off-by: Neel Natu <neelnatu@google.com>
> > >
> > > The patch itself doesn't look correct because it randomly switches a piece
> > > of cpumask API from nr_cpumask_bits to nr_cpu_ids, and doesn't touch
> > > others.
> > >
> > > However...
> > >
> > > I don't know the story behind having 2 variables holding the max possible
> > > number of cpus, and it looks like it dates back to prehistoric times.  In
> > > modern kernel, there are 2 cases where nr_cpumask_bits == nr_cpu_ids
> > > for sure: it's CONFIG_CPUMASK_OFFSTACK=y and
> > > CONFIG_HOTPLUG_CPU=y. At least one of those is enabled in defconfig
> > > of every popular architecture.
> > >
> >
> > Hmm, in a kernel with CONFIG_HOTPLUG_CPU=y but not CONFIG_CPUMASK_OFFSTACK
> > I see "nr_cpu_ids = 20, nr_cpumask_bits = 512". FYI since it doesn't
> > match the observation
> > above that nr_cpumask_bits == nr_cpu_ids when CONFIG_HOTPLUG_CPU=y.
>
> I said this because the comment in include/linux/cpumaks.h says:
>  *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
>  *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
>  *  ACPI reports present at boot.
>
> And because of the code that sets nr_cpu_ids:
>
> void __init setup_nr_cpu_ids(void)
> {
>         nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
> }
>
> Some arches override it, so it may be an issue. Are you running x86,
> or maybe you have "nr_cpus" boot parameter?
>
> But anyways, I feel like this should be investigated for more... Can you
> please share the config of your system and boot params?
>

Yeah, this is a stock defconfig compiled on an x86_64 host and booted
inside a 20 vcpu virtual machine (virtme). There are no command line
params to modify the number of cpus.

I think everything is working as expected based on some debug prints I
added during boot:
[    0.641798] smp: setup_nr_cpu_ids: nr_cpu_ids 20, cpu_possible_mask 0-19
[    0.648424] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64
nr_cpu_ids:20 nr_node_ids:2

The first one is from setup_nr_cpu_ids() in kernel/smp.c. The second
one is from setup_per_cpu_areas() from arch/x86/setup_percpu.c.

> > > In case of HOTPLUG is off, I don't understand why we should have nr_cpu_ids
> > > and nr_cpumask_bits different - what case should it cover?... Interestingly, in
> > > comments to cpumask functions and in the code those two are referred
> > > interchangeably.
> > >
> > > Even more interestingly, we have a function bitmap_setall() that sets all bits
> > > up to nr_cpumask_bits, and it could trigger the problem that you described,
> >
> > I think you mean cpumask_setall() that in turn calls
> > bitmap_fill(nr_cpumask_bits)?
>
> Yes, sorry.
>
> > > so that someone would complain. (Are there any other valid reasons to set
> > > bits behind nr_cpu_ids intentionally?)
> > >
> >
> > I don't know of any although this wasn't the case that trigger in my case.
> >
> > > Can you share more details about how you triggered that? If you observe
> > > those bits set, something else is probably already wrong...
> >
> > The non-intuitive behavior of for_each_cpu_wrap() was triggered when iterating
> > over a cpumask passed by userspace that set a bit in the [nr_cpu_ids,
> > nr_cpumask_bits)
> > range.
>
> If you receive bitmap from userspace, you need to copy it with
> bitmap_copy_clear_tail(), or bitmap_from_arr{32,64}, as appropriate.
> That will clear unneeded bits.
>
> For user-to-kernel communications, I'd recommend passing bitmaps in a
> human-readable formats - hex string or bitmap list. Check the functions
> cpumask_parse_user() and cpumask_parselist_user(). This would help to
> avoid all possible issues related to endianness and 32/64 word length
> mismatch.
>
> > The kernel code is out of tree but open source so happy to provide a
> > pointer if needed.
>
> Yes please
>

Here is where we copy the user supplied cpumask using the
get_user_cpu_mask() helper:
https://github.com/google/ghost-kernel/blob/c21b36f87663efa2189876b2caa701fe74e72adf/kernel/sched/ghost.c#L5729

For performance reasons we cannot use human readable cpu masks in this
code path.

Note that the helper copies up to 'nr_cpumask_bits' which in some
kernels (!CONFIG_CPUMASK_OFFSTACK) can copy bits beyond 'nr_cpu_ids'.
A possible option could be to fix this helper but I do feel that
for_each_cpu() and for_each_cpu_wrap() should visit the same set of
cpus given the same cpumask (ordering can be different but the set of
cpus should remain the same).

What do you think?

best
Neel

> > I considered ANDing the user supplied mask with 'cpu_possible_mask'
> > but that felt
> > like working around an inconsistency in the kernel API (hence the proposed fix).
> >
> > > So, if there is no real case in modern kernel to have nr_cpumask_bits and
> > > nr_cpu_ids different, the proper fix would be to just drop the first.
> > >
> >
> > I'll let other people more knowledgable than me in this area chime in.
> > I'll be happy either
> > way if that fixes the problem at hand.
> >
> > best
> > Neel
> >
> > > If there is such a case, this is probably your case, and we'd know more
> > > details to understand how to deal with that.
> > >
> > > Thanks,
> > > Yury

  reply	other threads:[~2022-08-03 21:22 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-02 21:40 [PATCH] sched, cpumask: don't leak impossible cpus via for_each_cpu_wrap() Neel Natu
2022-08-03  1:22 ` Yury Norov
2022-08-03 17:49   ` Neel Natu
2022-08-03 18:52     ` Yury Norov
2022-08-03 21:21       ` Neel Natu [this message]
2022-08-03 23:51         ` Yury Norov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAJDe-O+7mqqrGvWimztYgEqDDqJGH=EA1NgJyiJjnurLASOQ8w@mail.gmail.com' \
    --to=neelnatu@google.com \
    --cc=andriy.shevchenko@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@rasmusvillemoes.dk \
    --cc=peterz@infradead.org \
    --cc=yury.norov@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.