All of lore.kernel.org
 help / color / mirror / Atom feed
From: Gregory Price <gregory.price@memverge.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Gregory Price <gourry.memverge@gmail.com>,
	linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
	linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org,
	arnd@arndb.de, tglx@linutronix.de, luto@kernel.org,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org,
	ying.huang@intel.com
Subject: Re: [RFC PATCH 00/11] mm/mempolicy: Make task->mempolicy externally modifiable via syscall and procfs
Date: Wed, 22 Nov 2023 17:24:10 -0500	[thread overview]
Message-ID: <ZV5/ilfUoqC2PW0D@memverge.com> (raw)
In-Reply-To: <20231122133348.d27c09a90bce755dc1c0f251@linux-foundation.org>

On Wed, Nov 22, 2023 at 01:33:48PM -0800, Andrew Morton wrote:
> On Wed, 22 Nov 2023 16:11:49 -0500 Gregory Price <gourry.memverge@gmail.com> wrote:
> 
> > The patch set changes task->mempolicy to be modifiable by tasks other
> > than just current.
> > 
> > The ultimate goal is to make mempolicy more flexible and extensible,
> > such as adding interleave weights (which may need to change at runtime
> > due to hotplug events).  Making mempolicy externally modifiable allows
> > for userland daemons to make runtime performance adjustments to running
> > tasks without that software needing to be made numa-aware.
> 
> Please add to this [0/N] a full description of the security aspect: who
> can modify whose mempolicy, along with a full description of the
> reasoning behind this decision.
> 

Will do. For the sake of v0 for now:

1) the task itself (task == current)
   for obvious reasons: it already can

2) from external interfaces: CAP_SYS_NICE

There might be an argument for CAP_SYS_ADMIN, but CAP_SYS_NICE has
access to scheduling controls, and mbind uses CAP_SYS_NICE to validate
whether shared pages can be migrated.  The same is true of migrate_pages
and other memory management controls.  For this reason, I chose to gate
the task syscalls behind CAP_SYS_NICE unless (task == current).

I'm by no means an expert in this area, so slap away if i'm egregiously
wrong here.

I will add additional security context to v2 as to what impacts changing
a mempolicy can have at runtime.  This will mostly be related to cpusets
implications, as mempolicy itself is not a "constraining" interface in
terms of security. For example: one can mbind/interleave/whatever a set
of nodes, and then use migrate_pages or move_pages to violate that
mempolicy. This is explicitly allowed and discussed in the
implementation of the existing syscalls / libnuma.

However, if cpusets must be respected.

This is why i refactored out replace_mempolicy and reused it, because
this enforcement is already handled by checking task->mems_allowed.

> > 3. Add external interfaces which allow for a task mempolicy to be
> >    modified by another task.  This is implemented in 4 syscalls
> >    and a procfs interface:
> >         sys_set_task_mempolicy
> >         sys_get_task_mempolicy
> >         sys_set_task_mempolicy_home_node
> >         sys_task_mbind
> >         /proc/[pid]/mempolicy
> 
> Why is the procfs interface needed?  Doesn't it simply duplicate the
> syscall interface?  Please update [0/N] with a description of this
> decision.
> 

Honestly I wrote the procfs interface first, and then came back around
to just implement the syscalls.  mbind is not friendly to being procfs'd
so if the preference is to have only one, not both, then it should
probably be the syscalls.

That said, when I introduce weighted interleave on top of this, having a
simple procfs interface to those weights would be valuable, so I
imagined something like `proc/mempolicy` to determine if interleave was
being used and something like `proc/mpol_interleave_weights` for a clean
interface to update weights.

However, in the same breath, I have a prior RFC with set/get_mempolicy2
which could probably take all future mempolicy extensions and wrap them
up into one pair of syscalls, instead of us ending up with 200 more
sys_mempolicy_whatever as memory attached fabrics become more common.

So... yeah... the is one area I think the community very much needs to
comment:  set/get_mempolicy2, many new mempolicy syscalls, procfs? All
of the above?

The procfs route provides a command-line user a nice, clean way to
update policies without the need for an additional tool, but if there is
an "all or nothing" preference on mempolicy controls - then procfs is
probably not the way to go.

This RFC at least shows there are options. I very much welcome input in
this particular area.

> > The new syscalls are the same as their current-task counterparts,
> > except that they take a pid as an argument.  The exception is
> > task_mbind, which required a new struct due to the number of args.
> > 
> > The /proc/pid/mempolicy re-uses the interface mpol_parse_str format
> > to enable get/set of mempolicy via procsfs.
> > 
> > mpol_parse_str format:
> >             <mode>[=<flags>][:<nodelist>]
> > 
> > Example usage:
> > 
> > echo "default" > /proc/pid/mempolicy
> > echo "prefer=relative:0" > /proc/pid/mempolicy
> > echo "interleave:0-3" > /proc/pid/mempolicy
> 
> What do we get when we read from this?  Please add to changelog.
>
> Also a description of the permissions for this procfs file, along with
> reasoning.  If it has global readability, and there's something
> interesting in there, let's show that the security implications have
> been fully considered.
> 

Ah, should have included that.  Will add.  For the sake of v0:

Current permissions: (S_IRUSR|S_IWUSR)
Which presumes the owner and obviosly root.  Tried parity the syscall.

the total set of (current) policy outputs are:

"default"
"local"
"prefer:node"
"prefer=static:node"
"prefer=relative:node"
"prefer (many):nodelist"
"prefer (many)=static:nodelist"
"prefer (many)=relative:nodelist"
"interleave:nodelist"
"interleave=static:nodelist"
"interleave=relative:nodelist"
"bind:nodelist"
"bind=static:nodelist"
"bind=relative:nodelist"

There doesn't seem to be much of a security implication here, at least
not anything that can't already be gleaned via something like numa_maps,
but it does provide *some* level of memory placement imformation, so
it's still probably best gated behind owner/root.

That said, changing this policy may not imply it is actually used,
because individual VMA policies can override this policy. So it really
doesn't provide much info at all.

Something I just noticed: mpol_parse_str does not presently support the
numa balancing flag, so that would have to be added to achieve parity
with the set_mempolicy syscall.

> > Changing the mempolicy does not induce memory migrations via the
> > procfs interface (which is the exact same behavior as set_mempolicy).
> > 
>

Thanks for taking a quick look!
~Gregory

  parent reply	other threads:[~2023-11-22 22:24 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-22 21:11 [RFC PATCH 00/11] mm/mempolicy: Make task->mempolicy externally modifiable via syscall and procfs Gregory Price
2023-11-22 21:11 ` [RFC PATCH 01/11] mm/mempolicy: refactor do_set_mempolicy for code re-use Gregory Price
2023-11-22 21:11 ` [RFC PATCH 02/11] mm/mempolicy: swap cond reference counting logic in do_get_mempolicy Gregory Price
2023-11-28 14:07   ` Michal Hocko
     [not found]     ` <ZWX0ytAwmOdooHdZ@memverge.com>
2023-11-28 14:28       ` Michal Hocko
2023-11-22 21:11 ` [RFC PATCH 03/11] mm/mempolicy: refactor set_mempolicy stack to take a task argument Gregory Price
2023-11-22 21:11 ` [RFC PATCH 04/11] mm/mempolicy: modify get_mempolicy call " Gregory Price
2023-11-28 14:07   ` Michal Hocko
     [not found]     ` <ZWX1U1gCTXC+lFXn@memverge.com>
2023-11-28 14:49       ` Michal Hocko
2023-11-22 21:11 ` [RFC PATCH 05/11] mm/mempolicy: modify set_mempolicy_home_node " Gregory Price
2023-11-28 14:07   ` Michal Hocko
2023-11-28 14:14     ` Gregory Price
2023-11-22 21:11 ` [RFC PATCH 06/11] mm/mempolicy: modify do_mbind to operate on task argument instead of current Gregory Price
2023-11-28 14:11   ` Michal Hocko
2023-11-28 14:51     ` Gregory Price
2023-11-28 18:08     ` Gregory Price
2023-11-22 21:11 ` [RFC PATCH 07/11] mm/mempolicy: add task mempolicy syscall variants Gregory Price
2023-11-24  7:56   ` kernel test robot
2023-11-24  7:56   ` kernel test robot
2023-11-24  7:57   ` kernel test robot
2023-11-22 21:11 ` [RFC PATCH 08/11] mm/mempolicy: export replace_mempolicy for use by procfs Gregory Price
2023-11-22 21:11 ` [RFC PATCH 09/11] mm/mempolicy: build mpol_parse_str unconditionally Gregory Price
2023-11-22 21:11 ` [RFC PATCH 10/11] mm/mempolicy: mpol_parse_str should ignore trailing characters in nodelist Gregory Price
2023-11-22 21:12 ` [RFC PATCH 11/11] fs/proc: Add mempolicy attribute to allow read/write of task mempolicy Gregory Price
2023-11-22 21:33 ` [RFC PATCH 00/11] mm/mempolicy: Make task->mempolicy externally modifiable via syscall and procfs Andrew Morton
2023-11-22 21:35   ` Andrew Morton
2023-11-22 22:24   ` Gregory Price [this message]
2023-11-27 15:29     ` Michal Hocko
2023-11-27 16:14       ` Gregory Price
2023-11-28  9:45         ` Michal Hocko
2023-11-28 13:15           ` Gregory Price

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZV5/ilfUoqC2PW0D@memverge.com \
    --to=gregory.price@memverge.com \
    --cc=akpm@linux-foundation.org \
    --cc=arnd@arndb.de \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=gourry.memverge@gmail.com \
    --cc=hpa@zytor.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mhocko@kernel.org \
    --cc=mingo@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=x86@kernel.org \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.