LKML Archive on lore.kernel.org
 help / color / Atom feed
From: Greg KH <gregkh@linuxfoundation.org>
To: Christian Brauner <christian@brauner.io>
Cc: devel@driverdev.osuosl.org, tkjos@android.com,
	linux-kernel@vger.kernel.org, arve@android.com,
	joel@joelfernandes.org, maco@android.com
Subject: Re: [PATCH] binderfs: implement sysctls
Date: Fri, 21 Dec 2018 17:33:16 +0100
Message-ID: <20181221163316.GA8517@kroah.com> (raw)
In-Reply-To: <20181221155918.b3sgv2adx5i74r6i@brauner.io>

On Fri, Dec 21, 2018 at 04:59:19PM +0100, Christian Brauner wrote:
> On Fri, Dec 21, 2018 at 04:37:58PM +0100, Greg KH wrote:
> > On Fri, Dec 21, 2018 at 03:12:42PM +0100, Christian Brauner wrote:
> > > On Fri, Dec 21, 2018 at 02:55:09PM +0100, Greg KH wrote:
> > > > On Fri, Dec 21, 2018 at 02:39:09PM +0100, Christian Brauner wrote:
> > > > > This implements three sysctls that have very specific goals:
> > > > 
> > > > Ick, why?
> > > > 
> > > > What are these going to be used for?  Who will "control" them?  As you
> > > 
> > > Only global root in the initial user namespace. See the reasons below. :)
> > > 
> > > > are putting them in the "global" namespace, that feels like something
> > > > that binderfs was trying to avoid in the first place.
> > > 
> > > There are a couple of reason imho:
> > > - Global root needs a way to restrict how many binder devices can be
> > >   allocated across all user + ipc namespace pairs.
> > >   One obvious reason is that otherwise userns root in a non-initial user
> > >   namespace can allocate a huge number of binder devices (pick a random
> > >   number say 10.000) and use up a lot of kernel memory.
> > 
> > Root can do tons of other bad things too, why are you picking on
> 
> That's global root not userns root though. :) These sysctls are about
> global root gaining the ability to proactively restrict binder device
> delegation.
> 
> > binderfs here?  :)
> > 
> > >   In addition they can pound on the binder.c code causing a lot of
> > >   contention for the remaining global lock in there.
> > 
> > That's the problem of that container, don't let it do that.  Or remove
> > the global lock :)
> > 
> > >   We should let global root explicitly restrict non-initial namespaces
> > >   in this respect. Imho, that's just good security design. :)
> > 
> > If you do not trust your container enough to have it properly allocate
> > the correct binder resources, then perhaps you shouldn't be allowing it
> > to allocate any resources at all?
> 
> Containers just like VMs get delegated and you might not have control
> over what is running in there. That's AWS in a nutshell. :) Restricting
> it by saying "just don't do that" seems not something that is
> appropriate given the workloads out there in the wild.
> In general, I do *understand* the reasoning but I think the premise is
> flawed if we can somewhat trivially make this safe.
> 
> > 
> > > - The reason for having a number of reserved devices is when the initial
> > >   binderfs mount needs to bump the number of binder devices after the
> > >   initial allocation done during say boot (e.g. it could've removed
> > >   devices and wants to reallocate new ones but all binder minor numbers
> > >   have been given out or just needs additional devices). By reserving an
> > >   initial pool of binder devices this can be easily accounted for and
> > >   future proofs userspace. This is to say: global root in the initial
> > >   userns + ipcns gets dibs on however many devices it wants. :)
> > 
> > binder devices do not "come and go" at runtime, you need to set them up
> > initially and then all is fine.  So there should never be a need for the
> > "global" instance to need "more" binder devices once it is up and
> > running.  So I don't see what you are really trying to solve here.
> 
> That's dismissing a whole range of use-cases where you might allocate
> and deallocate devices on the fly which this is somewhat designed for.
> But I guess ok for now.
> 
> > 
> > You seem to be trying to protect the system from the container you just
> > gave root to and trusted it with creating its own binder instances.
> > If you do not trust it to create binder instances then do not allow it
> > to create binder instances!  :)
> 
> Again, I get the reasoning but think that this dismisses major
> real-world use-cases not just for binderfs but for all instances where
> untrusted workloads are run which both containers and VMs aim to make
> sure are possible.
> Note, I mean untrusted not in the sense of necessarily being malicious
> but just "can't guarantee that things don't blow up in your face".
> 
> > 
> > > - The fact that we have a single shared pool of binder device minor
> > >   numbers for all namespaces imho makes it necessary for the global root
> > >   user in the initial ipc + user namespace to manage device allocation
> > >   and delegation.
> > 
> > You are managing the allocation, you are giving who ever asks for one a
> > device.  If you run out of devices, oops, you run out of devices, that's
> > it.  Are you really ever going to run out of a major's number of binder
> > devices?
> 
> The point is more about someone intentionally trying to do that.
> 
> > 
> > > The binderfs sysctl stuff is really small code-wise and adds a lot of
> > > security without any performance impact on the code itself. So we
> > > actually very strictly adhere to the requirement to not blindly
> > > sacrifice performance for security. :)
> > 
> > But you are adding a brand new user/kernel api by emulating one that is
> > very old and not the best at all, to try to protect from something that
> > seems like you can't really "protect" from in the first place.
> 
> Of course we can protect from that. It's about init root never running
> out of devices. We don't care about non-init-userns running out of
> devices at all.
> 
> > 
> > You now have a mis-match of sysctls, ioctls and file operations all
> > working on the same logical thing.  And all interacting in different and
> > uncertian ways.  Are you sure that's wise?
> 
> The sysctl approach is present in other pseudo-filesystems apart from
> devpts. For example, mqueue.
> 
> > 
> > If the binderfs code as-is isn't "safe enough" to use without this, then
> > we need to revisit it before someone starts to use it...
> 
> *It is safe.* I just don't see a good argument against additional
> hardening. *But I'm definitely not going to push this patch if it's
> additional hardening features are used to push the unsound argument that
> binderfs isn't safe.* :)

Ok, so what you really want is just "limits" to prevent a container from
doing something really crazy, right?  So, how about a limit of 10 binder
nodes per container?  Make it a kernel build option so it can be changed
by a vendor if they really find that is a problem.

Would that solve the issue you are thinking might be here?

thanks,

greg k-h

  reply index

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-21 13:39 Christian Brauner
2018-12-21 13:55 ` Greg KH
2018-12-21 14:12   ` Christian Brauner
2018-12-21 15:37     ` Greg KH
2018-12-21 15:59       ` Christian Brauner
2018-12-21 16:33         ` Greg KH [this message]
2018-12-21 16:38           ` Todd Kjos
2018-12-21 19:20           ` Christian Brauner

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181221163316.GA8517@kroah.com \
    --to=gregkh@linuxfoundation.org \
    --cc=arve@android.com \
    --cc=christian@brauner.io \
    --cc=devel@driverdev.osuosl.org \
    --cc=joel@joelfernandes.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maco@android.com \
    --cc=tkjos@android.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git