linux-kernel.vger.kernel.org archive mirror
* [PATCH] binderfs: implement sysctls
From: Christian Brauner @ 2018-12-21 13:39 UTC
  To: gregkh, tkjos, devel, linux-kernel; +Cc: arve, maco, joel, Christian Brauner

This implements three sysctls that have very specific goals:

1. /proc/sys/fs/binder/max:
   Allow global root to globally limit the number of allocatable binder
   devices.
2. /proc/sys/fs/binder/nr:
   Allow global root to easily detect how many binder devices are currently
   in use across all binderfs mounts.
3. /proc/sys/fs/binder/reserve:
   Ensure that global root can reserve binder devices for the initial
   binderfs mount in the initial ipc namespace to prevent DoS attacks.

This is equivalent to sysctls of devpts.
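
For illustration, global root could then tune these the same way as the
devpts sysctls under /proc/sys/kernel/pty (max, nr, reserve); the values
below are examples only:

  # cap the total number of binder devices across all binderfs mounts
  echo 2048 > /proc/sys/fs/binder/max
  # keep 512 device minors in reserve for the initial binderfs mount
  echo 512 > /proc/sys/fs/binder/reserve
  # read back how many binder devices are currently allocated
  cat /proc/sys/fs/binder/nr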

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 drivers/android/binderfs.c | 81 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 79 insertions(+), 2 deletions(-)

diff --git a/drivers/android/binderfs.c b/drivers/android/binderfs.c
index 7496b10532aa..5ff015f82314 100644
--- a/drivers/android/binderfs.c
+++ b/drivers/android/binderfs.c
@@ -64,6 +64,71 @@ struct binderfs_info {
 
 };
 
+/* Global default limit on the number of binder devices. */
+static int device_limit = 4096;
+
+/*
+ * Number of binder devices reserved for the initial binderfs mount in the
+ * initial ipc namespace.
+ */
+static int device_reserve = 1024;
+
+/* Dummy sysctl minimum. */
+static int device_limit_min;
+
+/* Cap sysctl at BINDERFS_MAX_MINOR. */
+static int device_limit_max = BINDERFS_MAX_MINOR;
+
+/* Current number of allocated binder devices. */
+static atomic_t device_count = ATOMIC_INIT(0);
+
+static struct ctl_table binderfs_table[] = {
+	{
+		.procname	= "max",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.data		= &device_limit,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &device_limit_min,
+		.extra2		= &device_limit_max,
+	},
+	{
+		.procname	= "reserve",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.data		= &device_reserve,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &device_limit_min,
+		.extra2		= &device_limit_max,
+	},
+	{
+		.procname	= "nr",
+		.maxlen		= sizeof(int),
+		.mode		= 0444,
+		.data		= &device_count,
+		.proc_handler	= proc_dointvec,
+	},
+	{}
+};
+
+static struct ctl_table binderfs_fs_table[] = {
+	{
+		.procname	= "binder",
+		.mode		= 0555,
+		.child		= binderfs_table,
+	},
+	{}
+};
+
+static struct ctl_table binderfs_root_table[] = {
+	{
+		.procname	= "fs",
+		.mode		= 0555,
+		.child		= binderfs_fs_table,
+	},
+	{}
+};
+
 static inline struct binderfs_info *BINDERFS_I(const struct inode *inode)
 {
 	return inode->i_sb->s_fs_info;
@@ -107,13 +172,21 @@ static int binderfs_binder_device_create(struct inode *ref_inode,
 	struct inode *inode = NULL;
 	struct super_block *sb = ref_inode->i_sb;
 	struct binderfs_info *info = sb->s_fs_info;
+	bool use_reserved = (info->ipc_ns == &init_ipc_ns);
 
 	/* Reserve new minor number for the new device. */
 	mutex_lock(&binderfs_minors_mutex);
-	minor = ida_alloc_max(&binderfs_minors, BINDERFS_MAX_MINOR, GFP_KERNEL);
+	if (atomic_inc_return(&device_count) <
+	    (device_limit - (use_reserved ? 0 : device_reserve)))
+		minor = ida_alloc_max(&binderfs_minors, BINDERFS_MAX_MINOR,
+				      GFP_KERNEL);
+	else
+		minor = -ENOSPC;
 	mutex_unlock(&binderfs_minors_mutex);
-	if (minor < 0)
+	if (minor < 0) {
+		atomic_dec(&device_count);
 		return minor;
+	}
 
 	ret = -ENOMEM;
 	device = kzalloc(sizeof(*device), GFP_KERNEL);
@@ -187,6 +260,7 @@ static int binderfs_binder_device_create(struct inode *ref_inode,
 	kfree(name);
 	kfree(device);
 	mutex_lock(&binderfs_minors_mutex);
+	atomic_dec(&device_count);
 	ida_free(&binderfs_minors, minor);
 	mutex_unlock(&binderfs_minors_mutex);
 	iput(inode);
@@ -239,6 +313,7 @@ static void binderfs_evict_inode(struct inode *inode)
 		return;
 
 	mutex_lock(&binderfs_minors_mutex);
+	atomic_dec(&device_count);
 	ida_free(&binderfs_minors, device->miscdev.minor);
 	mutex_unlock(&binderfs_minors_mutex);
 
@@ -536,6 +611,8 @@ static int __init init_binderfs(void)
 		binderfs_mnt = NULL;
 		unregister_filesystem(&binder_fs_type);
 		unregister_chrdev_region(binderfs_dev, BINDERFS_MAX_MINOR);
+	} else {
+		register_sysctl_table(binderfs_root_table);
 	}
 
 	return ret;
-- 
2.19.1



* Re: [PATCH] binderfs: implement sysctls
From: Greg KH @ 2018-12-21 13:55 UTC
  To: Christian Brauner; +Cc: tkjos, devel, linux-kernel, joel, arve, maco

On Fri, Dec 21, 2018 at 02:39:09PM +0100, Christian Brauner wrote:
> This implements three sysctls that have very specific goals:

Ick, why?

What are these going to be used for?  Who will "control" them?  As you
are putting them in the "global" namespace, that feels like something
that binderfs was trying to avoid in the first place.

> 1. /proc/sys/fs/binder/max:
>    Allow global root to globally limit the number of allocatable binder
>    devices.

Why?  Who cares?  Memory should be your only limit here, and when you
run into that limit, you will start failing :)

> 2. /proc/sys/fs/binder/nr:
>    Allow global root to easily detect how many binder devices are currently
>    in use across all binderfs mounts.

Why?  Again, who cares?

> 3. /proc/sys/fs/binder/reserve:
>    Ensure that global root can reserve binder devices for the initial
>    binderfs mount in the initial ipc namespace to prevent DoS attacks.

Huh?  Can't you just create your "global root" devices first?  Doesn't
the code do that already anyway?

And what kind of DoS attack could this ever prevent, anyway?

> This is equivalent to sysctls of devpts.

devpts isn't exactly the best thing to emulate :)

> 
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
>  drivers/android/binderfs.c | 81 +++++++++++++++++++++++++++++++++++++-
>  1 file changed, 79 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/android/binderfs.c b/drivers/android/binderfs.c
> index 7496b10532aa..5ff015f82314 100644
> --- a/drivers/android/binderfs.c
> +++ b/drivers/android/binderfs.c
> @@ -64,6 +64,71 @@ struct binderfs_info {
>  
>  };
>  
> +/* Global default limit on the number of binder devices. */
> +static int device_limit = 4096;
> +
> +/*
> + * Number of binder devices reserved for the initial binderfs mount in the
> + * initial ipc namespace.
> + */
> +static int device_reserve = 1024;
> +
> +/* Dummy sysctl minimum. */
> +static int device_limit_min;
> +
> +/* Cap sysctl at BINDERFS_MAX_MINOR. */
> +static int device_limit_max = BINDERFS_MAX_MINOR;
> +
> +/* Current number of allocated binder devices. */
> +static atomic_t device_count = ATOMIC_INIT(0);

You have a lock you are using, just rely on that, don't create
yet-another-type-of-unneeded-lock with an atomic here.

Anyway, I really don't see the need for any of this just yet, so I
didn't read beyond this point in the code :)

thanks,

greg k-h


* Re: [PATCH] binderfs: implement sysctls
From: Christian Brauner @ 2018-12-21 14:12 UTC
  To: Greg KH; +Cc: tkjos, devel, linux-kernel, joel, arve, maco

On Fri, Dec 21, 2018 at 02:55:09PM +0100, Greg KH wrote:
> On Fri, Dec 21, 2018 at 02:39:09PM +0100, Christian Brauner wrote:
> > This implements three sysctls that have very specific goals:
> 
> Ick, why?
> 
> What are these going to be used for?  Who will "control" them?  As you

Only global root in the initial user namespace. See the reasons below. :)

> are putting them in the "global" namespace, that feels like something
> that binderfs was trying to avoid in the first place.

There are a couple of reasons imho:
- Global root needs a way to restrict how many binder devices can be
  allocated across all user + ipc namespace pairs.
  One obvious reason is that otherwise userns root in a non-initial user
  namespace can allocate a huge number of binder devices (pick a random
  number, say 10,000) and use up a lot of kernel memory.
  In addition they can pound on the binder.c code causing a lot of
  contention for the remaining global lock in there.
  We should let global root explicitly restrict non-initial namespaces
  in this respect. Imho, that's just good security design. :)
- The reason for having a number of reserved devices is when the initial
  binderfs mount needs to bump the number of binder devices after the
  initial allocation done during say boot (e.g. it could've removed
  devices and wants to reallocate new ones but all binder minor numbers
  have been given out or just needs additional devices). By reserving an
  initial pool of binder devices this can be easily accounted for and
  future-proofs userspace (see the example below). This is to say: global
  root in the initial userns + ipcns gets dibs on however many devices it
  wants. :)
- The fact that we have a single shared pool of binder device minor
  numbers for all namespaces imho makes it necessary for the global root
  user in the initial ipc + user namespace to manage device allocation
  and delegation.
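
As a concrete example of the reserved-pool semantics (using the default
values from this patch):

  cat /proc/sys/fs/binder/max      # 4096 (default)
  cat /proc/sys/fs/binder/reserve  # 1024 (default)

With these values a non-initial userns + ipcns pair starts seeing ENOSPC
once the global device count reaches max - reserve (4096 - 1024 = 3072),
while the initial binderfs mount can keep allocating up to max.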

The binderfs sysctl stuff is really small code-wise and adds a lot of
security without any performance impact on the code itself. So we
actually very strictly adhere to the requirement to not blindly
sacrifice performance for security. :)

> 
> > 1. /proc/sys/fs/binder/max:
> >    Allow global root to globally limit the number of allocatable binder
> >    devices.
> 
> Why?  Who cares?  Memory should be your only limit here, and when you
> run into that limit, you will start failing :)
> 
> > 2. /proc/sys/fs/binder/nr:
> >    Allow global root to easily detect how many binder devices are currently
> >    in use across all binderfs mounts.
> 
> Why?  Again, who cares?
> 
> > 3. /proc/sys/fs/binder/reserve:
> >    Ensure that global root can reserve binder devices for the initial
> >    binderfs mount in the initial ipc namespace to prevent DoS attacks.
> 
> Huh?  Can't you just create your "global root" devices first?  Doesn't
> the code do that already anyway?
> 
> And what kind of DoS attack could this ever prevent, anyway?
> 
> > This is equivalent to sysctls of devpts.
> 
> devpts isn't exactly the best thing to emulate :)

It's one of its saner design choices though. :)

> 
> > 
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> >  drivers/android/binderfs.c | 81 +++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 79 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/android/binderfs.c b/drivers/android/binderfs.c
> > index 7496b10532aa..5ff015f82314 100644
> > --- a/drivers/android/binderfs.c
> > +++ b/drivers/android/binderfs.c
> > @@ -64,6 +64,71 @@ struct binderfs_info {
> >  
> >  };
> >  
> > +/* Global default limit on the number of binder devices. */
> > +static int device_limit = 4096;
> > +
> > +/*
> > + * Number of binder devices reserved for the initial binderfs mount in the
> > + * initial ipc namespace.
> > + */
> > +static int device_reserve = 1024;
> > +
> > +/* Dummy sysctl minimum. */
> > +static int device_limit_min;
> > +
> > +/* Cap sysctl at BINDERFS_MAX_MINOR. */
> > +static int device_limit_max = BINDERFS_MAX_MINOR;
> > +
> > +/* Current number of allocated binder devices. */
> > +static atomic_t device_count = ATOMIC_INIT(0);
> 
> You have a lock you are using, just rely on that, don't create
> yet-another-type-of-unneeded-lock with an atomic here.
> 
> Anyway, I really don't see the need for any of this just yet, so I
> didn't read beyond this point in the code :)
> 
> thanks,
> 
> greg k-h


* Re: [PATCH] binderfs: implement sysctls
From: Greg KH @ 2018-12-21 15:37 UTC
  To: Christian Brauner; +Cc: devel, tkjos, linux-kernel, arve, joel, maco

On Fri, Dec 21, 2018 at 03:12:42PM +0100, Christian Brauner wrote:
> On Fri, Dec 21, 2018 at 02:55:09PM +0100, Greg KH wrote:
> > On Fri, Dec 21, 2018 at 02:39:09PM +0100, Christian Brauner wrote:
> > > This implements three sysctls that have very specific goals:
> > 
> > Ick, why?
> > 
> > What are these going to be used for?  Who will "control" them?  As you
> 
> Only global root in the initial user namespace. See the reasons below. :)
> 
> > are putting them in the "global" namespace, that feels like something
> > that binderfs was trying to avoid in the first place.
> 
> There are a couple of reasons imho:
> - Global root needs a way to restrict how many binder devices can be
>   allocated across all user + ipc namespace pairs.
>   One obvious reason is that otherwise userns root in a non-initial user
>   namespace can allocate a huge number of binder devices (pick a random
>   number, say 10,000) and use up a lot of kernel memory.

Root can do tons of other bad things too, why are you picking on
binderfs here?  :)

>   In addition they can pound on the binder.c code causing a lot of
>   contention for the remaining global lock in there.

That's the problem of that container, don't let it do that.  Or remove
the global lock :)

>   We should let global root explicitly restrict non-initial namespaces
>   in this respect. Imho, that's just good security design. :)

If you do not trust your container enough to have it properly allocate
the correct binder resources, then perhaps you shouldn't be allowing it
to allocate any resources at all?

> - The reason for having a number of reserved devices is when the initial
>   binderfs mount needs to bump the number of binder devices after the
>   initial allocation done during say boot (e.g. it could've removed
>   devices and wants to reallocate new ones but all binder minor numbers
>   have been given out or just needs additional devices). By reserving an
>   initial pool of binder devices this can be easily accounted for and
>   future-proofs userspace. This is to say: global root in the initial
>   userns + ipcns gets dibs on however many devices it wants. :)

binder devices do not "come and go" at runtime, you need to set them up
initially and then all is fine.  So there should never be a need for the
"global" instance to need "more" binder devices once it is up and
running.  So I don't see what you are really trying to solve here.

You seem to be trying to protect the system from the container you just
gave root to and trusted it with creating its own binder instances.
If you do not trust it to create binder instances then do not allow it
to create binder instances!  :)

> - The fact that we have a single shared pool of binder device minor
>   numbers for all namespaces imho makes it necessary for the global root
>   user in the initial ipc + user namespace to manage device allocation
>   and delegation.

You are managing the allocation: you are giving whoever asks for one a
device.  If you run out of devices, oops, you run out of devices, that's
it.  Are you really ever going to run out of a major's number of binder
devices?

> The binderfs sysctl stuff is really small code-wise and adds a lot of
> security without any performance impact on the code itself. So we
> actually very strictly adhere to the requirement to not blindly
> sacrifice performance for security. :)

But you are adding a brand new user/kernel api by emulating one that is
very old and not the best at all, to try to protect from something that
seems like you can't really "protect" from in the first place.

You now have a mismatch of sysctls, ioctls and file operations all
working on the same logical thing.  And all interacting in different and
uncertain ways.  Are you sure that's wise?

If the binderfs code as-is isn't "safe enough" to use without this, then
we need to revisit it before someone starts to use it...

thanks,

greg k-h


* Re: [PATCH] binderfs: implement sysctls
From: Christian Brauner @ 2018-12-21 15:59 UTC
  To: Greg KH; +Cc: devel, tkjos, linux-kernel, arve, joel, maco

On Fri, Dec 21, 2018 at 04:37:58PM +0100, Greg KH wrote:
> On Fri, Dec 21, 2018 at 03:12:42PM +0100, Christian Brauner wrote:
> > On Fri, Dec 21, 2018 at 02:55:09PM +0100, Greg KH wrote:
> > > On Fri, Dec 21, 2018 at 02:39:09PM +0100, Christian Brauner wrote:
> > > > This implements three sysctls that have very specific goals:
> > > 
> > > Ick, why?
> > > 
> > > What are these going to be used for?  Who will "control" them?  As you
> > 
> > Only global root in the initial user namespace. See the reasons below. :)
> > 
> > > are putting them in the "global" namespace, that feels like something
> > > that binderfs was trying to avoid in the first place.
> > 
> > There are a couple of reasons imho:
> > - Global root needs a way to restrict how many binder devices can be
> >   allocated across all user + ipc namespace pairs.
> >   One obvious reason is that otherwise userns root in a non-initial user
> >   namespace can allocate a huge number of binder devices (pick a random
> >   number, say 10,000) and use up a lot of kernel memory.
> 
> Root can do tons of other bad things too, why are you picking on

That's global root, not userns root, though. :) These sysctls are about
global root gaining the ability to proactively restrict binder device
delegation.

> binderfs here?  :)
> 
> >   In addition they can pound on the binder.c code causing a lot of
> >   contention for the remaining global lock in there.
> 
> That's the problem of that container, don't let it do that.  Or remove
> the global lock :)
> 
> >   We should let global root explicitly restrict non-initial namespaces
> >   in this respect. Imho, that's just good security design. :)
> 
> If you do not trust your container enough to have it properly allocate
> the correct binder resources, then perhaps you shouldn't be allowing it
> to allocate any resources at all?

Containers just like VMs get delegated and you might not have control
over what is running in there. That's AWS in a nutshell. :) Restricting
it by saying "just don't do that" doesn't seem appropriate given the
workloads out there in the wild.
In general, I do *understand* the reasoning but I think the premise is
flawed if we can somewhat trivially make this safe.

> 
> > - The reason for having a number of reserved devices is when the initial
> >   binderfs mount needs to bump the number of binder devices after the
> >   initial allocation done during say boot (e.g. it could've removed
> >   devices and wants to reallocate new ones but all binder minor numbers
> >   have been given out or just needs additional devices). By reserving an
> >   initial pool of binder devices this can be easily accounted for and
> >   future-proofs userspace. This is to say: global root in the initial
> >   userns + ipcns gets dibs on however many devices it wants. :)
> 
> binder devices do not "come and go" at runtime, you need to set them up
> initially and then all is fine.  So there should never be a need for the
> "global" instance to need "more" binder devices once it is up and
> running.  So I don't see what you are really trying to solve here.

That's dismissing a whole range of use-cases where you might allocate
and deallocate devices on the fly, which this is somewhat designed for.
But I guess ok for now.

> 
> You seem to be trying to protect the system from the container you just
> gave root to and trusted it with creating its own binder instances.
> If you do not trust it to create binder instances then do not allow it
> to create binder instances!  :)

Again, I get the reasoning but think that this dismisses major
real-world use-cases, not just for binderfs but for all instances where
untrusted workloads are run, which both containers and VMs aim to
support.
Note, I mean untrusted not in the sense of necessarily being malicious
but just "can't guarantee that things don't blow up in your face".

> 
> > - The fact that we have a single shared pool of binder device minor
> >   numbers for all namespaces imho makes it necessary for the global root
> >   user in the initial ipc + user namespace to manage device allocation
> >   and delegation.
> 
> You are managing the allocation: you are giving whoever asks for one a
> device.  If you run out of devices, oops, you run out of devices, that's
> it.  Are you really ever going to run out of a major's number of binder
> devices?

The point is more about someone intentionally trying to do that.

> 
> > The binderfs sysctl stuff is really small code-wise and adds a lot of
> > security without any performance impact on the code itself. So we
> > actually very strictly adhere to the requirement to not blindly
> > sacrifice performance for security. :)
> 
> But you are adding a brand new user/kernel api by emulating one that is
> very old and not the best at all, to try to protect from something that
> seems like you can't really "protect" from in the first place.

Of course we can protect from that. It's about init root never running
out of devices. We don't care about non-init-userns running out of
devices at all.

> 
> You now have a mismatch of sysctls, ioctls and file operations all
> working on the same logical thing.  And all interacting in different and
> uncertain ways.  Are you sure that's wise?

The sysctl approach is present in other pseudo-filesystems apart from
devpts. For example, mqueue.
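
For instance, on a stock kernel:

  $ ls /proc/sys/fs/mqueue
  msg_default  msg_max  msgsize_default  msgsize_max  queues_max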

> 
> If the binderfs code as-is isn't "safe enough" to use without this, then
> we need to revisit it before someone starts to use it...

*It is safe.* I just don't see a good argument against additional
hardening. *But I'm definitely not going to push this patch if its
additional hardening features are used to push the unsound argument that
binderfs isn't safe.* :)

Christian


* Re: [PATCH] binderfs: implement sysctls
From: Greg KH @ 2018-12-21 16:33 UTC
  To: Christian Brauner; +Cc: devel, tkjos, linux-kernel, arve, joel, maco

On Fri, Dec 21, 2018 at 04:59:19PM +0100, Christian Brauner wrote:
> *It is safe.* I just don't see a good argument against additional
> hardening. *But I'm definitely not going to push this patch if its
> additional hardening features are used to push the unsound argument that
> binderfs isn't safe.* :)

Ok, so what you really want is just "limits" to prevent a container from
doing something really crazy, right?  So, how about a limit of 10 binder
nodes per container?  Make it a kernel build option so it can be changed
by a vendor if they really find that is a problem.
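
Something like the following hypothetical Kconfig knob (name and default
made up for illustration):

  config ANDROID_BINDERFS_MAX_DEVICES
          int "Maximum number of binder devices per binderfs instance"
          default 10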

Would that solve the issue you are thinking might be here?

thanks,

greg k-h


* Re: [PATCH] binderfs: implement sysctls
From: Todd Kjos @ 2018-12-21 16:38 UTC
  To: Greg KH
  Cc: Christian Brauner, open list:ANDROID DRIVERS, Todd Kjos, LKML,
	Arve Hjønnevåg, joel, Martijn Coenen

On Fri, Dec 21, 2018 at 8:33 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> Ok, so what you really want is just "limits" to prevent a container from
> doing something really crazy, right?  So, how about a limit of 10 binder
> nodes per container?  Make it a kernel build option so it can be changed
> by a vendor if they really find that is a problem.
>
> Would that solve the issue you are thinking might be here?

We already have CONFIG_ANDROID_BINDER_DEVICES which specifies 3
devices (binder,hwbinder,vndbinder). How about limiting each container
to no more than the number of devices specified there?
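
For reference, that option is just a comma-separated string of device
names, so the cap would simply be the number of entries in it:

  CONFIG_ANDROID_BINDER_DEVICES="binder,hwbinder,vndbinder"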


* Re: [PATCH] binderfs: implement sysctls
From: Christian Brauner @ 2018-12-21 19:20 UTC
  To: Greg KH; +Cc: devel, tkjos, linux-kernel, arve, joel, maco

On Fri, Dec 21, 2018 at 05:33:16PM +0100, Greg KH wrote:
> Ok, so what you really want is just "limits" to prevent a container from
> doing something really crazy, right?  So, how about a limit of 10 binder
> nodes per container?  Make it a kernel build option so it can be changed
> by a vendor if they really find that is a problem.
> 
> Would that solve the issue you are thinking might be here?

Actually, you prompted me to reason through this with a couple of people,
and one thing I totally forgot, which was pointed out to me, is that we
can limit the container's memory consumption via the memory cgroup
(example below). So consuming all kernel memory wouldn't be possible
with this. And if you don't use the memory cgroup you're hosed in a
variety of other ways anyway. :) So that attack is gone, which means we
can do away with a global limit. Which I very much prefer! Thanks for
making me look harder into the attack surface!
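
E.g. with the cgroup v1 memory controller (container path illustrative):

  echo 1G > /sys/fs/cgroup/memory/container1/memory.limit_in_bytes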

What still makes sense is a per-instance limit, i.e. a max=<count> mount
option. So the initial root user can mount binderfs:

mkdir /dev/binderfs
mount -t binder binder -o max=1024 /dev/binderfs

grant ACLs or whatever and bind-mount the whole binderfs host instance
into another user + ipc namespace. This way there's a hard cap at "max"
since the "max" mount option cannot be changed by root in a non-initial
user namespace. That secures the use-case where users don't want to have
multiple binderfs mounts but rather use a single one, which some
requested AFAIR.
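
The delegation step would then be something like (container path
illustrative):

  mount --bind /dev/binderfs /var/lib/containers/c1/rootfs/dev/binderfs

Root in that user namespace can still create devices through the
binder-control device but can never raise "max", since the mount options
were fixed by the initial mount.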

Thanks!
Christian
