All of lore.kernel.org
 help / color / mirror / Atom feed
* Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
@ 2012-11-06  2:38 Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-11-06  2:38 UTC (permalink / raw)
  To: Aristeu Rozanski, Serge E. Hallyn
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Hello, guys.

Why doesn't it follow the usual security enforced by cgroupfs
permissions?  Why is the explicit check necessary?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]                                       ` <20121106181623.GO30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2012-11-06 18:25                                         ` Serge Hallyn
  0 siblings, 0 replies; 37+ messages in thread
From: Serge Hallyn @ 2012-11-06 18:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, Aristeu Rozanski

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Hello, Serge.
> 
> On Tue, Nov 06, 2012 at 12:12:14PM -0600, Serge Hallyn wrote:
> > Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> > > On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote:
> > > > We can't generally require a capability to move tasks between cgroups,
> > > > as that will break currently intended uses.  I can create two cgroups,
> > > > chown them to serge, and let serge move between them.
> > > 
> > > Sure, then just live with the cgroupfs based permission check.  What
> > > next?  Should we add CAP_SYS_RESOURCE check to all resource related
> > 
> > That would be the next step, yes.
> 
> Hmmm... I don't know.  What does that buy us?  Fine grained CAP_* is
> to be able to give away privliedges in smaller pieces, right?  If
> we're gonna be requiring root to modify cgroup, what difference does
> it make to have finer grained CAPs specified?
> 
> > > controllers?  Moreover, We're headed to unified hierarchy, so in the
> > > end that means only the user with almost all CAP_* can manipulate
> > > cgroups at all making the whole thing meaningless.
> > 
> > Why meaningless?  Many caps needed to "do everything", but moving
> > a task into the freezer and freezing it, or reducing its allowed
> > memory, would only requiring uid equiv or some limited bit of
> > privilege.
> 
> All controllers will be co-mounted and you'll need all CAPs needed by

Oh.  I thought each controller could only be mounted once, but not
that all must be co-mounted.

Jinkeys.

> all controllers to move tasks between cgroups and there won't be an
> easy way to tell which CAP is missing.  Doesn't sound too useful to
> me.
> 
> > > I don't think applying fine-grained CAP_* to cgroup controllers makes
> > > sense or would be useful in any real sense.  We can introduce, say,
> > > CAP_CGROUP to control access cgroupfs but I think we already have
> > > enough access control to cgroupfs, don't we?
> > 
> > That's the question :)
> > 
> > I feel like we need a list of the various uses people have in mind,
> > so we can figure out which ones are supportable...  but I know there
> > is the whole long thread I've not had time to keep up with, and
> > many answers are probably there.
> 
> Care to give a pointer?

I mean the recent thread you started on cgroup cleanups :)

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
  2012-11-06 18:12                                   ` Serge Hallyn
  2012-11-06 18:16                                     ` Tejun Heo
@ 2012-11-06 18:16                                     ` Tejun Heo
  1 sibling, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 18:16 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, Aristeu Rozanski

Hello, Serge.

On Tue, Nov 06, 2012 at 12:12:14PM -0600, Serge Hallyn wrote:
> Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> > On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote:
> > > We can't generally require a capability to move tasks between cgroups,
> > > as that will break currently intended uses.  I can create two cgroups,
> > > chown them to serge, and let serge move between them.
> > 
> > Sure, then just live with the cgroupfs based permission check.  What
> > next?  Should we add CAP_SYS_RESOURCE check to all resource related
> 
> That would be the next step, yes.

Hmmm... I don't know.  What does that buy us?  Fine grained CAP_* is
to be able to give away privliedges in smaller pieces, right?  If
we're gonna be requiring root to modify cgroup, what difference does
it make to have finer grained CAPs specified?

> > controllers?  Moreover, We're headed to unified hierarchy, so in the
> > end that means only the user with almost all CAP_* can manipulate
> > cgroups at all making the whole thing meaningless.
> 
> Why meaningless?  Many caps needed to "do everything", but moving
> a task into the freezer and freezing it, or reducing its allowed
> memory, would only requiring uid equiv or some limited bit of
> privilege.

All controllers will be co-mounted and you'll need all CAPs needed by
all controllers to move tasks between cgroups and there won't be an
easy way to tell which CAP is missing.  Doesn't sound too useful to
me.

> > I don't think applying fine-grained CAP_* to cgroup controllers makes
> > sense or would be useful in any real sense.  We can introduce, say,
> > CAP_CGROUP to control access cgroupfs but I think we already have
> > enough access control to cgroupfs, don't we?
> 
> That's the question :)
> 
> I feel like we need a list of the various uses people have in mind,
> so we can figure out which ones are supportable...  but I know there
> is the whole long thread I've not had time to keep up with, and
> many answers are probably there.

Care to give a pointer?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
  2012-11-06 18:12                                   ` Serge Hallyn
@ 2012-11-06 18:16                                     ` Tejun Heo
       [not found]                                       ` <20121106181623.GO30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2012-11-06 18:16                                     ` Tejun Heo
  1 sibling, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 18:16 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Eric W. Biederman, Aristeu Rozanski,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Hello, Serge.

On Tue, Nov 06, 2012 at 12:12:14PM -0600, Serge Hallyn wrote:
> Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> > On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote:
> > > We can't generally require a capability to move tasks between cgroups,
> > > as that will break currently intended uses.  I can create two cgroups,
> > > chown them to serge, and let serge move between them.
> > 
> > Sure, then just live with the cgroupfs based permission check.  What
> > next?  Should we add CAP_SYS_RESOURCE check to all resource related
> 
> That would be the next step, yes.

Hmmm... I don't know.  What does that buy us?  Fine grained CAP_* is
to be able to give away privliedges in smaller pieces, right?  If
we're gonna be requiring root to modify cgroup, what difference does
it make to have finer grained CAPs specified?

> > controllers?  Moreover, We're headed to unified hierarchy, so in the
> > end that means only the user with almost all CAP_* can manipulate
> > cgroups at all making the whole thing meaningless.
> 
> Why meaningless?  Many caps needed to "do everything", but moving
> a task into the freezer and freezing it, or reducing its allowed
> memory, would only requiring uid equiv or some limited bit of
> privilege.

All controllers will be co-mounted and you'll need all CAPs needed by
all controllers to move tasks between cgroups and there won't be an
easy way to tell which CAP is missing.  Doesn't sound too useful to
me.

> > I don't think applying fine-grained CAP_* to cgroup controllers makes
> > sense or would be useful in any real sense.  We can introduce, say,
> > CAP_CGROUP to control access cgroupfs but I think we already have
> > enough access control to cgroupfs, don't we?
> 
> That's the question :)
> 
> I feel like we need a list of the various uses people have in mind,
> so we can figure out which ones are supportable...  but I know there
> is the whole long thread I've not had time to keep up with, and
> many answers are probably there.

Care to give a pointer?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]                                 ` <20121106173823.GK30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2012-11-06 17:41                                   ` Tejun Heo
  2012-11-06 18:12                                   ` Serge Hallyn
@ 2012-11-06 18:12                                   ` Serge Hallyn
  2 siblings, 0 replies; 37+ messages in thread
From: Serge Hallyn @ 2012-11-06 18:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, Aristeu Rozanski

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Hello,
> 
> On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote:
> > We can't generally require a capability to move tasks between cgroups,
> > as that will break currently intended uses.  I can create two cgroups,
> > chown them to serge, and let serge move between them.
> 
> Sure, then just live with the cgroupfs based permission check.  What
> next?  Should we add CAP_SYS_RESOURCE check to all resource related

That would be the next step, yes.

> controllers?  Moreover, We're headed to unified hierarchy, so in the
> end that means only the user with almost all CAP_* can manipulate
> cgroups at all making the whole thing meaningless.

Why meaningless?  Many caps needed to "do everything", but moving
a task into the freezer and freezing it, or reducing its allowed
memory, would only requiring uid equiv or some limited bit of
privilege.

> I don't think applying fine-grained CAP_* to cgroup controllers makes
> sense or would be useful in any real sense.  We can introduce, say,
> CAP_CGROUP to control access cgroupfs but I think we already have
> enough access control to cgroupfs, don't we?

That's the question :)

I feel like we need a list of the various uses people have in mind,
so we can figure out which ones are supportable...  but I know there
is the whole long thread I've not had time to keep up with, and
many answers are probably there.

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]                                 ` <20121106173823.GK30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2012-11-06 17:41                                   ` Tejun Heo
@ 2012-11-06 18:12                                   ` Serge Hallyn
  2012-11-06 18:16                                     ` Tejun Heo
  2012-11-06 18:16                                     ` Tejun Heo
  2012-11-06 18:12                                   ` Serge Hallyn
  2 siblings, 2 replies; 37+ messages in thread
From: Serge Hallyn @ 2012-11-06 18:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric W. Biederman, Aristeu Rozanski,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Hello,
> 
> On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote:
> > We can't generally require a capability to move tasks between cgroups,
> > as that will break currently intended uses.  I can create two cgroups,
> > chown them to serge, and let serge move between them.
> 
> Sure, then just live with the cgroupfs based permission check.  What
> next?  Should we add CAP_SYS_RESOURCE check to all resource related

That would be the next step, yes.

> controllers?  Moreover, We're headed to unified hierarchy, so in the
> end that means only the user with almost all CAP_* can manipulate
> cgroups at all making the whole thing meaningless.

Why meaningless?  Many caps needed to "do everything", but moving
a task into the freezer and freezing it, or reducing its allowed
memory, would only requiring uid equiv or some limited bit of
privilege.

> I don't think applying fine-grained CAP_* to cgroup controllers makes
> sense or would be useful in any real sense.  We can introduce, say,
> CAP_CGROUP to control access cgroupfs but I think we already have
> enough access control to cgroupfs, don't we?

That's the question :)

I feel like we need a list of the various uses people have in mind,
so we can figure out which ones are supportable...  but I know there
is the whole long thread I've not had time to keep up with, and
many answers are probably there.

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
  2012-11-06 18:02                                       ` Serge Hallyn
  2012-11-06 18:08                                         ` Tejun Heo
@ 2012-11-06 18:08                                         ` Tejun Heo
  1 sibling, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 18:08 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, Aristeu Rozanski

Hello, Serge.

On Tue, Nov 06, 2012 at 12:02:33PM -0600, Serge Hallyn wrote:
> So to be clear, if I want a user to be able to confine his own
> compute-intensive tasks and freeze them, the recommended route will be
> with privileged (setuid-root) helpers?

Something like that.  I think we'll eventually need a policy manager
in userland which controls the whole hierarchy.  Not sure how that
will look at this point tho, but cgroupfs will be an interface to just
expose cgroup configuration directly rather than allow multiplexing
userland / namespace / whatnot multiplexing on top of it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
  2012-11-06 18:02                                       ` Serge Hallyn
@ 2012-11-06 18:08                                         ` Tejun Heo
  2012-11-06 18:08                                         ` Tejun Heo
  1 sibling, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 18:08 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Eric W. Biederman, Aristeu Rozanski,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Hello, Serge.

On Tue, Nov 06, 2012 at 12:02:33PM -0600, Serge Hallyn wrote:
> So to be clear, if I want a user to be able to confine his own
> compute-intensive tasks and freeze them, the recommended route will be
> with privileged (setuid-root) helpers?

Something like that.  I think we'll eventually need a policy manager
in userland which controls the whole hierarchy.  Not sure how that
will look at this point tho, but cgroupfs will be an interface to just
expose cgroup configuration directly rather than allow multiplexing
userland / namespace / whatnot multiplexing on top of it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]                                     ` <20121106174130.GL30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2012-11-06 18:02                                       ` Serge Hallyn
@ 2012-11-06 18:02                                       ` Serge Hallyn
  1 sibling, 0 replies; 37+ messages in thread
From: Serge Hallyn @ 2012-11-06 18:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, Aristeu Rozanski

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Just one more thing.
> 
> On Tue, Nov 06, 2012 at 09:38:23AM -0800, Tejun Heo wrote:
> > Hello,
> > 
> > On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote:
> > > We can't generally require a capability to move tasks between cgroups,
> > > as that will break currently intended uses.  I can create two cgroups,
> > > chown them to serge, and let serge move between them.
> > 
> > Sure, then just live with the cgroupfs based permission check.  What
> > next?  Should we add CAP_SYS_RESOURCE check to all resource related
> > controllers?  Moreover, We're headed to unified hierarchy, so in the
> > end that means only the user with almost all CAP_* can manipulate
> > cgroups at all making the whole thing meaningless.
> 
> As for using cgroup as !root user, I would advise not doing that.
> Again, we're moving toward a unified cgroup hierarchy.  You wouldn't
> be creating multiple cgroup hierarchies and assigning different user
> accesses to them.  Also, I would strongly discourage chowning sub
> directories in cgroupfs and letting non-priviledged users modify them
> directly.

So to be clear, if I want a user to be able to confine his own
compute-intensive tasks and freeze them, the recommended route will be
with privileged (setuid-root) helpers?

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]                                     ` <20121106174130.GL30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2012-11-06 18:02                                       ` Serge Hallyn
  2012-11-06 18:08                                         ` Tejun Heo
  2012-11-06 18:08                                         ` Tejun Heo
  2012-11-06 18:02                                       ` Serge Hallyn
  1 sibling, 2 replies; 37+ messages in thread
From: Serge Hallyn @ 2012-11-06 18:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric W. Biederman, Aristeu Rozanski,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Just one more thing.
> 
> On Tue, Nov 06, 2012 at 09:38:23AM -0800, Tejun Heo wrote:
> > Hello,
> > 
> > On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote:
> > > We can't generally require a capability to move tasks between cgroups,
> > > as that will break currently intended uses.  I can create two cgroups,
> > > chown them to serge, and let serge move between them.
> > 
> > Sure, then just live with the cgroupfs based permission check.  What
> > next?  Should we add CAP_SYS_RESOURCE check to all resource related
> > controllers?  Moreover, We're headed to unified hierarchy, so in the
> > end that means only the user with almost all CAP_* can manipulate
> > cgroups at all making the whole thing meaningless.
> 
> As for using cgroup as !root user, I would advise not doing that.
> Again, we're moving toward a unified cgroup hierarchy.  You wouldn't
> be creating multiple cgroup hierarchies and assigning different user
> accesses to them.  Also, I would strongly discourage chowning sub
> directories in cgroupfs and letting non-priviledged users modify them
> directly.

So to be clear, if I want a user to be able to confine his own
compute-intensive tasks and freeze them, the recommended route will be
with privileged (setuid-root) helpers?

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]                                 ` <20121106173823.GK30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2012-11-06 17:41                                   ` Tejun Heo
       [not found]                                     ` <20121106174130.GL30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2012-11-06 18:12                                   ` Serge Hallyn
  2012-11-06 18:12                                   ` Serge Hallyn
  2 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 17:41 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, Aristeu Rozanski

Just one more thing.

On Tue, Nov 06, 2012 at 09:38:23AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote:
> > We can't generally require a capability to move tasks between cgroups,
> > as that will break currently intended uses.  I can create two cgroups,
> > chown them to serge, and let serge move between them.
> 
> Sure, then just live with the cgroupfs based permission check.  What
> next?  Should we add CAP_SYS_RESOURCE check to all resource related
> controllers?  Moreover, We're headed to unified hierarchy, so in the
> end that means only the user with almost all CAP_* can manipulate
> cgroups at all making the whole thing meaningless.

As for using cgroup as !root user, I would advise not doing that.
Again, we're moving toward a unified cgroup hierarchy.  You wouldn't
be creating multiple cgroup hierarchies and assigning different user
accesses to them.  Also, I would strongly discourage chowning sub
directories in cgroupfs and letting non-priviledged users modify them
directly.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
  2012-11-06 17:31                             ` Serge Hallyn
  2012-11-06 17:38                               ` Tejun Heo
@ 2012-11-06 17:38                               ` Tejun Heo
  1 sibling, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 17:38 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, Aristeu Rozanski

Hello,

On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote:
> We can't generally require a capability to move tasks between cgroups,
> as that will break currently intended uses.  I can create two cgroups,
> chown them to serge, and let serge move between them.

Sure, then just live with the cgroupfs based permission check.  What
next?  Should we add CAP_SYS_RESOURCE check to all resource related
controllers?  Moreover, We're headed to unified hierarchy, so in the
end that means only the user with almost all CAP_* can manipulate
cgroups at all making the whole thing meaningless.

I don't think applying fine-grained CAP_* to cgroup controllers makes
sense or would be useful in any real sense.  We can introduce, say,
CAP_CGROUP to control access cgroupfs but I think we already have
enough access control to cgroupfs, don't we?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
  2012-11-06 17:31                             ` Serge Hallyn
@ 2012-11-06 17:38                               ` Tejun Heo
       [not found]                                 ` <20121106173823.GK30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2012-11-06 17:38                               ` Tejun Heo
  1 sibling, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 17:38 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Eric W. Biederman, Aristeu Rozanski,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Hello,

On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote:
> We can't generally require a capability to move tasks between cgroups,
> as that will break currently intended uses.  I can create two cgroups,
> chown them to serge, and let serge move between them.

Sure, then just live with the cgroupfs based permission check.  What
next?  Should we add CAP_SYS_RESOURCE check to all resource related
controllers?  Moreover, We're headed to unified hierarchy, so in the
end that means only the user with almost all CAP_* can manipulate
cgroups at all making the whole thing meaningless.

I don't think applying fine-grained CAP_* to cgroup controllers makes
sense or would be useful in any real sense.  We can introduce, say,
CAP_CGROUP to control access cgroupfs but I think we already have
enough access control to cgroupfs, don't we?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]                           ` <20121106165246.GF30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2012-11-06 17:31                             ` Serge Hallyn
  2012-11-06 17:31                             ` Serge Hallyn
  1 sibling, 0 replies; 37+ messages in thread
From: Serge Hallyn @ 2012-11-06 17:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, Aristeu Rozanski

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Hello, Eric.
> 
> On Tue, Nov 06, 2012 at 08:10:10AM -0800, Eric W. Biederman wrote:
> > mknod is gated by the vfs with a capability call.
> > 
> > open does not perform the CAP_MKNOD check.
> > 
> > Since the device cgroup prevents opening of device nodes adding
> > permission to access a new device node (update_access) is roughly
> > equivalent to mknod when the device cgroup does not exist.
> 
> I think that's a jump.
> 
> > To preserve the notion that only a privileged user can grant access to
> > device nodes we need a capability check.  Especially since the device
> > cgroup is designed to limit processes with uid == 0.
> > 
> > Without a capability check a process with CAP_DAC_OVERRIDE can go
> > shopping for a device control group that happens to have the device it
> > wants to use.
> > 
> > Similary without a capability check a process with CAP_DAD_OVERRIDE can
> > add or remove any device node into a device control group.
> > 
> > I don't see how the device control group can limit uid == 0 with the
> > device control group without making the operations require a capability
> > you don't give to ever user who has uid == 0.
> 
> devices cgroup adds to restrictions what a group of tasks can do.
> Access to cgroup configuration is gated by cgroup core (currently by
> VFS permissions) and that's it.  I really don't want each controller
> to develop its own permission checks.  If a controller can't live with
> that, it probably shouldn't be a cgroup controller.  So, if you think
> the CAP check is needed for cgroup in general (and can justify it),
> please feel free to move it to cgroup core; otherwise, the CAP check
> is going away from devices and if devices can't live with that, it
> probably shouldn't have been a cgroup controller from the beginning.

We can't generally require a capability to move tasks between cgroups,
as that will break currently intended uses.  I can create two cgroups,
chown them to serge, and let serge move between them.

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]                           ` <20121106165246.GF30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2012-11-06 17:31                             ` Serge Hallyn
@ 2012-11-06 17:31                             ` Serge Hallyn
  2012-11-06 17:38                               ` Tejun Heo
  2012-11-06 17:38                               ` Tejun Heo
  1 sibling, 2 replies; 37+ messages in thread
From: Serge Hallyn @ 2012-11-06 17:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric W. Biederman, Aristeu Rozanski,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Hello, Eric.
> 
> On Tue, Nov 06, 2012 at 08:10:10AM -0800, Eric W. Biederman wrote:
> > mknod is gated by the vfs with a capability call.
> > 
> > open does not perform the CAP_MKNOD check.
> > 
> > Since the device cgroup prevents opening of device nodes adding
> > permission to access a new device node (update_access) is roughly
> > equivalent to mknod when the device cgroup does not exist.
> 
> I think that's a jump.
> 
> > To preserve the notion that only a privileged user can grant access to
> > device nodes we need a capability check.  Especially since the device
> > cgroup is designed to limit processes with uid == 0.
> > 
> > Without a capability check a process with CAP_DAC_OVERRIDE can go
> > shopping for a device control group that happens to have the device it
> > wants to use.
> > 
> > Similary without a capability check a process with CAP_DAD_OVERRIDE can
> > add or remove any device node into a device control group.
> > 
> > I don't see how the device control group can limit uid == 0 with the
> > device control group without making the operations require a capability
> > you don't give to ever user who has uid == 0.
> 
> devices cgroup adds to restrictions what a group of tasks can do.
> Access to cgroup configuration is gated by cgroup core (currently by
> VFS permissions) and that's it.  I really don't want each controller
> to develop its own permission checks.  If a controller can't live with
> that, it probably shouldn't be a cgroup controller.  So, if you think
> the CAP check is needed for cgroup in general (and can justify it),
> please feel free to move it to cgroup core; otherwise, the CAP check
> is going away from devices and if devices can't live with that, it
> probably shouldn't have been a cgroup controller from the beginning.

We can't generally require a capability to move tasks between cgroups,
as that will break currently intended uses.  I can create two cgroups,
chown them to serge, and let serge move between them.

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]                       ` <87sj8mogpp.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2012-11-06 16:52                         ` Tejun Heo
       [not found]                           ` <20121106165246.GF30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 16:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA

Hello, Eric.

On Tue, Nov 06, 2012 at 08:10:10AM -0800, Eric W. Biederman wrote:
> mknod is gated by the vfs with a capability call.
> 
> open does not perform the CAP_MKNOD check.
> 
> Since the device cgroup prevents opening of device nodes adding
> permission to access a new device node (update_access) is roughly
> equivalent to mknod when the device cgroup does not exist.

I think that's a jump.

> To preserve the notion that only a privileged user can grant access to
> device nodes we need a capability check.  Especially since the device
> cgroup is designed to limit processes with uid == 0.
> 
> Without a capability check a process with CAP_DAC_OVERRIDE can go
> shopping for a device control group that happens to have the device it
> wants to use.
> 
> Similary without a capability check a process with CAP_DAD_OVERRIDE can
> add or remove any device node into a device control group.
> 
> I don't see how the device control group can limit uid == 0 with the
> device control group without making the operations require a capability
> you don't give to ever user who has uid == 0.

devices cgroup adds to restrictions what a group of tasks can do.
Access to cgroup configuration is gated by cgroup core (currently by
VFS permissions) and that's it.  I really don't want each controller
to develop its own permission checks.  If a controller can't live with
that, it probably shouldn't be a cgroup controller.  So, if you think
the CAP check is needed for cgroup in general (and can justify it),
please feel free to move it to cgroup core; otherwise, the CAP check
is going away from devices and if devices can't live with that, it
probably shouldn't have been a cgroup controller from the beginning.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]                 ` <20121106154105.GD30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2012-11-06 16:12                   ` Aristeu Rozanski
@ 2012-11-06 16:12                   ` Aristeu Rozanski
  1 sibling, 0 replies; 37+ messages in thread
From: Aristeu Rozanski @ 2012-11-06 16:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue, Nov 06, 2012 at 07:41:05AM -0800, Tejun Heo wrote:
> On Tue, Nov 06, 2012 at 09:30:32AM -0600, Serge Hallyn wrote:
> > > So, you don't really have any actual use case for the explicit CAP_*
> > > checks, right?
> > 
> > No, especially since we will now have user namespaces.
> > 
> > We will want to be able to strictly enforce hierarchical limits - i.e.
> > allow uid 100000 (which is uid 0 in the container) to change cgroup
> > settings, but never exceed limits set on the parent directory.  IIUC
> > you are working toward anyway with the general hierarchy work? (thanks
> > for that).
> 
> Yeah, I'm working toward that but I'm not sure it would mean that
> containers would be able to directly bind mount cgroupfs subdirectory
> and have free reign on it.  Maybe such thing can be made to work but I
> would be much more comfortable having something inbetween for
> impedance matching (in access policies, root cgroup behavior matching,
> whatnot).  So, the functionality will be there but it probably would
> need something inbetween if you wanna give containers control over its
> own cgroup hierarchy.
> 
> There are some issues tho.  As it currently stands, devices cgroup
> inherits configuration rather than enforcing hierarchy while checking
> for access permission.  This means that changes in an ancestor have to
> be propagated downwards and *update* configurations of descendants,
> which is what I'm working on but it can be confusing for someone
> inside the container.  Without breaking compatibility, I don't see any
> other way out tho.  I suppose it's something we'll have to live with.

yes, but most of the changes will happen before the containers start.
I've been working in a patchset (as promised) and it propagates down the
changes and also keep 'local' rules. everytime a parent change
something, the local rules are re-evaluated and reapplied if they're
still valid.

-- 
Aristeu

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]                 ` <20121106154105.GD30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2012-11-06 16:12                   ` Aristeu Rozanski
  2012-11-06 16:12                   ` Aristeu Rozanski
  1 sibling, 0 replies; 37+ messages in thread
From: Aristeu Rozanski @ 2012-11-06 16:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Serge Hallyn, cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman

On Tue, Nov 06, 2012 at 07:41:05AM -0800, Tejun Heo wrote:
> On Tue, Nov 06, 2012 at 09:30:32AM -0600, Serge Hallyn wrote:
> > > So, you don't really have any actual use case for the explicit CAP_*
> > > checks, right?
> > 
> > No, especially since we will now have user namespaces.
> > 
> > We will want to be able to strictly enforce hierarchical limits - i.e.
> > allow uid 100000 (which is uid 0 in the container) to change cgroup
> > settings, but never exceed limits set on the parent directory.  IIUC
> > you are working toward anyway with the general hierarchy work? (thanks
> > for that).
> 
> Yeah, I'm working toward that but I'm not sure it would mean that
> containers would be able to directly bind mount cgroupfs subdirectory
> and have free reign on it.  Maybe such thing can be made to work but I
> would be much more comfortable having something inbetween for
> impedance matching (in access policies, root cgroup behavior matching,
> whatnot).  So, the functionality will be there but it probably would
> need something inbetween if you wanna give containers control over its
> own cgroup hierarchy.
> 
> There are some issues tho.  As it currently stands, devices cgroup
> inherits configuration rather than enforcing hierarchy while checking
> for access permission.  This means that changes in an ancestor have to
> be propagated downwards and *update* configurations of descendants,
> which is what I'm working on but it can be confusing for someone
> inside the container.  Without breaking compatibility, I don't see any
> other way out tho.  I suppose it's something we'll have to live with.

yes, but most of the changes will happen before the containers start.
I've been working in a patchset (as promised) and it propagates down the
changes and also keep 'local' rules. everytime a parent change
something, the local rules are re-evaluated and reapplied if they're
still valid.

-- 
Aristeu

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]                   ` <20121106154320.GE30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2012-11-06 16:10                     ` Eric W. Biederman
@ 2012-11-06 16:10                     ` Eric W. Biederman
  1 sibling, 0 replies; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-06 16:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA

Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:

> Hey, Eric.
>
> On Tue, Nov 06, 2012 at 07:34:07AM -0800, Eric W. Biederman wrote:
>> Having thought about this a little more I can give a definitive answer.
>> 
>> Adding a process to the device control group is equivalent to calling
>> mknod, as it allows that process to open device nodes, or equivalently
>> not open device nodes.  Therefore a capable check is absolutely
>> required.
>> 
>> Without a capability check it would be possible to remove access to
>> /dev/console for a suid root application keeping it from reporting
>> attempts to hack it for example.
>
> You understand that the whole thing is gated by VFS permission check,
> right?  I'm kinda lost what you're talking about.

mknod is gated by the vfs with a capability call.

open does not perform the CAP_MKNOD check.

Since the device cgroup prevents opening of device nodes adding
permission to access a new device node (update_access) is roughly
equivalent to mknod when the device cgroup does not exist.

To preserve the notion that only a privileged user can grant access to
device nodes we need a capability check.  Especially since the device
cgroup is designed to limit processes with uid == 0.

Without a capability check a process with CAP_DAC_OVERRIDE can go
shopping for a device control group that happens to have the device it
wants to use.

Similary without a capability check a process with CAP_DAD_OVERRIDE can
add or remove any device node into a device control group.

I don't see how the device control group can limit uid == 0 with the
device control group without making the operations require a capability
you don't give to ever user who has uid == 0.

>> The generic cgroup check in attach_task_by_pid to see if you can move
>> another process into a cgroup needs to be a capability call and not a
>> test for uid == 0.
>> 
>> static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
>> {
>> 	if (pid) {
>> 		tsk = find_task_by_vpid(pid);
>> 
>> 		/*
>> 		 * even if we're attaching all tasks in the thread group, we
>> 		 * only need to check permissions on one of them.
>> 		 */
>> 		tcred = __task_cred(tsk);
>> 		if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
>>                                         ^^^^^^^^^^^^^^^
>> 		    !uid_eq(cred->euid, tcred->uid) &&
>> 		    !uid_eq(cred->euid, tcred->suid)) {
>> 			rcu_read_unlock();
>> 			ret = -EACCES;
>> 			goto out_unlock_cgroup;
>
> This one isn't gated by VFS so we need to add CAP check to this
> function.  No?

correct.  The check should read something like:

 		tcred = __task_cred(tsk);
 		if (!uid_eq(cred->euid, tcred->uid) &&
 		    !uid_eq(cred->euid, tcred->suid) &&
                    !capable(CAP_SYS_ADMIN)) {
 			rcu_read_unlock();
 			ret = -EACCES;
 			goto out_unlock_cgroup;

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]                   ` <20121106154320.GE30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2012-11-06 16:10                     ` Eric W. Biederman
       [not found]                       ` <87sj8mogpp.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2012-11-06 16:10                     ` Eric W. Biederman
  1 sibling, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-06 16:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Serge Hallyn, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:

> Hey, Eric.
>
> On Tue, Nov 06, 2012 at 07:34:07AM -0800, Eric W. Biederman wrote:
>> Having thought about this a little more I can give a definitive answer.
>> 
>> Adding a process to the device control group is equivalent to calling
>> mknod, as it allows that process to open device nodes, or equivalently
>> not open device nodes.  Therefore a capable check is absolutely
>> required.
>> 
>> Without a capability check it would be possible to remove access to
>> /dev/console for a suid root application keeping it from reporting
>> attempts to hack it for example.
>
> You understand that the whole thing is gated by VFS permission check,
> right?  I'm kinda lost what you're talking about.

mknod is gated by the vfs with a capability call.

open does not perform the CAP_MKNOD check.

Since the device cgroup prevents opening of device nodes adding
permission to access a new device node (update_access) is roughly
equivalent to mknod when the device cgroup does not exist.

To preserve the notion that only a privileged user can grant access to
device nodes we need a capability check.  Especially since the device
cgroup is designed to limit processes with uid == 0.

Without a capability check a process with CAP_DAC_OVERRIDE can go
shopping for a device control group that happens to have the device it
wants to use.

Similary without a capability check a process with CAP_DAD_OVERRIDE can
add or remove any device node into a device control group.

I don't see how the device control group can limit uid == 0 with the
device control group without making the operations require a capability
you don't give to ever user who has uid == 0.

>> The generic cgroup check in attach_task_by_pid to see if you can move
>> another process into a cgroup needs to be a capability call and not a
>> test for uid == 0.
>> 
>> static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
>> {
>> 	if (pid) {
>> 		tsk = find_task_by_vpid(pid);
>> 
>> 		/*
>> 		 * even if we're attaching all tasks in the thread group, we
>> 		 * only need to check permissions on one of them.
>> 		 */
>> 		tcred = __task_cred(tsk);
>> 		if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
>>                                         ^^^^^^^^^^^^^^^
>> 		    !uid_eq(cred->euid, tcred->uid) &&
>> 		    !uid_eq(cred->euid, tcred->suid)) {
>> 			rcu_read_unlock();
>> 			ret = -EACCES;
>> 			goto out_unlock_cgroup;
>
> This one isn't gated by VFS so we need to add CAP check to this
> function.  No?

correct.  The check should read something like:

 		tcred = __task_cred(tsk);
 		if (!uid_eq(cred->euid, tcred->uid) &&
 		    !uid_eq(cred->euid, tcred->suid) &&
                    !capable(CAP_SYS_ADMIN)) {
 			rcu_read_unlock();
 			ret = -EACCES;
 			goto out_unlock_cgroup;

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]               ` <871ug6rbio.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2012-11-06 15:43                 ` Tejun Heo
  2012-11-06 15:45                 ` Serge Hallyn
@ 2012-11-06 15:45                 ` Serge Hallyn
  2 siblings, 0 replies; 37+ messages in thread
From: Serge Hallyn @ 2012-11-06 15:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:
> 
> > Hello, Serge.
> >
> > On Tue, Nov 06, 2012 at 09:01:32AM -0600, Serge Hallyn wrote:
> >> More practically, lacking user namespaces you can create a full (i.e.
> >> ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without
> >> root.  So this allows you to prevent containers from bypassing devices
> >> cgroup restrictions set by the parent.  (In reality we are not using
> >> that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict -
> >> but other distros do.)
> >
> > Do you even mount cgroupfs in containers?  If you just bind-mount
> > cgroupfs verbatim in containers, I don't think that's gonna work very
> > well.  If not, all this doesn't make any difference for containers.
> >
> > So, you don't really have any actual use case for the explicit CAP_*
> > checks, right?
> 
> Having thought about this a little more I can give a definitive answer.
> 
> Adding a process to the device control group is equivalent to calling
> mknod, as it allows that process to open device nodes, or equivalently
> not open device nodes.  Therefore a capable check is absolutely
> required.
> 
> Without a capability check it would be possible to remove access to
> /dev/console for a suid root application keeping it from reporting
> attempts to hack it for example.
> 
> update_access can allow access to previously unaccessible devices
> and so is equivalent to mknod and as such requires a capability call.
> 
> static int devcgroup_update_access(struct dev_cgroup *devcgroup,
> 				   int filetype, const char *buffer)
> {
> ....
> 	if (!capable(CAP_SYS_ADMIN))
> 		return -EPERM;
> 
> 
> Likewise move to a different cgroup can give you a completely different
> set of devices you can use.  And is roughly equivalent to mknod, and
> needs a capability call. 
> 
> static int devcgroup_can_attach(struct cgroup *new_cgrp,
> 				struct cgroup_taskset *set)
> {
> 	struct task_struct *task = cgroup_taskset_first(set);
> 
> 	if (current != task && !capable(CAP_SYS_ADMIN))
> 		return -EPERM;
> 
> 
> The generic cgroup check in attach_task_by_pid to see if you can move
> another process into a cgroup needs to be a capability call and not a
> test for uid == 0.
> 
> static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
> {
> 	if (pid) {
> 		tsk = find_task_by_vpid(pid);
> 
> 		/*
> 		 * even if we're attaching all tasks in the thread group, we
> 		 * only need to check permissions on one of them.
> 		 */
> 		tcred = __task_cred(tsk);
> 		if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
>                                         ^^^^^^^^^^^^^^^
> 		    !uid_eq(cred->euid, tcred->uid) &&
> 		    !uid_eq(cred->euid, tcred->suid)) {
> 			rcu_read_unlock();
> 			ret = -EACCES;
> 			goto out_unlock_cgroup;
> 
> Eric

(full context kept, though long, bc it's all important)

Note that part of the problem is simply that the devices cgroup is serving
as a stand-in for the lack of both user and device namespaces.  If those
both existed, we could get rid of the devices cgroup.  Likewise, the
presence of the devices cgroup makes a device namespace far less
compelling :)  We can play games with bind mounts into /dev and devcgroup
to do most of what we want a devicens for.

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]               ` <871ug6rbio.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2012-11-06 15:43                 ` Tejun Heo
@ 2012-11-06 15:45                 ` Serge Hallyn
  2012-11-06 15:45                 ` Serge Hallyn
  2 siblings, 0 replies; 37+ messages in thread
From: Serge Hallyn @ 2012-11-06 15:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Tejun Heo, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:
> 
> > Hello, Serge.
> >
> > On Tue, Nov 06, 2012 at 09:01:32AM -0600, Serge Hallyn wrote:
> >> More practically, lacking user namespaces you can create a full (i.e.
> >> ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without
> >> root.  So this allows you to prevent containers from bypassing devices
> >> cgroup restrictions set by the parent.  (In reality we are not using
> >> that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict -
> >> but other distros do.)
> >
> > Do you even mount cgroupfs in containers?  If you just bind-mount
> > cgroupfs verbatim in containers, I don't think that's gonna work very
> > well.  If not, all this doesn't make any difference for containers.
> >
> > So, you don't really have any actual use case for the explicit CAP_*
> > checks, right?
> 
> Having thought about this a little more I can give a definitive answer.
> 
> Adding a process to the device control group is equivalent to calling
> mknod, as it allows that process to open device nodes, or equivalently
> not open device nodes.  Therefore a capable check is absolutely
> required.
> 
> Without a capability check it would be possible to remove access to
> /dev/console for a suid root application keeping it from reporting
> attempts to hack it for example.
> 
> update_access can allow access to previously unaccessible devices
> and so is equivalent to mknod and as such requires a capability call.
> 
> static int devcgroup_update_access(struct dev_cgroup *devcgroup,
> 				   int filetype, const char *buffer)
> {
> ....
> 	if (!capable(CAP_SYS_ADMIN))
> 		return -EPERM;
> 
> 
> Likewise move to a different cgroup can give you a completely different
> set of devices you can use.  And is roughly equivalent to mknod, and
> needs a capability call. 
> 
> static int devcgroup_can_attach(struct cgroup *new_cgrp,
> 				struct cgroup_taskset *set)
> {
> 	struct task_struct *task = cgroup_taskset_first(set);
> 
> 	if (current != task && !capable(CAP_SYS_ADMIN))
> 		return -EPERM;
> 
> 
> The generic cgroup check in attach_task_by_pid to see if you can move
> another process into a cgroup needs to be a capability call and not a
> test for uid == 0.
> 
> static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
> {
> 	if (pid) {
> 		tsk = find_task_by_vpid(pid);
> 
> 		/*
> 		 * even if we're attaching all tasks in the thread group, we
> 		 * only need to check permissions on one of them.
> 		 */
> 		tcred = __task_cred(tsk);
> 		if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
>                                         ^^^^^^^^^^^^^^^
> 		    !uid_eq(cred->euid, tcred->uid) &&
> 		    !uid_eq(cred->euid, tcred->suid)) {
> 			rcu_read_unlock();
> 			ret = -EACCES;
> 			goto out_unlock_cgroup;
> 
> Eric

(full context kept, though long, bc it's all important)

Note that part of the problem is simply that the devices cgroup is serving
as a stand-in for the lack of both user and device namespaces.  If those
both existed, we could get rid of the devices cgroup.  Likewise, the
presence of the devices cgroup makes a device namespace far less
compelling :)  We can play games with bind mounts into /dev and devcgroup
to do most of what we want a devicens for.

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]               ` <871ug6rbio.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2012-11-06 15:43                 ` Tejun Heo
       [not found]                   ` <20121106154320.GE30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2012-11-06 15:45                 ` Serge Hallyn
  2012-11-06 15:45                 ` Serge Hallyn
  2 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 15:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA

Hey, Eric.

On Tue, Nov 06, 2012 at 07:34:07AM -0800, Eric W. Biederman wrote:
> Having thought about this a little more I can give a definitive answer.
> 
> Adding a process to the device control group is equivalent to calling
> mknod, as it allows that process to open device nodes, or equivalently
> not open device nodes.  Therefore a capable check is absolutely
> required.
> 
> Without a capability check it would be possible to remove access to
> /dev/console for a suid root application keeping it from reporting
> attempts to hack it for example.

You understand that the whole thing is gated by VFS permission check,
right?  I'm kinda lost what you're talking about.

> The generic cgroup check in attach_task_by_pid to see if you can move
> another process into a cgroup needs to be a capability call and not a
> test for uid == 0.
> 
> static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
> {
> 	if (pid) {
> 		tsk = find_task_by_vpid(pid);
> 
> 		/*
> 		 * even if we're attaching all tasks in the thread group, we
> 		 * only need to check permissions on one of them.
> 		 */
> 		tcred = __task_cred(tsk);
> 		if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
>                                         ^^^^^^^^^^^^^^^
> 		    !uid_eq(cred->euid, tcred->uid) &&
> 		    !uid_eq(cred->euid, tcred->suid)) {
> 			rcu_read_unlock();
> 			ret = -EACCES;
> 			goto out_unlock_cgroup;

This one isn't gated by VFS so we need to add CAP check to this
function.  No?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
  2012-11-06 15:30             ` Serge Hallyn
  2012-11-06 15:41               ` Tejun Heo
@ 2012-11-06 15:41               ` Tejun Heo
  1 sibling, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 15:41 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, Aristeu Rozanski

Hello,

On Tue, Nov 06, 2012 at 09:30:32AM -0600, Serge Hallyn wrote:
> > So, you don't really have any actual use case for the explicit CAP_*
> > checks, right?
> 
> No, especially since we will now have user namespaces.
> 
> We will want to be able to strictly enforce hierarchical limits - i.e.
> allow uid 100000 (which is uid 0 in the container) to change cgroup
> settings, but never exceed limits set on the parent directory.  IIUC
> you are working toward anyway with the general hierarchy work? (thanks
> for that).

Yeah, I'm working toward that but I'm not sure it would mean that
containers would be able to directly bind mount cgroupfs subdirectory
and have free reign on it.  Maybe such thing can be made to work but I
would be much more comfortable having something inbetween for
impedance matching (in access policies, root cgroup behavior matching,
whatnot).  So, the functionality will be there but it probably would
need something inbetween if you wanna give containers control over its
own cgroup hierarchy.

There are some issues tho.  As it currently stands, devices cgroup
inherits configuration rather than enforcing hierarchy while checking
for access permission.  This means that changes in an ancestor have to
be propagated downwards and *update* configurations of descendants,
which is what I'm working on but it can be confusing for someone
inside the container.  Without breaking compatibility, I don't see any
other way out tho.  I suppose it's something we'll have to live with.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
  2012-11-06 15:30             ` Serge Hallyn
@ 2012-11-06 15:41               ` Tejun Heo
       [not found]                 ` <20121106154105.GD30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2012-11-06 15:41               ` Tejun Heo
  1 sibling, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 15:41 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Eric W. Biederman, Aristeu Rozanski,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Hello,

On Tue, Nov 06, 2012 at 09:30:32AM -0600, Serge Hallyn wrote:
> > So, you don't really have any actual use case for the explicit CAP_*
> > checks, right?
> 
> No, especially since we will now have user namespaces.
> 
> We will want to be able to strictly enforce hierarchical limits - i.e.
> allow uid 100000 (which is uid 0 in the container) to change cgroup
> settings, but never exceed limits set on the parent directory.  IIUC
> you are working toward anyway with the general hierarchy work? (thanks
> for that).

Yeah, I'm working toward that but I'm not sure it would mean that
containers would be able to directly bind mount cgroupfs subdirectory
and have free reign on it.  Maybe such thing can be made to work but I
would be much more comfortable having something inbetween for
impedance matching (in access policies, root cgroup behavior matching,
whatnot).  So, the functionality will be there but it probably would
need something inbetween if you wanna give containers control over its
own cgroup hierarchy.

There are some issues tho.  As it currently stands, devices cgroup
inherits configuration rather than enforcing hierarchy while checking
for access permission.  This means that changes in an ancestor have to
be propagated downwards and *update* configurations of descendants,
which is what I'm working on but it can be confusing for someone
inside the container.  Without breaking compatibility, I don't see any
other way out tho.  I suppose it's something we'll have to live with.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]           ` <20121106150639.GB30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
                               ` (2 preceding siblings ...)
  2012-11-06 15:34             ` Eric W. Biederman
@ 2012-11-06 15:34             ` Eric W. Biederman
  3 siblings, 0 replies; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-06 15:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA

Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:

> Hello, Serge.
>
> On Tue, Nov 06, 2012 at 09:01:32AM -0600, Serge Hallyn wrote:
>> More practically, lacking user namespaces you can create a full (i.e.
>> ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without
>> root.  So this allows you to prevent containers from bypassing devices
>> cgroup restrictions set by the parent.  (In reality we are not using
>> that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict -
>> but other distros do.)
>
> Do you even mount cgroupfs in containers?  If you just bind-mount
> cgroupfs verbatim in containers, I don't think that's gonna work very
> well.  If not, all this doesn't make any difference for containers.
>
> So, you don't really have any actual use case for the explicit CAP_*
> checks, right?

Having thought about this a little more I can give a definitive answer.

Adding a process to the device control group is equivalent to calling
mknod, as it allows that process to open device nodes, or equivalently
not open device nodes.  Therefore a capable check is absolutely
required.

Without a capability check it would be possible to remove access to
/dev/console for a suid root application keeping it from reporting
attempts to hack it for example.

update_access can allow access to previously unaccessible devices
and so is equivalent to mknod and as such requires a capability call.

static int devcgroup_update_access(struct dev_cgroup *devcgroup,
				   int filetype, const char *buffer)
{
....
	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;


Likewise move to a different cgroup can give you a completely different
set of devices you can use.  And is roughly equivalent to mknod, and
needs a capability call. 

static int devcgroup_can_attach(struct cgroup *new_cgrp,
				struct cgroup_taskset *set)
{
	struct task_struct *task = cgroup_taskset_first(set);

	if (current != task && !capable(CAP_SYS_ADMIN))
		return -EPERM;


The generic cgroup check in attach_task_by_pid to see if you can move
another process into a cgroup needs to be a capability call and not a
test for uid == 0.

static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
{
	if (pid) {
		tsk = find_task_by_vpid(pid);

		/*
		 * even if we're attaching all tasks in the thread group, we
		 * only need to check permissions on one of them.
		 */
		tcred = __task_cred(tsk);
		if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
                                        ^^^^^^^^^^^^^^^
		    !uid_eq(cred->euid, tcred->uid) &&
		    !uid_eq(cred->euid, tcred->suid)) {
			rcu_read_unlock();
			ret = -EACCES;
			goto out_unlock_cgroup;

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]           ` <20121106150639.GB30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2012-11-06 15:30             ` Serge Hallyn
  2012-11-06 15:30             ` Serge Hallyn
@ 2012-11-06 15:34             ` Eric W. Biederman
       [not found]               ` <871ug6rbio.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2012-11-06 15:34             ` Eric W. Biederman
  3 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-06 15:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Serge Hallyn, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:

> Hello, Serge.
>
> On Tue, Nov 06, 2012 at 09:01:32AM -0600, Serge Hallyn wrote:
>> More practically, lacking user namespaces you can create a full (i.e.
>> ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without
>> root.  So this allows you to prevent containers from bypassing devices
>> cgroup restrictions set by the parent.  (In reality we are not using
>> that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict -
>> but other distros do.)
>
> Do you even mount cgroupfs in containers?  If you just bind-mount
> cgroupfs verbatim in containers, I don't think that's gonna work very
> well.  If not, all this doesn't make any difference for containers.
>
> So, you don't really have any actual use case for the explicit CAP_*
> checks, right?

Having thought about this a little more I can give a definitive answer.

Adding a process to the device control group is equivalent to calling
mknod, as it allows that process to open device nodes, or equivalently
not open device nodes.  Therefore a capable check is absolutely
required.

Without a capability check it would be possible to remove access to
/dev/console for a suid root application keeping it from reporting
attempts to hack it for example.

update_access can allow access to previously unaccessible devices
and so is equivalent to mknod and as such requires a capability call.

static int devcgroup_update_access(struct dev_cgroup *devcgroup,
				   int filetype, const char *buffer)
{
....
	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;


Likewise move to a different cgroup can give you a completely different
set of devices you can use.  And is roughly equivalent to mknod, and
needs a capability call. 

static int devcgroup_can_attach(struct cgroup *new_cgrp,
				struct cgroup_taskset *set)
{
	struct task_struct *task = cgroup_taskset_first(set);

	if (current != task && !capable(CAP_SYS_ADMIN))
		return -EPERM;


The generic cgroup check in attach_task_by_pid to see if you can move
another process into a cgroup needs to be a capability call and not a
test for uid == 0.

static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
{
	if (pid) {
		tsk = find_task_by_vpid(pid);

		/*
		 * even if we're attaching all tasks in the thread group, we
		 * only need to check permissions on one of them.
		 */
		tcred = __task_cred(tsk);
		if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
                                        ^^^^^^^^^^^^^^^
		    !uid_eq(cred->euid, tcred->uid) &&
		    !uid_eq(cred->euid, tcred->suid)) {
			rcu_read_unlock();
			ret = -EACCES;
			goto out_unlock_cgroup;

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]           ` <20121106150639.GB30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2012-11-06 15:30             ` Serge Hallyn
@ 2012-11-06 15:30             ` Serge Hallyn
  2012-11-06 15:34             ` Eric W. Biederman
  2012-11-06 15:34             ` Eric W. Biederman
  3 siblings, 0 replies; 37+ messages in thread
From: Serge Hallyn @ 2012-11-06 15:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, Aristeu Rozanski

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Hello, Serge.
> 
> On Tue, Nov 06, 2012 at 09:01:32AM -0600, Serge Hallyn wrote:
> > More practically, lacking user namespaces you can create a full (i.e.
> > ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without
> > root.  So this allows you to prevent containers from bypassing devices
> > cgroup restrictions set by the parent.  (In reality we are not using
> > that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict -
> > but other distros do.)
> 
> Do you even mount cgroupfs in containers?  If you just bind-mount
> cgroupfs verbatim in containers, I don't think that's gonna work very
> well.  If not, all this doesn't make any difference for containers.

I don't know if those who restrict CAP_SYS_ADMIN do so or not.  We by
default do not.

(It's not relevant for this discussion as again we use apparmor to deny
writes, but we *do* optionally bind mount cgroups into the containers,
mounting /sys/fs/cgroup/$cgroup/lxc/$containername/$containername.real
on the host to /sys/fs/cgroup/$cgroup in the container for each cgroup.)

> So, you don't really have any actual use case for the explicit CAP_*
> checks, right?

No, especially since we will now have user namespaces.

We will want to be able to strictly enforce hierarchical limits - i.e.
allow uid 100000 (which is uid 0 in the container) to change cgroup
settings, but never exceed limits set on the parent directory.  IIUC
you are working toward anyway with the general hierarchy work? (thanks
for that).

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]           ` <20121106150639.GB30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2012-11-06 15:30             ` Serge Hallyn
  2012-11-06 15:41               ` Tejun Heo
  2012-11-06 15:41               ` Tejun Heo
  2012-11-06 15:30             ` Serge Hallyn
                               ` (2 subsequent siblings)
  3 siblings, 2 replies; 37+ messages in thread
From: Serge Hallyn @ 2012-11-06 15:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric W. Biederman, Aristeu Rozanski,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Hello, Serge.
> 
> On Tue, Nov 06, 2012 at 09:01:32AM -0600, Serge Hallyn wrote:
> > More practically, lacking user namespaces you can create a full (i.e.
> > ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without
> > root.  So this allows you to prevent containers from bypassing devices
> > cgroup restrictions set by the parent.  (In reality we are not using
> > that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict -
> > but other distros do.)
> 
> Do you even mount cgroupfs in containers?  If you just bind-mount
> cgroupfs verbatim in containers, I don't think that's gonna work very
> well.  If not, all this doesn't make any difference for containers.

I don't know if those who restrict CAP_SYS_ADMIN do so or not.  We by
default do not.

(It's not relevant for this discussion as again we use apparmor to deny
writes, but we *do* optionally bind mount cgroups into the containers,
mounting /sys/fs/cgroup/$cgroup/lxc/$containername/$containername.real
on the host to /sys/fs/cgroup/$cgroup in the container for each cgroup.)

> So, you don't really have any actual use case for the explicit CAP_*
> checks, right?

No, especially since we will now have user namespaces.

We will want to be able to strictly enforce hierarchical limits - i.e.
allow uid 100000 (which is uid 0 in the container) to change cgroup
settings, but never exceed limits set on the parent directory.  IIUC
you are working toward anyway with the general hierarchy work? (thanks
for that).

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
  2012-11-06 15:01       ` Serge Hallyn
@ 2012-11-06 15:06         ` Tejun Heo
  2012-11-06 15:06         ` Tejun Heo
  1 sibling, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 15:06 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, Aristeu Rozanski

Hello, Serge.

On Tue, Nov 06, 2012 at 09:01:32AM -0600, Serge Hallyn wrote:
> More practically, lacking user namespaces you can create a full (i.e.
> ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without
> root.  So this allows you to prevent containers from bypassing devices
> cgroup restrictions set by the parent.  (In reality we are not using
> that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict -
> but other distros do.)

Do you even mount cgroupfs in containers?  If you just bind-mount
cgroupfs verbatim in containers, I don't think that's gonna work very
well.  If not, all this doesn't make any difference for containers.

So, you don't really have any actual use case for the explicit CAP_*
checks, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
  2012-11-06 15:01       ` Serge Hallyn
  2012-11-06 15:06         ` Tejun Heo
@ 2012-11-06 15:06         ` Tejun Heo
       [not found]           ` <20121106150639.GB30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  1 sibling, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 15:06 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Eric W. Biederman, Aristeu Rozanski,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Hello, Serge.

On Tue, Nov 06, 2012 at 09:01:32AM -0600, Serge Hallyn wrote:
> More practically, lacking user namespaces you can create a full (i.e.
> ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without
> root.  So this allows you to prevent containers from bypassing devices
> cgroup restrictions set by the parent.  (In reality we are not using
> that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict -
> but other distros do.)

Do you even mount cgroupfs in containers?  If you just bind-mount
cgroupfs verbatim in containers, I don't think that's gonna work very
well.  If not, all this doesn't make any difference for containers.

So, you don't really have any actual use case for the explicit CAP_*
checks, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]     ` <877gpzrlir.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                         ` (2 preceding siblings ...)
  2012-11-06 15:01       ` Serge Hallyn
@ 2012-11-06 15:01       ` Serge Hallyn
  3 siblings, 0 replies; 37+ messages in thread
From: Serge Hallyn @ 2012-11-06 15:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:
> 
> > Hello, guys.
> >
> > Why doesn't it follow the usual security enforced by cgroupfs
> > permissions?  Why is the explicit check necessary?
> 
> An almost more interesting question is why is cgroup one of the last
> pieces of code not using capabilities and instead lets you attach to any
> process simply if your uid == 0.
> 
> I don't know the history but the device cgroup testing for CAP_SYS_ADMIN
> makes a naive sort of sense to me.

Right, it's possible to run a daemon with uid 0 but restricted
capabilities (enforced by bounding set) for all its children, but
not possible to run a non-uid-0 daemon with some capabilities
across execs.  (Of course that can be worked around with privileged
helpers or to an extent inherited capabilities.)  Capabilities make
sense for the cgroup controls.

In other words, any code which equates uid 0 with posession of a
capability is taking away from the usefulness of capabilities
everywhere.

More practically, lacking user namespaces you can create a full (i.e.
ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without
root.  So this allows you to prevent containers from bypassing devices
cgroup restrictions set by the parent.  (In reality we are not using
that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict -
but other distros do.)

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]     ` <877gpzrlir.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2012-11-06 14:48       ` Tejun Heo
  2012-11-06 14:48       ` Tejun Heo
@ 2012-11-06 15:01       ` Serge Hallyn
  2012-11-06 15:06         ` Tejun Heo
  2012-11-06 15:06         ` Tejun Heo
  2012-11-06 15:01       ` Serge Hallyn
  3 siblings, 2 replies; 37+ messages in thread
From: Serge Hallyn @ 2012-11-06 15:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Tejun Heo, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:
> 
> > Hello, guys.
> >
> > Why doesn't it follow the usual security enforced by cgroupfs
> > permissions?  Why is the explicit check necessary?
> 
> An almost more interesting question is why is cgroup one of the last
> pieces of code not using capabilities and instead lets you attach to any
> process simply if your uid == 0.
> 
> I don't know the history but the device cgroup testing for CAP_SYS_ADMIN
> makes a naive sort of sense to me.

Right, it's possible to run a daemon with uid 0 but restricted
capabilities (enforced by bounding set) for all its children, but
not possible to run a non-uid-0 daemon with some capabilities
across execs.  (Of course that can be worked around with privileged
helpers or to an extent inherited capabilities.)  Capabilities make
sense for the cgroup controls.

In other words, any code which equates uid 0 with posession of a
capability is taking away from the usefulness of capabilities
everywhere.

More practically, lacking user namespaces you can create a full (i.e.
ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without
root.  So this allows you to prevent containers from bypassing devices
cgroup restrictions set by the parent.  (In reality we are not using
that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict -
but other distros do.)

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]     ` <877gpzrlir.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2012-11-06 14:48       ` Tejun Heo
@ 2012-11-06 14:48       ` Tejun Heo
  2012-11-06 15:01       ` Serge Hallyn
  2012-11-06 15:01       ` Serge Hallyn
  3 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 14:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA

Hey, Eric.

On Tue, Nov 06, 2012 at 03:58:04AM -0800, Eric W. Biederman wrote:
> > Why doesn't it follow the usual security enforced by cgroupfs
> > permissions?  Why is the explicit check necessary?
> 
> An almost more interesting question is why is cgroup one of the last
> pieces of code not using capabilities and instead lets you attach to any
> process simply if your uid == 0.

Because it has a filesystem as interface and most things w/ file
system interface depend on VFS for access policies.

> I don't know the history but the device cgroup testing for CAP_SYS_ADMIN
> makes a naive sort of sense to me.

If some different CAP_* is needed for cgroup (but why?), the right
thing to do is enforcing it uniformly from cgroup core instead of
doing it from individual controllers.  If there's no actual good
reason to keep this, I'll write up a patch to remove it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found]     ` <877gpzrlir.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2012-11-06 14:48       ` Tejun Heo
  2012-11-06 14:48       ` Tejun Heo
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-11-06 14:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aristeu Rozanski, Serge E. Hallyn,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Hey, Eric.

On Tue, Nov 06, 2012 at 03:58:04AM -0800, Eric W. Biederman wrote:
> > Why doesn't it follow the usual security enforced by cgroupfs
> > permissions?  Why is the explicit check necessary?
> 
> An almost more interesting question is why is cgroup one of the last
> pieces of code not using capabilities and instead lets you attach to any
> process simply if your uid == 0.

Because it has a filesystem as interface and most things w/ file
system interface depend on VFS for access policies.

> I don't know the history but the device cgroup testing for CAP_SYS_ADMIN
> makes a naive sort of sense to me.

If some different CAP_* is needed for cgroup (but why?), the right
thing to do is enforcing it uniformly from cgroup core instead of
doing it from individual controllers.  If there's no actual good
reason to keep this, I'll write up a patch to remove it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
       [not found] ` <20121106023845.GI19354-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2012-11-06 11:58   ` Eric W. Biederman
       [not found]     ` <877gpzrlir.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2012-11-06 11:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA

Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:

> Hello, guys.
>
> Why doesn't it follow the usual security enforced by cgroupfs
> permissions?  Why is the explicit check necessary?

An almost more interesting question is why is cgroup one of the last
pieces of code not using capabilities and instead lets you attach to any
process simply if your uid == 0.

I don't know the history but the device cgroup testing for CAP_SYS_ADMIN
makes a naive sort of sense to me.

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
@ 2012-11-06  2:38 Tejun Heo
       [not found] ` <20121106023845.GI19354-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-11-06  2:38 UTC (permalink / raw)
  To: Aristeu Rozanski, Serge E. Hallyn
  Cc: Li Zefan, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Hello, guys.

Why doesn't it follow the usual security enforced by cgroupfs
permissions?  Why is the explicit check necessary?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2012-11-06 18:25 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-06  2:38 Why does devices cgroup check for CAP_SYS_ADMIN explicitly? Tejun Heo
2012-11-06  2:38 Tejun Heo
     [not found] ` <20121106023845.GI19354-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 11:58   ` Eric W. Biederman
     [not found]     ` <877gpzrlir.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2012-11-06 14:48       ` Tejun Heo
2012-11-06 14:48       ` Tejun Heo
2012-11-06 15:01       ` Serge Hallyn
2012-11-06 15:06         ` Tejun Heo
2012-11-06 15:06         ` Tejun Heo
     [not found]           ` <20121106150639.GB30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 15:30             ` Serge Hallyn
2012-11-06 15:41               ` Tejun Heo
     [not found]                 ` <20121106154105.GD30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 16:12                   ` Aristeu Rozanski
2012-11-06 16:12                   ` Aristeu Rozanski
2012-11-06 15:41               ` Tejun Heo
2012-11-06 15:30             ` Serge Hallyn
2012-11-06 15:34             ` Eric W. Biederman
     [not found]               ` <871ug6rbio.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2012-11-06 15:43                 ` Tejun Heo
     [not found]                   ` <20121106154320.GE30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 16:10                     ` Eric W. Biederman
     [not found]                       ` <87sj8mogpp.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2012-11-06 16:52                         ` Tejun Heo
     [not found]                           ` <20121106165246.GF30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 17:31                             ` Serge Hallyn
2012-11-06 17:31                             ` Serge Hallyn
2012-11-06 17:38                               ` Tejun Heo
     [not found]                                 ` <20121106173823.GK30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 17:41                                   ` Tejun Heo
     [not found]                                     ` <20121106174130.GL30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 18:02                                       ` Serge Hallyn
2012-11-06 18:08                                         ` Tejun Heo
2012-11-06 18:08                                         ` Tejun Heo
2012-11-06 18:02                                       ` Serge Hallyn
2012-11-06 18:12                                   ` Serge Hallyn
2012-11-06 18:16                                     ` Tejun Heo
     [not found]                                       ` <20121106181623.GO30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 18:25                                         ` Serge Hallyn
2012-11-06 18:16                                     ` Tejun Heo
2012-11-06 18:12                                   ` Serge Hallyn
2012-11-06 17:38                               ` Tejun Heo
2012-11-06 16:10                     ` Eric W. Biederman
2012-11-06 15:45                 ` Serge Hallyn
2012-11-06 15:45                 ` Serge Hallyn
2012-11-06 15:34             ` Eric W. Biederman
2012-11-06 15:01       ` Serge Hallyn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.